[jira] [Updated] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiro Suzuki updated HBASE-21201:
-
Release Note: 
We can specify peerQuorumAddress instead of peerId in the VerifyReplication tool, 
so it no longer requires a peerId to be set up when using this tool.

For example:
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
zk1,zk2,zk3:2181/hbase testTable


  was:
We can specify peerQuorumAddress instead of peerId in VerifyReplication tool. 
So it no longer requires peerId to be setup when using this tool.

For example:
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase testTable



> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.3.0, 2.1.4
>
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One tool readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this use case and could cause unintended 
> consequences, as the clusters aren't really replication peers, nor do we want 
> them to be.
> Looking at the code, the tool only needs the clusterKey, which is essentially 
> the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>     return Pair.newPair(peerConfig,
>         ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
>
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster's 
> ZooKeeper quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it without 
> depending on a replication peerId, similar to peerFSAddress. There are certain 
> advantages in doing so:
>  * Reduce the development/maintenance of a separate tool for the above scenario
>  * Allow the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA with the original tableA in the same/remote 
> cluster (in case of user error) before restoring a snapshot to the original 
> table, to find the records that are missing, invalid, or need to be added
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding a peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]
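
A minimal sketch of the mechanism described above (an illustration, not the 
committed patch): once a quorum address is supplied on the command line, the peer 
cluster Configuration can be built from it directly with the same 
HBaseConfiguration.createClusterConf call quoted from ReplicationUtils, with no 
peerId lookup at all. The peerQuorumAddress parameter name comes from the 
proposal; the surrounding class and method are assumptions for illustration.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: derive the peer cluster Configuration directly from the quorum
// address passed as an argument (the --peerQuorumAddress value above),
// instead of resolving a replication peerId first.
public final class PeerConfFromQuorumSketch {

  static Configuration peerClusterConf(Configuration baseConf, String peerQuorumAddress)
      throws IOException {
    // createClusterConf overlays the given cluster key (quorum, client port,
    // znode parent) on top of the local configuration.
    return HBaseConfiguration.createClusterConf(baseConf, peerQuorumAddress);
  }

  public static void main(String[] args) throws IOException {
    Configuration peerConf = peerClusterConf(HBaseConfiguration.create(), args[0]);
    System.out.println(peerConf.get("hbase.zookeeper.quorum"));
  }
}
{code}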



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiro Suzuki updated HBASE-21201:
-
Release Note: 
We can specify peerQuorumAddress instead of peerId in the VerifyReplication tool, 
so it no longer requires a peerId to be set up when using this tool.

For example:
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase testTable


> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.3.0, 2.1.4
>
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One tool readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this use case and could cause unintended 
> consequences, as the clusters aren't really replication peers, nor do we want 
> them to be.
> Looking at the code, the tool only needs the clusterKey, which is essentially 
> the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>     return Pair.newPair(peerConfig,
>         ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
>
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster's 
> ZooKeeper quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it without 
> depending on a replication peerId, similar to peerFSAddress. There are certain 
> advantages in doing so:
>  * Reduce the development/maintenance of a separate tool for the above scenario
>  * Allow the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA with the original tableA in the same/remote 
> cluster (in case of user error) before restoring a snapshot to the original 
> table, to find the records that are missing, invalid, or need to be added
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding a peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764332#comment-16764332
 ] 

Nihal Jain commented on HBASE-21636:


Thank you [~stack] Sir. BTW, I am unable to see this in the commit log of master. 
Did you miss it?

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5, 2.3.0
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of {{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  
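
For reference, a minimal sketch (an illustration, not the shell implementation) 
of the Java client Scan settings that the listed shell options roughly map to; 
the class name and the concrete values are made up, while the setters are the 
standard org.apache.hadoop.hbase.client.Scan API:

{code:java}
import org.apache.hadoop.hbase.client.IsolationLevel;
import org.apache.hadoop.hbase.client.Scan;

// Sketch of the scanner specifications listed above, expressed with the
// Java client API that the shell scan command ultimately drives.
public final class ScanSpecSketch {
  static Scan buildScan() {
    Scan scan = new Scan();
    scan.setReadType(Scan.ReadType.STREAM);                  // ReadType
    scan.setIsolationLevel(IsolationLevel.READ_UNCOMMITTED); // IsolationLevel
    scan.setReplicaId(1);                                    // region replica id
    scan.setAllowPartialResults(true);                       // allow partial results
    scan.setBatch(10);                                       // batch (cells per Result)
    scan.setMaxResultSize(2 * 1024 * 1024);                  // max result size, in bytes
    scan.setLimit(100);                                      // cap on rows returned
    return scan;
  }
}
{code}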



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764325#comment-16764325
 ] 

Toshihiro Suzuki commented on HBASE-21201:
--

Pushed to master, branch-2, branch-2.1 and branch-2.2.

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One tool readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this use case and could cause unintended 
> consequences, as the clusters aren't really replication peers, nor do we want 
> them to be.
> Looking at the code, the tool only needs the clusterKey, which is essentially 
> the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>     return Pair.newPair(peerConfig,
>         ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
>
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster's 
> ZooKeeper quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it without 
> depending on a replication peerId, similar to peerFSAddress. There are certain 
> advantages in doing so:
>  * Reduce the development/maintenance of a separate tool for the above scenario
>  * Allow the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA with the original tableA in the same/remote 
> cluster (in case of user error) before restoring a snapshot to the original 
> table, to find the records that are missing, invalid, or need to be added
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding a peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiro Suzuki updated HBASE-21201:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.3.0, 2.1.4
>
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One tool readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this use case and could cause unintended 
> consequences, as the clusters aren't really replication peers, nor do we want 
> them to be.
> Looking at the code, the tool only needs the clusterKey, which is essentially 
> the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>     return Pair.newPair(peerConfig,
>         ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
>
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster's 
> ZooKeeper quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it without 
> depending on a replication peerId, similar to peerFSAddress. There are certain 
> advantages in doing so:
>  * Reduce the development/maintenance of a separate tool for the above scenario
>  * Allow the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA with the original tableA in the same/remote 
> cluster (in case of user error) before restoring a snapshot to the original 
> table, to find the records that are missing, invalid, or need to be added
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding a peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiro Suzuki updated HBASE-21201:
-
Fix Version/s: 2.1.4
   2.3.0
   2.2.0
   3.0.0

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.3.0, 2.1.4
>
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One tool readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this use case and could cause unintended 
> consequences, as the clusters aren't really replication peers, nor do we want 
> them to be.
> Looking at the code, the tool only needs the clusterKey, which is essentially 
> the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>     return Pair.newPair(peerConfig,
>         ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
>
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster's 
> ZooKeeper quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it without 
> depending on a replication peerId, similar to peerFSAddress. There are certain 
> advantages in doing so:
>  * Reduce the development/maintenance of a separate tool for the above scenario
>  * Allow the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA with the original tableA in the same/remote 
> cluster (in case of user error) before restoring a snapshot to the original 
> table, to find the records that are missing, invalid, or need to be added
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding a peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated HBASE-21866:
-
Description: 
Using the latest HBase master branch, the bug can be reproduced as follows:
 # create 't1', 'cf1'
 # create 't1', 'cf1'

The following message is logged in HMaster's log:
{code}
INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
{code}
This is a wrong action; we should keep t1 where it originally is.

  was:
By using the latest HBase master branch, the bug could be re-produced as:
 # Create 't1', 'cf1'
 # Create 't1', 'cf1' again

The following message is logged into HMaster's log:
{code}
INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
{code}
This is a wrong action that we should keep t1 as where it originally is.


> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # create 't1', 'cf1'
>  # create 't1', 'cf1'
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21585) Remove ClusterConnection

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764317#comment-16764317
 ] 

stack commented on HBASE-21585:
---

[~Apache9] notes on rb

> Remove ClusterConnection
> 
>
> Key: HBASE-21585
> URL: https://issues.apache.org/jira/browse/HBASE-21585
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Attachments: HBASE-21585-HBASE-21512-v1.patch, 
> HBASE-21585-HBASE-21512-v2.patch, HBASE-21585-HBASE-21512-v3.patch, 
> HBASE-21585-HBASE-21512-v4.patch, HBASE-21585-HBASE-21512.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764313#comment-16764313
 ] 

Xiang Li edited comment on HBASE-21866 at 2/10/19 6:28 AM:
---

The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}
The postDeleteTable() calls RSGroupAdminEndpoint#postDeleteTable() to move the 
table to null rsgroup.

I have no idea how to fix it yet. Maybe we could make it not call 
cpHost.postDeleteTable() under some conditions? [~xucang], [~yuzhih...@gmail.com]

[~Apache9], CreateTableProcedure#executeFromState() returns Flow.NO_MORE_STATE 
when the table exists, as follows:
{code}
case CREATE_TABLE_PRE_OPERATION:
  // Verify if we can create the table
  boolean exists = !prepareCreate(env);
  releaseSyncLatch();

  if (exists) {
assert isFailed() : "the delete should have an exception here";
return Flow.NO_MORE_STATE;
  }
{code}
Is it expected that rollback is performed on NO_MORE_STATE? I am new to 
this code.


was (Author: water):
The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}
The postDeleteTable() calls RSGroupAdminEndpoint#postDeleteTable() to move the 
table to null rsgroup.

I have no idea how to fix it yet. We might make it not to call 
cpHost.postDeleteTable() for some conditions? [~xucang][~yuzhih...@gmail.com]

> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764313#comment-16764313
 ] 

Xiang Li edited comment on HBASE-21866 at 2/10/19 6:23 AM:
---

The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}
The postDeleteTable() calls RSGroupAdminEndpoint#postDeleteTable() to move the 
table to null rsgroup.

I have no idea how to fix it yet. Maybe we could make it not call 
cpHost.postDeleteTable() under some conditions? [~xucang] [~yuzhih...@gmail.com]


was (Author: water):
The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}
The postDeleteTable() calls RSGroupAdminEndpoint#postDeleteTable() to move the 
table to null rsgroup.

I have no idea how to fix it yet. 

> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764313#comment-16764313
 ] 

Xiang Li edited comment on HBASE-21866 at 2/10/19 6:21 AM:
---

The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}
The postDeleteTable() calls RSGroupAdminEndpoint#postDeleteTable() to move the 
table to null rsgroup.

I have no idea how to fix it yet. 


was (Author: water):
The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}
The postDeleteTable() calls RSGroupAdminEndpoint#postDeleteTable() to move the 
table to null rsgroup.

> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764314#comment-16764314
 ] 

Duo Zhang commented on HBASE-21844:
---

OK, so the broken procedures on restart are the issue here.

Could you please check the logs for the orphan procedures? An OpenRegionProcedure 
can only be a sub-procedure of a TRSP; was the TRSP finished before restarting?

> Master could get stuck in initializing state while waiting for meta
> ---
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
>  Issue Type: Bug
>  Components: master, meta
>Affects Versions: 3.0.0
>Reporter: Bahram Chehrazy
>Assignee: Bahram Chehrazy
>Priority: Major
> Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight 
> chance of the master getting into a state where ZK says meta is OPEN, but the 
> server is dead and there is no active SCP to recover it (perhaps the SCP has 
> aborted and the procWALs were corrupted). In this case waitForMetaOnline never 
> returns.
>  
> We've seen this happen a few times when there had been a temporary HDFS 
> outage. The following log lines show this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227, 
> server=*,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764313#comment-16764313
 ] 

Xiang Li edited comment on HBASE-21866 at 2/10/19 6:12 AM:
---

The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}
The postDeleteTable() calls RSGroupAdminEndpoint#postDeleteTable() to move the 
table to null rsgroup.


was (Author: water):
The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}

> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764313#comment-16764313
 ] 

Xiang Li commented on HBASE-21866:
--

The following code might be the cause:
{code:title=hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/CreateTableProcedure.java|borderStyle=solid}
  @Override
  protected void rollbackState(final MasterProcedureEnv env, final 
CreateTableState state)
  throws IOException {
if (state == CreateTableState.CREATE_TABLE_PRE_OPERATION) {
  // nothing to rollback, pre-create is just table-state checks.
  // We can fail if the table does exist or the descriptor is malformed.
  // TODO: coprocessor rollback semantic is still undefined.
  DeleteTableProcedure.deleteTableStates(env, getTableName());

  final MasterCoprocessorHost cpHost = env.getMasterCoprocessorHost();
  if (cpHost != null) {
cpHost.postDeleteTable(getTableName());
  }

  releaseSyncLatch();
  return;
}
{code}

> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated HBASE-21866:
-
Component/s: (was: proc-v2)

> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764312#comment-16764312
 ] 

Duo Zhang commented on HBASE-21788:
---

Have you checked the log at rs side? Did meta actually open on the rs?

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> 
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: stack
>Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where a region is stuck in OPENING and, after a master 
> restart, the RIT is recovered and stays WAITING; its OpenRegionProcedure (also 
> recovered) is stuck in Runnable and never does anything for hours. I cannot 
> find logs on the target server indicating that it ever tried to do anything 
> after the master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764309#comment-16764309
 ] 

stack commented on HBASE-21844:
---

Corruption will mess us up (You fellows on a stock HDFS?).

> Master could get stuck in initializing state while waiting for meta
> ---
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
>  Issue Type: Bug
>  Components: master, meta
>Affects Versions: 3.0.0
>Reporter: Bahram Chehrazy
>Assignee: Bahram Chehrazy
>Priority: Major
> Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight 
> chance of the master getting into a state where ZK says meta is OPEN, but the 
> server is dead and there is no active SCP to recover it (perhaps the SCP has 
> aborted and the procWALs were corrupted). In this case waitForMetaOnline never 
> returns.
>  
> We've seen this happen a few times when there had been a temporary HDFS 
> outage. The following log lines show this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227, 
> server=*,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated HBASE-21866:
-
Description: 
Using the latest HBase master branch, the bug can be reproduced as follows:
 # Create 't1', 'cf1'
 # Create 't1', 'cf1' again

The following message is logged in HMaster's log:
{code}
INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
{code}
This is a wrong action; we should keep t1 where it originally is.

  was:
By using the latest HBase master branch, the bug could be re-produced as:
 # Create 't1', 'cf1'
 # Create 't1', 'cf1' again

The following message is logged into HMaster's log:
{code:java}
INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup 
null{code}


> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2, rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}
> This is a wrong action; we should keep t1 where it originally is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated HBASE-21866:
-
Description: 
Using the latest HBase master branch, the bug can be reproduced as follows:
 # Create 't1', 'cf1'
 # Create 't1', 'cf1' again

The following message is logged in HMaster's log:
{code:java}
INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
{code}

> Move the table to null rsgroup when creating a existing table
> -
>
> Key: HBASE-21866
> URL: https://issues.apache.org/jira/browse/HBASE-21866
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2, rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Using the latest HBase master branch, the bug can be reproduced as follows:
>  # Create 't1', 'cf1'
>  # Create 't1', 'cf1' again
> The following message is logged in HMaster's log:
> {code:java}
> INFO  [PEWorker-12] rsgroup.RSGroupAdminServer: Moving table t1 to RSGroup null
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21866) Move the table to null rsgroup when creating a existing table

2019-02-09 Thread Xiang Li (JIRA)
Xiang Li created HBASE-21866:


 Summary: Move the table to null rsgroup when creating a existing 
table
 Key: HBASE-21866
 URL: https://issues.apache.org/jira/browse/HBASE-21866
 Project: HBase
  Issue Type: Bug
  Components: proc-v2, rsgroup
Reporter: Xiang Li
Assignee: Xiang Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764301#comment-16764301
 ] 

stack commented on HBASE-21788:
---

24hrs, 47.379sec is interesting. Did you check the locks and procedures page in 
the UI? It can help with figuring out procedure state and who has locks on what. 
Would have been good to see if pid=32701 was still around or what state it was in.

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> 
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: stack
>Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where a region is stuck in OPENING and, after a master 
> restart, the RIT is recovered and stays WAITING; its OpenRegionProcedure (also 
> recovered) is stuck in Runnable and never does anything for hours. I cannot 
> find logs on the target server indicating that it ever tried to do anything 
> after the master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

2019-02-09 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack reassigned HBASE-21788:
-

Assignee: stack

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> 
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: stack
>Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where a region is stuck in OPENING and, after a master 
> restart, the RIT is recovered and stays WAITING; its OpenRegionProcedure (also 
> recovered) is stuck in Runnable and never does anything for hours. I cannot 
> find logs on the target server indicating that it ever tried to do anything 
> after the master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764294#comment-16764294
 ] 

stack commented on HBASE-21862:
---

[~Apache9] Thanks. I don't see where that is done. I see this for 
regionServerReport:
{code}
  @VisibleForTesting
  public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDeadException {
    checkIsDead(sn, "REPORT");
    if (null == this.onlineServers.replace(sn, sl)) {
      // Already have this host+port combo and its just different start code?
      // Just let the server in. Presume master joining a running cluster.
      // recordNewServer is what happens at the end of reportServerStartup.
      // The only thing we are skipping is passing back to the regionserver
      // the ServerName to use. Here we presume a master has already done
      // that so we'll press on with whatever it gave us for ServerName.
      if (!checkAndRecordNewServer(sn, sl)) {
        LOG.info("RegionServerReport ignored, could not record the server: " + sn);
        return; // Not recorded, so no need to move on
      }
    }
    updateLastFlushedSequenceIds(sn, sl);
  }
{code}

Pardon my being dumb

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the 
> RS actually processes it fine. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result, the region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix, though, would require one of the following:
> 1) on an unknown failure from open, ascertaining the state of the region from 
> the server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's in some 
> queue or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
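
A minimal sketch of the behaviour the issue title asks for (an illustration, not 
the actual IPCUtil code): connection-level exceptions keep their original type, so 
callers can still tell a CallTimeoutException, as in the log below, apart from a 
generic wrapped IOException. The class and method names here are assumptions.

{code:java}
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

import org.apache.hadoop.hbase.ipc.CallTimeoutException;

// Sketch: return connection-level failures with their concrete type, and only
// wrap everything else, adding the server address for context.
public final class WrapExceptionSketch {
  static IOException wrap(String serverAddress, IOException error) {
    if (error instanceof CallTimeoutException
        || error instanceof ConnectException
        || error instanceof SocketTimeoutException) {
      return error; // preserve the original connection-exception type
    }
    return new IOException(
        "Call to " + serverAddress + " failed on local exception: " + error, error);
  }
}
{code}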
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6
> at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)
> at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6
> at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)
>  

[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764297#comment-16764297
 ] 

stack commented on HBASE-21636:
---

Pushed on all branches but 2.1 until Duo is done w/ the RC. Leaving open till 
then.

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5, 2.3.0
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of {{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-21636:
--
Fix Version/s: 2.0.5
   2.2.0
   3.0.0

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of {{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-21636:
--
Fix Version/s: 2.3.0

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5, 2.3.0
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of {{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764291#comment-16764291
 ] 

Duo Zhang commented on HBASE-21862:
---

In regionServerReport we check each reported region against the AssignmentManager's records to see whether it is actually supposed to be on that RS. In the past, if there was a difference we would kill the RS; we have since changed the behavior to only log a warning, because the kill could take out a good region server.
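
As a rough illustration of that check (editor-added sketch; these class and method names are hypothetical and do not match the real AssignmentManager code):

{code:java}
// Hypothetical sketch of the regionServerReport consistency check described above.
import java.util.HashSet;
import java.util.Set;

class ReportCheckSketch {
  /**
   * Return the regions the RS reported but the master does not have recorded on it.
   * Current behavior is to WARN about these rather than kill the RS.
   */
  static Set<String> findUnexpectedRegions(Set<String> reportedRegions,
      Set<String> regionsRecordedOnServer) {
    Set<String> unexpected = new HashSet<>(reportedRegions);
    unexpected.removeAll(regionsRecordedOnServer);
    return unexpected;
  }
}
{code}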

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764289#comment-16764289
 ] 

stack commented on HBASE-21862:
---

bq. This is removed IIRC. The problem here is that, there is no fence between 
regionServerReport and reportTransition, so there could be race and master will 
kill a good RS, only because the reportTransition is faster than the 
regionServerReport.

regionServerReport doesn't seem to do anything related to AMv2 (not looking at git history...), so I'm not sure how we have a race here. [~sershe] talks about a similar 'absent' functionality... so I'm missing something here, or mis-remembering.

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> 

[jira] [Updated] (HBASE-21819) Generate CHANGES.md and RELEASENOTES.md for 2.1.3

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21819:
--
Status: Patch Available  (was: Reopened)

> Generate CHANGES.md and RELEASENOTES.md for 2.1.3
> -
>
> Key: HBASE-21819
> URL: https://issues.apache.org/jira/browse/HBASE-21819
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, release
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.3
>
> Attachments: HBASE-21819-branch-2.1-addendum-v1.patch, 
> HBASE-21819-branch-2.1-addendum-v2.patch, 
> HBASE-21819-branch-2.1-addendum-v3.patch, 
> HBASE-21819-branch-2.1-addendum.patch, HBASE-21819-branch-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21819) Generate CHANGES.md and RELEASENOTES.md for 2.1.3

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764282#comment-16764282
 ] 

Hadoop QA commented on HBASE-21819:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} branch-2.1 Compile Tests {color} ||
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
11s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}  0m 40s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:42ca976 |
| JIRA Issue | HBASE-21819 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958151/HBASE-21819-branch-2.1-addendum-v3.patch
 |
| Optional Tests |  dupname  asflicense  |
| uname | Linux 719477462ad5 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | branch-2.1 / 54f9afba04 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Max. process+thread count | 48 (vs. ulimit of 1) |
| modules | C: . U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15913/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Generate CHANGES.md and RELEASENOTES.md for 2.1.3
> -
>
> Key: HBASE-21819
> URL: https://issues.apache.org/jira/browse/HBASE-21819
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, release
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.3
>
> Attachments: HBASE-21819-branch-2.1-addendum-v1.patch, 
> HBASE-21819-branch-2.1-addendum-v2.patch, 
> HBASE-21819-branch-2.1-addendum-v3.patch, 
> HBASE-21819-branch-2.1-addendum.patch, HBASE-21819-branch-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-21819) Generate CHANGES.md and RELEASENOTES.md for 2.1.3

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang reopened HBASE-21819:
---

> Generate CHANGES.md and RELEASENOTES.md for 2.1.3
> -
>
> Key: HBASE-21819
> URL: https://issues.apache.org/jira/browse/HBASE-21819
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, release
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.3
>
> Attachments: HBASE-21819-branch-2.1-addendum-v1.patch, 
> HBASE-21819-branch-2.1-addendum-v2.patch, 
> HBASE-21819-branch-2.1-addendum-v3.patch, 
> HBASE-21819-branch-2.1-addendum.patch, HBASE-21819-branch-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21819) Generate CHANGES.md and RELEASENOTES.md for 2.1.3

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21819:
--
Attachment: HBASE-21819-branch-2.1-addendum-v3.patch

> Generate CHANGES.md and RELEASENOTES.md for 2.1.3
> -
>
> Key: HBASE-21819
> URL: https://issues.apache.org/jira/browse/HBASE-21819
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, release
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.3
>
> Attachments: HBASE-21819-branch-2.1-addendum-v1.patch, 
> HBASE-21819-branch-2.1-addendum-v2.patch, 
> HBASE-21819-branch-2.1-addendum-v3.patch, 
> HBASE-21819-branch-2.1-addendum.patch, HBASE-21819-branch-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21819) Generate CHANGES.md and RELEASENOTES.md for 2.1.3

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21819:
--
Attachment: HBASE-21819-branch-2.1-addendum-v3.patch

> Generate CHANGES.md and RELEASENOTES.md for 2.1.3
> -
>
> Key: HBASE-21819
> URL: https://issues.apache.org/jira/browse/HBASE-21819
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, release
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.3
>
> Attachments: HBASE-21819-branch-2.1-addendum-v1.patch, 
> HBASE-21819-branch-2.1-addendum-v2.patch, 
> HBASE-21819-branch-2.1-addendum-v3.patch, 
> HBASE-21819-branch-2.1-addendum.patch, HBASE-21819-branch-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21819) Generate CHANGES.md and RELEASENOTES.md for 2.1.3

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21819:
--
Attachment: (was: HBASE-21819-branch-2.1-addendum-v3.patch)

> Generate CHANGES.md and RELEASENOTES.md for 2.1.3
> -
>
> Key: HBASE-21819
> URL: https://issues.apache.org/jira/browse/HBASE-21819
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, release
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.3
>
> Attachments: HBASE-21819-branch-2.1-addendum-v1.patch, 
> HBASE-21819-branch-2.1-addendum-v2.patch, 
> HBASE-21819-branch-2.1-addendum-v3.patch, 
> HBASE-21819-branch-2.1-addendum.patch, HBASE-21819-branch-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Pushed to branch-2.0+. Thanks all for analyzing and reviewing.
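
For readers following along, a minimal sketch (editor-added, assuming only standard JDK exception types; this is NOT the committed HBASE-21862 patch) of what "keeping the original exception types" means when wrapping connection-level failures:

{code:java}
// Sketch only: return the same concrete exception type that was thrown, so callers'
// instanceof checks on connection exceptions keep working after wrapping.
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

final class WrapExceptionSketch {
  static IOException wrap(String remoteAddr, IOException error) {
    String msg = "Call to " + remoteAddr + " failed on local exception: " + error;
    if (error instanceof ConnectException) {
      return (IOException) new ConnectException(msg).initCause(error);
    }
    if (error instanceof SocketTimeoutException) {
      return (IOException) new SocketTimeoutException(msg).initCause(error);
    }
    return new IOException(msg, error); // fallback: generic wrapper
  }
}
{code}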

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764275#comment-16764275
 ] 

Duo Zhang commented on HBASE-21862:
---

Let me commit.

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764274#comment-16764274
 ] 

Hadoop QA commented on HBASE-21862:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
38s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
1s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
34s{color} | {color:green} hbase-client: The patch generated 0 new + 3 
unchanged - 3 fixed = 3 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
34s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 51s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
4s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 9s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 38m  4s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21862 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958149/HBASE-21862-v2.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 2e459befa684 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15912/testReport/ |
| Max. process+thread count | 259 (vs. ulimit of 1) |
| modules | C: hbase-client U: hbase-client |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15912/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.

[jira] [Created] (HBASE-21865) Put up 2.1.3RC1

2019-02-09 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-21865:
-

 Summary: Put up 2.1.3RC1
 Key: HBASE-21865
 URL: https://issues.apache.org/jira/browse/HBASE-21865
 Project: HBase
  Issue Type: Sub-task
  Components: release
Reporter: Duo Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Attachment: HBASE-21862-v2.patch

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764268#comment-16764268
 ] 

Duo Zhang commented on HBASE-21862:
---

{quote}
I thought the Master was doing this now. It is supposed to kill the RS that 
disagrees with its map of where Regions are deployed. Did that not happen here?
{quote}

This was removed, IIRC. The problem here is that there is no fence between regionServerReport and reportTransition, so there can be a race where the master kills a good RS only because the reportTransition arrives faster than the regionServerReport.

And the reason I keep repeating this is that people usually assume we wait for the result in RSProcedureDispatcher, but that is not the case... We do not do expensive operations in executeProcedures, so the problem here is not the same as what we hit in 1.x, where an openRegion RPC call could time out because of a slow region open...

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> 

[jira] [Commented] (HBASE-21864) add region state version and reinstate YouAreDead exception in region report

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764256#comment-16764256
 ] 

stack commented on HBASE-21864:
---

Trying to understand your scenario [~sershe]:

Is 'send report' the regular heartbeat from RS to M that includes the currently open regions? Or is it something else?

Where is the "You shouldn't have R1, die" code? I can't see it in regionServerReport. Is it somewhere else?





> add region state version and reinstate YouAreDead exception in region report
> 
>
> Key: HBASE-21864
> URL: https://issues.apache.org/jira/browse/HBASE-21864
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> The state version will ensure we don't have network-related races  (e.g. the 
> one I reported in some other bug -
> {code}
> RS: send report {R1} ...
> M: close R1
> RS: I closed R1
> M ... receive report {R1}
> M: you shouldn't have R1, die
> {code}).
> Then we can revert the change that removed the YouAreDead exception... An RS in an 
> incorrect state should be either brought into the correct state or killed, because 
> it means there's some bug; right now, if double assignment happens (I found 2 
> different cases just this week ;)), the master lets the RS with the incorrect 
> assignment keep it forever.
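
A toy sketch of the state-version idea proposed above (editor-added, purely illustrative; none of these names exist in HBase):

{code:java}
// Purely illustrative: a report carrying an older state version than the master's
// current version for that region would be ignored instead of triggering a kill.
class VersionedReportSketch {
  static boolean isStale(long reportedStateVersion, long masterStateVersion) {
    // The RS built its report before the master's latest transition; drop it.
    return reportedStateVersion < masterStateVersion;
  }
}
{code}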



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764255#comment-16764255
 ] 

Hadoop QA commented on HBASE-21636:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2.0 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
51s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
11s{color} | {color:green} branch-2.0 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
44s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} rubocop {color} | {color:red}  0m 
15s{color} | {color:red} The patch generated 39 new + 386 unchanged - 24 fixed 
= 425 total (was 410) {color} |
| {color:orange}-0{color} | {color:orange} ruby-lint {color} | {color:orange}  
0m 16s{color} | {color:orange} The patch generated 42 new + 623 unchanged - 4 
fixed = 665 total (was 627) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  6m 
47s{color} | {color:green} hbase-shell in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
11s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 14m 16s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-21636 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958147/HBASE-21636.branch-2.0.001.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  rubocop  
ruby_lint  |
| uname | Linux 0d098023e79d 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | branch-2.0 / 9ef7c00c34 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| rubocop | v0.60.0 |
| rubocop | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15911/artifact/patchprocess/diff-patch-rubocop.txt
 |
| ruby-lint | v2.3.1 |
| ruby-lint | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15911/artifact/patchprocess/diff-patch-ruby-lint.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15911/testReport/ |
| Max. process+thread count | 2228 (vs. ulimit of 1) |
| modules | C: hbase-shell U: hbase-shell |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15911/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA

[jira] [Updated] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-21636:
--
Attachment: HBASE-21636.branch-2.0.001.patch

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764250#comment-16764250
 ] 

stack commented on HBASE-21636:
---

[~nihaljain.cs] Ok sir. I fixed some of the rubocop complaints in the attached 
patch. Will push to branch-2.0+ (Ok by you [~Apache9]? I won't mess up your 
RC'ing?).

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764245#comment-16764245
 ] 

Nihal Jain commented on HBASE-21636:


But I think it would make basic testing of read types, replica scans, etc. easier, 
so we may consider putting this into the branch-2.0 and branch-2.1 releases later?

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764241#comment-16764241
 ] 

Nihal Jain commented on HBASE-21636:


{quote}Where should it go? Back to branch-2.0?
{quote}
Yes, it can go to 2.0+.
{quote}There are RCs being rolled on branch-2.0 and branch-2.1.
{quote}
Um, it should be fine to push it to 2.2+. 

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764237#comment-16764237
 ] 

stack commented on HBASE-21636:
---

[~nihaljain.cs] Pardon my not digesting the description. Let me push this. 
Where should it go? Back to branch-2.0?

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764237#comment-16764237
 ] 

stack edited comment on HBASE-21636 at 2/9/19 8:54 PM:
---

[~nihaljain.cs] Pardon my not digesting the description. Let me push this. 
Where should it go? Back to branch-2.0?

Or maybe, since it's an enhancement, just branch-2.2+? There are RCs being rolled 
on branch-2.0 and branch-2.1. 


was (Author: stack):
[~nihaljain.cs] Pardon my not digesting the description. Let me push this. 
Where should it go? Back to branch-2.0?

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21863) narrow down the double-assignment race window

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764234#comment-16764234
 ] 

stack commented on HBASE-21863:
---

I took a look at the patch. On the second part, where you have the rpc deadline: yeah, 
we've avoided going this route up to now, as it compounds the possible states 
we'd have to deal with and ups the possibility of a double assign, a condition 
rare in amv2 (though you seem to have found a case, but it looks like there was a 
bug fixed over in HBASE-21862).

On this bit of the patch:

  LOG.warn("Received report {} transition from {} for {}, pid={} but the region is not on it,"
    + " killing RS", TransitionCode.OPENED, serverName, regionNode, getProcId());
  // We may be killing an innocent RS due to some network race condition (to fix that, we'd
  // need HBASE-21864). However, that is relatively harmless compared to HBASE-21862.
  // Play it safe and assume we could have a double-assignment situation.
  // Note that we don't do it in regular RS report, because races there are much more frequent.
  throw new YouAreDeadException("Potentially double-assigning " + regionNode);

.. I think a version of this makes sense. Not sure if TRSP is the place for it, 
as we may get the message though there is no waiting TRSP. I'd throw something other 
than a YADE, perhaps a more specific subclass, since YADE has up to now had only 
one usage. I'd also wait till we had an instance of a report from a RS that had 
an unaccounted Region opening... I'd like to know how it comes about first 
before building the handling.

Thanks.
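
(Illustration only, not from the patch: a rough sketch of the "more specific subclass" 
idea above, assuming YouAreDeadException is subclassable and keeps its usual 
String-message constructor; the class name and message are made up.)

{code:java}
import org.apache.hadoop.hbase.YouAreDeadException;

// Hypothetical sketch for the HBASE-21863 discussion: a narrower exception type so callers
// can tell "killed to avoid a possible double assignment" apart from the existing YADE usage.
public class PotentialDoubleAssignmentException extends YouAreDeadException {

  public PotentialDoubleAssignmentException(String message) {
    super(message);
  }
}
{code}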

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764233#comment-16764233
 ] 

Nihal Jain commented on HBASE-21636:


Limit added in HBASE-17045.

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764229#comment-16764229
 ] 

stack commented on HBASE-21636:
---

Patch LGTM. Why remove this bit below?

if limit > 0 && count >= limit
  # If we reached the limit, exit before the next call to hasNext
  break
end

Thanks.

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764232#comment-16764232
 ] 

Nihal Jain commented on HBASE-21636:


See HBASE-13721

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764231#comment-16764231
 ] 

Nihal Jain commented on HBASE-21636:


Thanks for the review [~stack]. I think that logic was added when hbase did not 
have a way to limit the number of rows to be returned. Now we can achieve the same 
using scan.setLimit, hence I made this change. I mentioned this in the issue 
description too.
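
(For reference, a minimal Java sketch of the client-side API the shell now leans on; 
the values below are made up for illustration and this is not code from the patch.)

{code:java}
import org.apache.hadoop.hbase.client.Scan;

public class ScanLimitExample {
  public static void main(String[] args) {
    // The Scan object carries the row limit itself via setLimit, so the shell no longer
    // needs its own "break when count >= limit" check around the scanner loop.
    Scan scan = new Scan();
    scan.setLimit(10);                        // cap on rows returned to the client
    scan.setReadType(Scan.ReadType.PREAD);    // one of the specs the enhanced shell exposes
    scan.setAllowPartialResults(true);
    scan.setMaxResultSize(2L * 1024 * 1024);  // max bytes per RPC round trip
    System.out.println("limit=" + scan.getLimit());
  }
}
{code}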

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20053) Remove .cmake file extension from .gitignore

2019-02-09 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764228#comment-16764228
 ] 

Hudson commented on HBASE-20053:


Results for branch HBASE-20053
[build #7 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/7/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/7//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/7//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/7//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Remove .cmake file extension from .gitignore
> 
>
> Key: HBASE-20053
> URL: https://issues.apache.org/jira/browse/HBASE-20053
> Project: HBase
>  Issue Type: Sub-task
>  Components: build, community
>Affects Versions: HBASE-14850
>Reporter: Ted Yu
>Assignee: Norbert Kalmar
>Priority: Minor
>  Labels: build
> Fix For: HBASE-14850
>
> Attachments: HBASE-20053-HBASE-14850.v001.patch
>
>
> There are .cmake files under hbase-native-client/cmake/ which are under 
> source control.
> The .cmake extension should be taken out of hbase-native-client/.gitignore



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764227#comment-16764227
 ] 

stack commented on HBASE-21862:
---

Oh, looks like you changed the subject for this issue already [~Apache9], so we 
could apply the patch against this JIRA. OK by you [~sershe]? Can we make new 
issues for the other stuff discussed in here?

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764214#comment-16764214
 ] 

stack commented on HBASE-21862:
---

bq. As a result region is opened on two servers.
bq. 2) Don't ignore invalid RS state, kill it (YouAreDead exception)

I thought the Master was doing this now. It is supposed to kill the RS that 
disagrees with its map of where Regions are deployed. Did that not happen here?

On the protocol, versioned state (in meta?) could be of use, yes.

On the 1., 2., 3. suggestions for beefing up the exchange, let's not go there 
(yet?). Our 'design' is the dumb one that [~Apache9] keeps repeating above. He 
seems to have figured out the root cause of why the 'design' failed in this case.

Patch looks great. Should it be a subtask? We could make a subtask with a simple 
objective; this issue seems more of a high-level discussion. Good one, Duo.

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> 

[jira] [Commented] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764209#comment-16764209
 ] 

Hadoop QA commented on HBASE-21478:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange}  
0m  0s{color} | {color:orange} The patch doesn't appear to include any new or 
modified tests. Please justify why no new tests are needed for this patch. Also 
please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
41s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
55s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
48s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
33s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
53s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 30s{color} 
| {color:red} hbase-common generated 1 new + 42 unchanged - 0 fixed = 43 total 
(was 42) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} rubocop {color} | {color:green}  0m  
8s{color} | {color:green} There were no new rubocop issues. {color} |
| {color:green}+1{color} | {color:green} ruby-lint {color} | {color:green}  0m  
1s{color} | {color:green} There were no new ruby-lint issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
35s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 54s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
30s{color} | {color:green} hbase-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}132m 
21s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  7m  7s{color} 
| {color:red} hbase-shell in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 
 6s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}190m 22s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestReplicationShell |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764199#comment-16764199
 ] 

Hadoop QA commented on HBASE-21862:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 8s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
12s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
16s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} hbase-client: The patch generated 0 new + 3 
unchanged - 3 fixed = 3 total (was 6) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 8s{color} | {color:green} The patch passed checkstyle in hbase-server {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
12s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 41s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
14s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}129m 12s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
53s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}176m 33s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hbase.replication.TestSyncReplicationStandbyKillRS |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21862 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958134/HBASE-21862-forUT.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux e65ef0e121ba 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| 

[jira] [Updated] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2019-02-09 Thread Xiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated HBASE-21478:
-
Status: Patch Available  (was: Open)

> Make table sorted when displaying rsgroup info in shell and master web UI
> -
>
> Key: HBASE-21478
> URL: https://issues.apache.org/jira/browse/HBASE-21478
> Project: HBase
>  Issue Type: Improvement
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Minor
> Attachments: HBASE-21478.master.000.patch
>
>
> Regarding the output of the command "get_rsgroup" in the hbase shell, or the 
> "Server Group" section of the HMaster web UI, the tables are not sorted, so 
> they are not easy to read, like:
> {code}
> hbase(main):003:0> get_rsgroup 'default'
> GROUP INFORMATION
> ...
> Tables:
> table3
> ns2:table22
> table1
> ns1:table11
> ...
> {code}
> They could be sorted in the order of namespace then table name:
> {code}
> table1
> table3
> ns1:table11
> ns2:table22
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764178#comment-16764178
 ] 

Xiang Li edited comment on HBASE-21478 at 2/9/19 4:22 PM:
--

Without the patch:
{code:java}
hbase(main):002:0> list_rsgroups
NAME     SERVER / TABLE
 default server ip-10-23-10-9.ec2.internal:16020
         table ns1:t2
         table hbase:meta
         table t1
         table hbase:rsgroup
1 row(s)
Took 0.1516 seconds

hbase(main):003:0> get_rsgroup 'default'
SERVERS
ip-10-23-10-9.ec2.internal:16020
1 row(s)
TABLES
ns1:t2
hbase:meta
t1
hbase:rsgroup
4 row(s)
Took 0.0160 seconds
{code}
With the patch:
{code:java}
hbase(main):004:0> list_rsgroups
NAME     SERVER / TABLE
 default server ip-10-23-10-9.ec2.internal:16020
         table hbase:meta

[jira] [Comment Edited] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764178#comment-16764178
 ] 

Xiang Li edited comment on HBASE-21478 at 2/9/19 4:21 PM:
--

Without the patch:
{code}
hbase(main):004:0> list_rsgroups
NAME     SERVER / TABLE
 default server ip-10-23-10-9.ec2.internal:16020
         table hbase:meta
         table hbase:rsgroup
         table t1
         table ns1:t2
1 row(s)
Took 0.1633 seconds

hbase(main):006:0> get_rsgroup 'default'
SERVERS
ip-10-23-10-9.ec2.internal:16020
1 row(s)
TABLES
hbase:meta
hbase:rsgroup
t1
ns1:t2
4 row(s)
{code}


was (Author: water):
Without the patch:
{code}
hbase(main):004:0> list_rsgroups
NAME     SERVER / TABLE
 default server ip-10-23-10-9.ec2.internal:16020
         table hbase:meta

[jira] [Commented] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764178#comment-16764178
 ] 

Xiang Li commented on HBASE-21478:
--

Without the patch:
{code}
hbase(main):004:0> list_rsgroups
NAME     SERVER / TABLE
 default server ip-10-23-10-9.ec2.internal:16020
         table hbase:meta
         table hbase:rsgroup
         table t1
         table ns1:t2
1 row(s)
Took 0.1633 seconds

hbase(main):005:0> get_group 'default'
NoMethodError: undefined method `get_group' for main:Object
Did you mean?  get_rsgroup

hbase(main):006:0> get_rsgroup 'default'
SERVERS
ip-10-23-10-9.ec2.internal:16020
1 row(s)
TABLES
hbase:meta
hbase:rsgroup
t1
ns1:t2
4 row(s)
{code}

> Make table sorted when displaying rsgroup info in shell and master web UI
> -
>
> Key: HBASE-21478
> URL: https://issues.apache.org/jira/browse/HBASE-21478
> Project: HBase
>  Issue Type: Improvement
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Minor
> Attachments: HBASE-21478.master.000.patch
>
>
> Regarding the output of the command "get_rsgroup" in the hbase shell, or the 
> "Server Group" section of the HMaster web UI, the tables are not 

[jira] [Updated] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2019-02-09 Thread Xiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated HBASE-21478:
-
Attachment: HBASE-21478.master.000.patch

> Make table sorted when displaying rsgroup info in shell and master web UI
> -
>
> Key: HBASE-21478
> URL: https://issues.apache.org/jira/browse/HBASE-21478
> Project: HBase
>  Issue Type: Improvement
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Minor
> Attachments: HBASE-21478.master.000.patch
>
>
> Regarding the output of the command "get_rsgroup" in the hbase shell, or the 
> "Server Group" section of the HMaster web UI, the tables are not sorted, so 
> they are not easy to read, like:
> {code}
> hbase(main):003:0> get_rsgroup 'default'
> GROUP INFORMATION
> ...
> Tables:
> table3
> ns2:table22
> table1
> ns1:table11
> ...
> {code}
> They could be sorted in the order of namespace then table name:
> {code}
> table1
> table3
> ns1:table11
> ns2:table22
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2019-02-09 Thread Xiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764177#comment-16764177
 ] 

Xiang Li commented on HBASE-21478:
--

Hi [~yuzhih...@gmail.com]
 I uploaded the very first patch v000 to demonstrate my idea.
 * Add a new function called "getTablesForDisplayInOrder()" to RSGroupInfo.
 ** Whenever it is called, the new private member "tablesForDisplayInOrder" is 
refreshed according to "tables".
 ** The comparator compares namespace then qualifier, and puts the system 
tables in front of the tables in other namespaces (a rough sketch of such a 
comparator follows at the end of this comment).
 ** "tablesForDisplayInOrder" is not allocated in the constructor but the 
first time getTablesForDisplayInOrder() is called, to save memory.
 * For the HBase shell, get_rsgroup.rb and list_rsgroups.rb are updated to use 
getTablesForDisplayInOrder() instead of getTables().
 * For the HMaster web UI, rsgroup.jsp is updated to use 
getTablesForDisplayInOrder() instead of getTables() and sort(). Without the 
change, the list is already sorted by namespace then qualifier, but it does not 
put the system tables first; I made this change to keep the web UI consistent 
with the shell.

Would you please help to review the patch at your convenience? I am trying to 
add some new UTs.
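
(Illustration only: a rough sketch of the ordering described above, written against 
the public TableName API; the class and constant names are made up, and this is not 
the RSGroupInfo code from the patch.)

{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.hadoop.hbase.TableName;

public class TableDisplayOrderSketch {

  // System tables first, then by namespace, then by qualifier.
  static final Comparator<TableName> DISPLAY_ORDER =
      Comparator.<TableName, Boolean>comparing(tn -> !tn.isSystemTable())
          .thenComparing(TableName::getNamespaceAsString)
          .thenComparing(TableName::getQualifierAsString);

  public static void main(String[] args) {
    List<TableName> sorted = Stream.of("ns1:t2", "hbase:meta", "t1", "hbase:rsgroup")
        .map(TableName::valueOf)
        .sorted(DISPLAY_ORDER)
        .collect(Collectors.toList());
    // Prints hbase:meta, hbase:rsgroup, t1, ns1:t2 for the example tables above.
    sorted.forEach(tn -> System.out.println(tn.getNameAsString()));
  }
}
{code}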

> Make table sorted when displaying rsgroup info in shell and master web UI
> -
>
> Key: HBASE-21478
> URL: https://issues.apache.org/jira/browse/HBASE-21478
> Project: HBase
>  Issue Type: Improvement
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Minor
>
> Regarding the output of the command "get_rsgroup" in the hbase shell, or the 
> "Server Group" section of the HMaster web UI, the tables are not sorted, so 
> they are not easy to read, like:
> {code}
> hbase(main):003:0> get_rsgroup 'default'
> GROUP INFORMATION
> ...
> Tables:
> table3
> ns2:table22
> table1
> ns1:table11
> ...
> {code}
> They could be sorted in the order of namespace then table name:
> {code}
> table1
> table3
> ns1:table11
> ns2:table22
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Attachment: HBASE-21862-forUT.patch

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> 

[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Attachment: (was: HBASE-21862-forUT.patch)

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764162#comment-16764162
 ] 

Hadoop QA commented on HBASE-21862:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
46s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
34s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
57s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} hbase-client: The patch generated 0 new + 3 
unchanged - 3 fixed = 3 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
35s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 40s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
3s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 9s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 37m 18s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21862 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958133/HBASE-21862-forUT.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 7ae90122e653 4.4.0-139-generic #165~14.04.1-Ubuntu SMP Wed Oct 
31 10:55:11 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15908/testReport/ |
| Max. process+thread count | 264 (vs. ulimit of 1) |
| modules | C: hbase-client U: hbase-client |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15908/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |



[jira] [Commented] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764161#comment-16764161
 ] 

Toshihiro Suzuki commented on HBASE-21201:
--

It looks like the javac warnings flagged in the last QA run are all for the 
TestTableInputFormat class:
{code}
[WARNING] 
/testptch/hbase/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormat.java:[182,63]
 [DefaultCharset] Implicit use of the platform default charset, which can 
result in e.g. non-ASCII characters being silently replaced with '?' in many 
environments
[WARNING] 
/testptch/hbase/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormat.java:[188,39]
 [DefaultCharset] Implicit use of the platform default charset, which can 
result in e.g. non-ASCII characters being silently replaced with '?' in many 
environments
[WARNING] 
/testptch/hbase/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormat.java:[188,63]
 [DefaultCharset] Implicit use of the platform default charset, which can 
result in e.g. non-ASCII characters being silently replaced with '?' in many 
environments
[WARNING] 
/testptch/hbase/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormat.java:[212,36]
 [DefaultCharset] Implicit use of the platform default charset, which can 
result in e.g. non-ASCII characters being silently replaced with '?' in many 
environments
{code}

It seems like they are unrelated to the patch. I will commit the latest patch.
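
(For context on the DefaultCharset warning: it flags calls like String#getBytes() that 
rely on the JVM's platform default charset. A minimal standalone illustration, not the 
actual TestTableInputFormat code:)

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetExample {
  public static void main(String[] args) {
    String row = "row-\u00e9"; // non-ASCII character, where the charset actually matters

    // What the DefaultCharset check warns about: the encoding depends on the platform
    // default charset, so the resulting bytes can differ between environments.
    byte[] platformDependent = row.getBytes();

    // The explicit-charset form that avoids the warning.
    byte[] explicitUtf8 = row.getBytes(StandardCharsets.UTF_8);

    System.out.println(Arrays.equals(platformDependent, explicitUtf8));
  }
}
{code}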

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, hbase clients writes to separate clusters(probably 
> different datacenters) tables for redundancy. As an administrator/application 
> architect, I would like to find out if both cluster tables are in the same 
> state (cell by cell). One of the tools that is readily available to use is 
> VerifyRep which is part of replication.
> However, it requires peerId to be setup on atleast of the involved cluster. 
> PeerId is unnecessary in this use-case scenario and possibly cause unintended 
> consequences as the clusters aren't really replication peers neither do We 
> prefer them to be.
> Looking at the code:
> Tool attempts to get only the clusterKey which is essentially ZooKeeper 
> quorum url
>  
> {code:java}
> //VerifyReplication.java
> private static Pair 
> getPeerQuorumConfig(final Configuration conf, String peerId)
> .
> .
> return Pair.newPair(peerConfig,
>         ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
> //ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(ReplicationPeerConfig 
> peerConfig, Configuration baseConf) throws ReplicationException {
> Configuration otherConf;
> try {
> otherConf = HBaseConfiguration.createClusterConf(baseConf, 
> peerConfig.getClusterKey());{code}
>  
>  
> So I would like to propose to update the tool to pass the remote cluster 
> ZkQuorum as an argument (ex. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure ) and use it 
> effectively without dependence on replication peerId, similar to 
> peerFSAddress. The are certain advantages in doing so as follows:
>  * Reduce the development/maintenance of separate tool for above scenario
>  * Allow the tool to be more useful for other scenarios as well such as 
>  ** validating backups in remote cluster HBASE-19106
>  ** compare cloned tableA and original tableA in same/remote cluster incase 
> of user error before restoring snapshot to original table to find the records 
> that need to be added/invalid/missing etc
>  ** Allow backup operators who are non-Hbase admins(who shouldn't be adding 
> the peerId) to run the tool, since currently only Hbase superuser can add a 
> peerId for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21512) Introduce an AsyncClusterConnection and replace the usage of ClusterConnection

2019-02-09 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764156#comment-16764156
 ] 

Hudson commented on HBASE-21512:


Results for branch HBASE-21512
[build #93 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/93/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/93//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/93//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/93//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Introduce an AsyncClusterConnection and replace the usage of ClusterConnection
> --
>
> Key: HBASE-21512
> URL: https://issues.apache.org/jira/browse/HBASE-21512
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
>
> At least for the RSProcedureDispatcher, with CompletableFuture we do not need 
> to set a delay and use a thread pool any more, which could reduce the 
> resource usage and also the latency.
> Once this is done, I think we can remove the ClusterConnection completely, 
> and start to rewrite the old sync client based on the async client, which 
> could reduce the code base a lot for our client.
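
For context, a minimal sketch (illustrative only; openRegion below is a stand-in, not an HBase API) of why a CompletableFuture-based connection removes the need for a delay plus a dedicated thread pool in a dispatcher: the follow-up work runs as a completion callback instead of a task that blocks a pooled thread while waiting.

{code:java}
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: callback-style dispatch on top of an async call.
public class AsyncDispatchSketch {

  // Stand-in for an async RPC; in a real client this future would be completed
  // by the RPC layer when the response (or a failure) arrives.
  private CompletableFuture<Void> openRegion(String regionName) {
    return CompletableFuture.runAsync(() -> { /* pretend remote call */ });
  }

  public void dispatch(String regionName) {
    // No thread sits blocked waiting on the call, and no separate retry pool
    // with a fixed delay is needed; the retry is just another continuation.
    openRegion(regionName).whenComplete((ignored, error) -> {
      if (error != null) {
        dispatch(regionName); // naive immediate retry, purely for illustration
      }
    });
  }
}
{code}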



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764154#comment-16764154
 ] 

Hadoop QA commented on HBASE-21201:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 7s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 6s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
37s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
16s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 32s{color} 
| {color:red} hbase-mapreduce generated 4 new + 154 unchanged - 4 fixed = 158 
total (was 158) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} hbase-mapreduce: The patch generated 0 new + 17 
unchanged - 2 fixed = 17 total (was 19) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
13s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 38s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 14m 
21s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 44m  9s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21201 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958131/HBASE-21201.master.004.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 85231a7ba9df 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| javac | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15907/artifact/patchprocess/diff-compile-javac-hbase-mapreduce.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15907/testReport/ |
| Max. process+thread count | 5278 (vs. ulimit of 1) |
| 

[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Attachment: HBASE-21862-forUT.patch

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it just fine. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result, the region ends up opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix, though, would require one of the following:
> 1) On an unknown failure from open, ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we can 
> ascertain that the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's sitting in 
> some queue or even in transit on the network (via a nonce-like mechanism).
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
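
For reference, a minimal sketch of the behaviour named in the issue title (this is not the attached patch; the class and method shown are simplified stand-ins): keep the concrete connection-exception type when adding the remote address, rather than collapsing everything into a generic IOException, so timeouts and connect failures remain distinguishable to the calling code.

{code:java}
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;

// Sketch only: preserve the original exception type for connection problems
// while still prepending the remote address for context.
public class WrapExceptionSketch {
  public static IOException wrap(InetSocketAddress addr, Throwable error) {
    if (error instanceof ConnectException) {
      // connection refused: keep the type so callers can still match on it
      return (IOException) new ConnectException("Call to " + addr + " failed: "
          + error.getMessage()).initCause(error);
    }
    if (error instanceof SocketTimeoutException) {
      return (IOException) new SocketTimeoutException("Call to " + addr
          + " timed out: " + error.getMessage()).initCause(error);
    }
    // everything else falls back to a generic wrapper
    return new IOException("Call to " + addr + " failed on local exception: " + error, error);
  }
}
{code}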
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> 

[jira] [Updated] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiro Suzuki updated HBASE-21201:
-
Attachment: HBASE-21201.master.004.patch

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (possibly in 
> different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One of the tools readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved clusters. 
> A peerId is unnecessary in this use case and could possibly cause unintended 
> consequences, as the clusters aren't really replication peers, nor do we 
> want them to be.
> Looking at the code, the tool only needs the clusterKey, which is essentially 
> the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration> getPeerQuorumConfig(
>     final Configuration conf, String peerId)
> ...
>   return Pair.newPair(peerConfig,
>       ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
>
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(ReplicationPeerConfig peerConfig,
>     Configuration baseConf) throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf, peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster's 
> ZooKeeper quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it 
> directly, without depending on a replication peerId, similar to 
> peerFSAddress. There are certain advantages in doing so:
>  * Reduce the development/maintenance of a separate tool for the above scenario
>  * Allow the tool to be more useful for other scenarios as well, such as 
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA against the original tableA in the same or a remote 
> cluster (e.g. after a user error, before restoring a snapshot to the original 
> table) to find the records that are missing, invalid, or need to be added
>  ** allowing backup operators who are not HBase admins (and who shouldn't be adding 
> a peerId) to run the tool, since currently only the HBase superuser can add a 
> peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764146#comment-16764146
 ] 

Hadoop QA commented on HBASE-21862:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
37s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
32s{color} | {color:green} hbase-client: The patch generated 0 new + 3 
unchanged - 3 fixed = 3 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
36s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 40s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
10s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 37m 39s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21862 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958130/HBASE-21862-v1.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux d05158629be0 4.4.0-139-generic #165~14.04.1-Ubuntu SMP Wed Oct 
31 10:55:11 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15906/testReport/ |
| Max. process+thread count | 264 (vs. ulimit of 1) |
| modules | C: hbase-client U: hbase-client |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15906/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |



[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764145#comment-16764145
 ] 

Hadoop QA commented on HBASE-21862:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
13s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 9s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} hbase-client: The patch generated 0 new + 3 
unchanged - 3 fixed = 3 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
12s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 39s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
16s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 34m 36s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21862 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958130/HBASE-21862-v1.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 24c930bfa854 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15905/testReport/ |
| Max. process+thread count | 296 (vs. ulimit of 1) |
| modules | C: hbase-client U: hbase-client |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15905/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |



[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Attachment: HBASE-21862-v1.patch

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-v1.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it just fine. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result, the region ends up opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix, though, would require one of the following:
> 1) On an unknown failure from open, ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we can 
> ascertain that the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's sitting in 
> some queue or even in transit on the network (via a nonce-like mechanism).
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764128#comment-16764128
 ] 

Hadoop QA commented on HBASE-21862:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 9s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
12s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
54s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} hbase-client: The patch generated 0 new + 3 
unchanged - 3 fixed = 3 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 9s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 33s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m  
6s{color} | {color:red} hbase-client generated 1 new + 0 unchanged - 0 fixed = 
1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
15s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 34m 26s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hbase-client |
|  |  Exception is caught when Exception is not thrown in 
org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(InetSocketAddress, Throwable) 
 At IPCUtil.java:is not thrown in 
org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(InetSocketAddress, Throwable) 
 At IPCUtil.java:[line 187] |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21862 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958129/HBASE-21862.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux ee66af0630aa 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 

[jira] [Commented] (HBASE-21135) Build fails on windows as it fails to parse windows path during license check

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764121#comment-16764121
 ] 

Nihal Jain commented on HBASE-21135:


[~busbey] If the patch looks fine, could you commit this to affected branches?

> Build fails on windows as it fails to parse windows path during license check
> -
>
> Key: HBASE-21135
> URL: https://issues.apache.org/jira/browse/HBASE-21135
> Project: HBase
>  Issue Type: Bug
>  Components: build
>Affects Versions: 3.0.0, 1.4.0, 1.3.2, 1.1.12, 1.2.7, 2.1.1
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
>  Labels: cygwin
> Fix For: 3.0.0
>
> Attachments: HBASE-21135.master.001.patch
>
>
> License check via the enforcer plugin throws the following error during a build 
> on Windows:
> {code:java}
> Sourced file: inline evaluation of: ``File license = new 
> File("D:\DS\HBase_2\hbase\hbase-shaded\target/maven-shared-ar . . . '' Token 
> Parsing Error: Lexical error at line 1, column 29.  Encountered: "D" (68), 
> after : "\"D:\\": {code}
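
A short aside on the lexical error quoted above (illustrative only, not necessarily what the attached patch does): in Java/Beanshell source, a bare backslash inside a string literal starts an escape sequence, so an unescaped Windows path like D:\DS\... cannot be parsed. Escaping the backslashes, or using forward slashes (which java.io.File also accepts on Windows), avoids the problem:

{code:java}
import java.io.File;

// Sketch only: the same path from the log, written in two forms that lex correctly.
public class WindowsPathSketch {
  // backslashes escaped
  static final File ESCAPED = new File(
      "D:\\DS\\HBase_2\\hbase\\hbase-shaded\\target/maven-shared-archive-resources/META-INF/LICENSE");
  // forward slashes, which java.io.File also resolves on Windows
  static final File FORWARD = new File(
      "D:/DS/HBase_2/hbase/hbase-shaded/target/maven-shared-archive-resources/META-INF/LICENSE");
}
{code}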
> The complete stack trace for the command
> {code:java}
> mvn clean install -DskipTests -X
> {code}
> is as follows:
> {noformat}
> [INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (check-aggregate-license) @ 
> hbase-shaded ---
> [DEBUG] Configuring mojo 
> org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1:enforce from plugin 
> realm 
> ClassRealm[plugin>org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1, 
> parent: sun.misc.Launcher$AppClassLoader@55f96302]
> [DEBUG] Configuring mojo 
> 'org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1:enforce' with basic 
> configurator -->
> [DEBUG]   (s) fail = true
> [DEBUG]   (s) failFast = false
> [DEBUG]   (f) ignoreCache = false
> [DEBUG]   (f) mojoExecution = 
> org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1:enforce {execution: 
> check-aggregate-license}
> [DEBUG]   (s) project = MavenProject: 
> org.apache.hbase:hbase-shaded:2.1.1-SNAPSHOT @ 
> D:\DS\HBase_2\hbase\hbase-shaded\pom.xml
> [DEBUG]   (s) condition = File license = new 
> File("D:\DS\HBase_2\hbase\hbase-shaded\target/maven-shared-archive-resources/META-INF/LICENSE");
> // Beanshell does not support try-with-resources,
> // so we must close this scanner manually
> Scanner scanner = new Scanner(license);
> while (scanner.hasNextLine()) {
>   if (scanner.nextLine().startsWith("ERROR:")) {
> scanner.close();
> return false;
>   }
> }
> scanner.close();
> return true;
> [DEBUG]   (s) message = License errors detected, for more detail find ERROR in
> 
> D:\DS\HBase_2\hbase\hbase-shaded\target/maven-shared-archive-resources/META-INF/LICENSE
> [DEBUG]   (s) rules = 
> [org.apache.maven.plugins.enforcer.EvaluateBeanshell@7e307087]
> [DEBUG]   (s) session = org.apache.maven.execution.MavenSession@5e1218b4
> [DEBUG]   (s) skip = false
> [DEBUG] -- end configuration --
> [DEBUG] Executing rule: org.apache.maven.plugins.enforcer.EvaluateBeanshell
> [DEBUG] Echo condition : File license = new 
> File("D:\DS\HBase_2\hbase\hbase-shaded\target/maven-shared-archive-resources/META-INF/LICENSE");
> // Beanshell does not support try-with-resources,
> // so we must close this scanner manually
> Scanner scanner = new Scanner(license);
> while (scanner.hasNextLine()) {
>   if (scanner.nextLine().startsWith("ERROR:")) {
> scanner.close();
> return false;
>   }
> }
> scanner.close();
> return true;
> [DEBUG] Echo script : File license = new 
> File("D:\DS\HBase_2\hbase\hbase-shaded\target/maven-shared-archive-resources/META-INF/LICENSE");
> // Beanshell does not support try-with-resources,
> // so we must close this scanner manually
> Scanner scanner = new Scanner(license);
> while (scanner.hasNextLine()) {
>   if (scanner.nextLine().startsWith("ERROR:")) {
> scanner.close();
> return false;
>   }
> }
> scanner.close();
> return true;
> [DEBUG] Adding failure due to exception
> org.apache.maven.enforcer.rule.api.EnforcerRuleException: Couldn't evaluate 
> condition: File license = new 
> 

[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-09 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764120#comment-16764120
 ] 

Nihal Jain commented on HBASE-21636:


{quote} 
{code:java}
replica_id = args[REGION_REPLICA_ID]
isolation_level = args[ISOLATION_LEVEL]
read_type = args[READ_TYPE]
{code}
 Dumb question: how do these fields get the default values?
{quote}
The following lines ensure that we set the corresponding fields on the {{scan}} 
object only if the corresponding argument was passed by the user.
{code:ruby}
scan.setReplicaId(replica_id) if replica_id
scan.setIsolationLevel(org.apache.hadoop.hbase.client.IsolationLevel.valueOf(isolation_level)) if isolation_level
scan.setReadType(org.apache.hadoop.hbase.client::Scan::ReadType.valueOf(read_type)) if read_type
{code}
For example, {{replica_id = args[REGION_REPLICA_ID]}} initializes 
{{replica_id}} to some non-nil value if the user passes it in {{args}}; 
otherwise it is initialized to {{nil}}.
 Next, the body of the statement {{scan.setReplicaId(replica_id) if replica_id}} 
executes only if the condition {{replica_id}} evaluates to {{true}} (i.e. is non-nil).

If {{replica_id}} is {{nil}}, we won't even execute 
{{scan.setReplicaId(replica_id)}}. So we don't need to initialize default 
values, as we never make the scan.setX() call.

I hope this makes sense.

[Sorry for delayed reply, was away from my laptop]
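
A hypothetical Java analogue of the pattern above (illustrative only, not the shell code; the READ_TYPE key mirrors the shell argument): the setter is invoked only when the caller actually supplied the argument, so the Scan keeps its built-in default otherwise and no explicit default value is needed.

{code:java}
import java.util.Map;

import org.apache.hadoop.hbase.client.Scan;

// Sketch only: apply an optional argument to the Scan only if it was provided.
public class ScanArgsSketch {
  public static Scan applyReadType(Scan scan, Map<String, String> args) {
    String readType = args.get("READ_TYPE"); // null when the user did not pass it
    if (readType != null) {
      scan.setReadType(Scan.ReadType.valueOf(readType));
    }
    return scan;
  }
}
{code}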

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21636.master.001.patch, 
> HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764116#comment-16764116
 ] 

Hadoop QA commented on HBASE-21201:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
34s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 7s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
16s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 32s{color} 
| {color:red} hbase-mapreduce generated 4 new + 154 unchanged - 4 fixed = 158 
total (was 158) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} hbase-mapreduce: The patch generated 0 new + 17 
unchanged - 2 fixed = 17 total (was 19) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 8s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 33s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 14m 
11s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 43m 58s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21201 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958126/HBASE-21201.master.003.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 2bcc48cf56c7 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| javac | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15903/artifact/patchprocess/diff-compile-javac-hbase-mapreduce.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15903/testReport/ |
| Max. process+thread count | 5288 (vs. ulimit of 1) |
| 

[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Assignee: Duo Zhang
  Status: Patch Available  (was: Open)

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it just fine. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result, the region ends up opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix, though, would require one of the following:
> 1) On an unknown failure from open, ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we can 
> ascertain that the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's sitting in 
> some queue or even in transit on the network (via a nonce-like mechanism).
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764114#comment-16764114
 ] 

Duo Zhang commented on HBASE-21862:
---

Review board link:

https://reviews.apache.org/r/69934/

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it just fine. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result, the region ends up opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix, though, would require one of the following:
> 1) On an unknown failure from open, ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we can 
> ascertain that the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's sitting in 
> some queue or even in transit on the network (via a nonce-like mechanism).
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> 

[jira] [Commented] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764113#comment-16764113
 ] 

Duo Zhang commented on HBASE-21862:
---

PTAL sir [~stack]. This is a very bad issue; we should include it in the 2.0.5 
release too.

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it just fine. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result, the region ends up opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, 

[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Attachment: HBASE-21862.patch

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 

[jira] [Updated] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiro Suzuki updated HBASE-21201:
-
Attachment: HBASE-21201.master.003.patch

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One of the tools readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this scenario and can possibly cause 
> unintended consequences, as the clusters aren't really replication peers, nor 
> do we prefer them to be.
> Looking at the code, the tool attempts to get only the clusterKey, which is 
> essentially the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>   return Pair.newPair(peerConfig,
>       ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
> 
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster 
> ZK quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it 
> without depending on a replication peerId, similar to peerFSAddress. There are 
> certain advantages in doing so:
>  * Reduces the development/maintenance of a separate tool for the above scenario
>  * Allows the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA and the original tableA in the same/remote 
> cluster, in case of user error before restoring a snapshot to the original 
> table, to find the records that need to be added/are invalid/are missing, etc.
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding the peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]
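
For illustration, a minimal sketch of the idea above: build the peer cluster 
Configuration directly from a ZooKeeper cluster key (quorum:port/znode) instead 
of looking it up via a replication peerId. Only HBaseConfiguration.createClusterConf 
is an existing API here; the class and method names are invented for the example.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class PeerQuorumConfExample {

  // Creates a Configuration for the remote cluster from a cluster key of the
  // form "zk1,zk2,zk3:2181/hbase", the same format stored in a peer's clusterKey.
  static Configuration buildPeerConf(Configuration baseConf, String peerQuorumAddress)
      throws IOException {
    // createClusterConf overlays the ZK quorum, client port and znode parent
    // parsed from the cluster key onto a copy of the base configuration.
    return HBaseConfiguration.createClusterConf(baseConf, peerQuorumAddress);
  }

  public static void main(String[] args) throws IOException {
    Configuration local = HBaseConfiguration.create();
    Configuration peer =
        buildPeerConf(local, "clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure");
    System.out.println(peer.get("hbase.zookeeper.quorum"));
  }
}
{code}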



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Fix Version/s: 2.3.0
   2.0.5
   2.1.3
   2.2.0
   3.0.0

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> 

[jira] [Updated] (HBASE-21853) update copyright notices to 2019

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21853:
--
Fix Version/s: (was: 2.1.4)
   2.1.3

> update copyright notices to 2019
> 
>
> Key: HBASE-21853
> URL: https://issues.apache.org/jira/browse/HBASE-21853
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Reporter: Sean Busbey
>Assignee: Sean Busbey
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 1.4.10, 2.1.3, 2.0.5, 1.3.4, 1.2.11, 
> 2.3.0, 1.5.1
>
> Attachments: HBASE-21853.0.patch
>
>
> we've had copyrightable changes put in place since the new year, so update 
> the date range.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764110#comment-16764110
 ] 

Hadoop QA commented on HBASE-21201:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 4s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
10s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 32s{color} 
| {color:red} hbase-mapreduce generated 4 new + 154 unchanged - 4 fixed = 158 
total (was 158) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} hbase-mapreduce: The patch generated 0 new + 17 
unchanged - 2 fixed = 17 total (was 19) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
11s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 35s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 14m 
17s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 43m 59s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21201 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958122/HBASE-21201.master.003.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux c3f49e02775c 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / abaeeace00 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| javac | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15902/artifact/patchprocess/diff-compile-javac-hbase-mapreduce.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15902/testReport/ |
| Max. process+thread count | 5330 (vs. ulimit of 1) |
| 

[jira] [Updated] (HBASE-21857) Do not need to check clusterKey if replicationEndpoint is provided when adding a peer

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21857:
--
Fix Version/s: (was: 2.1.4)
   2.1.3

> Do not need to check clusterKey if replicationEndpoint is provided when 
> adding a peer
> -
>
> Key: HBASE-21857
> URL: https://issues.apache.org/jira/browse/HBASE-21857
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.3.0
>
> Attachments: HBASE-21857.patch
>
>
> The clusterKey check was added in HBASE-19630, which is part of the work for 
> HBASE-19397.
> In HBASE-19630 we claim that we always check the clusterKey when adding a peer 
> on the RS side, but this is not true, as the clusterKey could be null. And it 
> would be strange if we implemented a ReplicationEndpoint that writes to Kafka 
> and still needed to provide a cluster key in the HBase format.
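
For illustration, a minimal sketch of the check described above, assuming a 
supplied ReplicationEndpoint implementation makes the clusterKey optional. The 
helper class, method name and error message are hypothetical; 
ZKConfig.validateClusterKey and the ReplicationPeerConfig getters are existing APIs.
{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.DoNotRetryIOException;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
import org.apache.hadoop.hbase.zookeeper.ZKConfig;

final class PeerConfigCheck {

  // Only insist on a well-formed clusterKey when no custom endpoint is given.
  static void checkClusterKey(ReplicationPeerConfig peerConfig) throws DoNotRetryIOException {
    if (peerConfig.getReplicationEndpointImpl() != null) {
      // A custom endpoint (e.g. one that writes to Kafka) may not need an
      // HBase cluster key at all, so skip the format check.
      return;
    }
    try {
      ZKConfig.validateClusterKey(peerConfig.getClusterKey());
    } catch (IOException e) {
      throw new DoNotRetryIOException(
        "Invalid cluster key: " + peerConfig.getClusterKey(), e);
    }
  }
}
{code}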



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21816) Print source cluster replication config directory

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21816:
--
Fix Version/s: (was: 2.1.4)
   2.1.3

> Print source cluster replication config directory
> -
>
> Key: HBASE-21816
> URL: https://issues.apache.org/jira/browse/HBASE-21816
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 3.0.0, 2.0.0
> Environment: NA
>Reporter: Karthik Palanisamy
>Assignee: Karthik Palanisamy
>Priority: Trivial
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.3.0
>
> Attachments: HBASE-21816-001.patch, HBASE-21816-002.patch, 
> HBASE-21816-003.patch, HBASE-21816.master.001.patch
>
>
> Users may get confused about which HBase configurations are loaded for 
> replication. Sometimes a user may place both the source and destination 
> cluster conf under the "/etc/hbase/conf" directory. This creates uncertainty, 
> because our log suggests that all the configurations are co-located.
>  
> Existing log:
> {code:java}
> INFO  [RpcServer.FifoWFPBQ.replication.handler=2,queue=0,port=16020] 
> regionserver.DefaultSourceFSConfigurationProvider: Loading source cluster 
> HDP1 file system configurations from xml files under directory 
> /etc/hbase/conf/
> {code}
> But it should be something like,
> {code:java}
> INFO  [RpcServer.FifoWFPBQ.replication.handler=2,queue=0,port=16020] 
> regionserver.DefaultSourceFSConfigurationProvider: Loading source cluster 
> HDP1 file system configurations from xml files under directory 
> /etc/hbase/conf/HDP1
> {code}
>  
> This jira is only to change the log line; there is no issue with the functionality.
> {code:java}
> File confDir = new File(replicationConfDir, replicationClusterId);
> String[] listofConfFiles = FileUtil.list(confDir);
> for (String confFile : listofConfFiles) {
>   if (new File(confDir, confFile).isFile() && confFile.endsWith(XML)) {
>     // Add all the user provided client conf files
>     sourceClusterConf.addResource(new Path(confDir.getPath(), confFile));
>   }
> }
> {code}
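
For illustration, a minimal, self-contained sketch of the requested log-line 
change: point the message at the per-cluster subdirectory (e.g. /etc/hbase/conf/HDP1) 
rather than the parent conf directory. Names and the exact wording are 
illustrative, not the committed patch.
{code:java}
import java.io.File;

public final class ReplicationConfDirLogExample {
  public static void main(String[] args) {
    String replicationConfDir = "/etc/hbase/conf";
    String replicationClusterId = "HDP1";
    // The per-cluster directory that the xml files are actually loaded from.
    File confDir = new File(replicationConfDir, replicationClusterId);
    // Log confDir (not replicationConfDir) so operators can tell which
    // cluster's configuration files are being read.
    System.out.println("Loading source cluster " + replicationClusterId
        + " file system configurations from xml files under directory " + confDir);
  }
}
{code}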



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Description: 
It's a classic bug, sort of... the call to open the region times out, but the RS 
actually processes it just fine. It could also happen if the response didn't make 
it back due to a network issue.
As a result, the region is opened on two servers.
There are some mitigations possible to narrow down the race window.
1) Don't process expired open calls; fail them instead. This won't help for 
network issues.
2) Don't ignore an invalid RS state; kill the RS (YouAreDead exception) - but 
that will require fixing other network races where the master kills the RS, 
which would require adding state versioning to the protocol.

The fundamental fix, though, would require either
1) on an unknown failure from open, ascertaining the state of the region from 
the server. Again, this would probably require protocol changes to make sure we 
ascertain that the region is not opened, and also that the 
already-failed-on-master open is NOT going to be processed if it's in some queue 
or even in transit on the network (via a nonce-like mechanism);
2) some form of a distributed lock per region, e.g. in ZK; or
3) some form of 2PC? But the participant list cannot be determined in a manner 
that's both scalable and guaranteed correct. Theoretically it could be all RSes.


{noformat}
2019-02-08 03:21:31,715 INFO  [PEWorker-7] procedure.MasterProcedureScheduler: 
Took xlock for pid=260626, ppid=260595, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN
2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
location=server1,17020,1549567999303; forceNewPlan=false, retain=true
2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
regionState=OPENING, regionLocation=server1,17020,1549623714617
2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
ppid=260626, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... to 
server server1,17020,1549623714617 failed
java.io.IOException: Call to server1/...:17020 failed on local exception: 
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
waitTime=60145, rpcTimeout=6^M
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
...
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
waitTime=60145, rpcTimeout=6^M
at 
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
... 4 more^M
{noformat}
RS:
{noformat}
hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
[RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: Open 
...d0214809147e43dc6870005742d5d204.
...
hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
[RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
Opened ...d0214809147e43dc6870005742d5d204.
{noformat}
Retry:
{noformat}
2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=260626, 
ppid=260595, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
hasLock=true; TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
location=server1,17020,1549623714617
2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null; 
forceNewPlan=true, retain=false
2019-02-08 03:22:33,238 INFO  [PEWorker-7] assignment.RegionStateStore: 
pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
regionState=OPENING, regionLocation=server2,17020,1549569075319
{noformat}
The ignored report message:
{noformat}
2019-02-08 03:25:44,754 WARN  
[RpcServer.default.FPBQ.Fifo.handler=34,queue=4,port=17000] 
assignment.TransitRegionStateProcedure: Received report OPENED transition from 
server1,17020,1549623714617 for rit=OPENING, 
location=server2,17020,1549569075319, table=table, 
region=d0214809147e43dc6870005742d5d204, pid=260626 but the region is not on 
it, should be a retry, ignore
{noformat}
The second assignment:
{noformat}
2019-02-08 03:26:18,915 INFO  [PEWorker-7] 

[jira] [Commented] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764108#comment-16764108
 ] 

Duo Zhang commented on HBASE-21862:
---

Oh shit, I think the problem here is that we wrap the original 
CallTimeoutException with an IOException? I think this will be a big problem 
for all 2.0+ branches. Let me fix...
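
For illustration, a minimal sketch of what "keep the original exception type" 
means here, assuming the usual connection exception classes: decorate the message 
with the remote address while returning an exception of the same class, so retry 
logic that checks instanceof CallTimeoutException (or ConnectException, etc.) 
still works. This is not the actual IPCUtil code; the class and method names are 
invented.
{code:java}
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

import org.apache.hadoop.hbase.ipc.CallTimeoutException;

final class WrapKeepingType {

  static IOException wrap(String addr, Throwable error) {
    String msg = "Call to " + addr + " failed on local exception: " + error;
    if (error instanceof ConnectException) {
      return (IOException) new ConnectException(msg).initCause(error);
    }
    if (error instanceof SocketTimeoutException) {
      return (IOException) new SocketTimeoutException(msg).initCause(error);
    }
    if (error instanceof CallTimeoutException) {
      // Keep the type so callers can still recognize a timed-out call.
      return (IOException) new CallTimeoutException(msg).initCause(error);
    }
    // Fall back to a plain IOException for everything else.
    return new IOException(msg, error);
  }
}
{code}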

> region can be assigned to 2 servers due to a timed-out call
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Priority: Blocker
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 

[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should not wrap all connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Summary: IPCUtil.wrapException should not wrap all connection exceptions  
(was: region can be assigned to 2 servers due to a timed-out call)

> IPCUtil.wrapException should not wrap all connection exceptions
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Priority: Blocker
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null; 
> forceNewPlan=true, retain=false
> 2019-02-08 

[jira] [Updated] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-09 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21862:
--
Summary: IPCUtil.wrapException should keep the original exception types for 
all the connection exceptions  (was: IPCUtil.wrapException should not wrap all 
connection exceptions)

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Priority: Blocker
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> 

[jira] [Commented] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764107#comment-16764107
 ] 

Toshihiro Suzuki commented on HBASE-21201:
--

Thank you for reviewing, [~elserj].

I restored the assert statements and added a test for the snapshot support in 
the v3 patch. If QA is okay, I will commit.

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One of the tools readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this scenario and can possibly cause 
> unintended consequences, as the clusters aren't really replication peers, nor 
> do we prefer them to be.
> Looking at the code, the tool attempts to get only the clusterKey, which is 
> essentially the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>   return Pair.newPair(peerConfig,
>       ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
> 
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster 
> ZK quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it 
> without depending on a replication peerId, similar to peerFSAddress. There are 
> certain advantages in doing so:
>  * Reduces the development/maintenance of a separate tool for the above scenario
>  * Allows the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA and the original tableA in the same/remote 
> cluster, in case of user error before restoring a snapshot to the original 
> table, to find the records that need to be added/are invalid/are missing, etc.
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding the peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-09 Thread Toshihiro Suzuki (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiro Suzuki updated HBASE-21201:
-
Attachment: HBASE-21201.master.003.patch

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> in different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out whether the tables in both clusters are in 
> the same state (cell by cell). One of the tools readily available for this is 
> VerifyReplication, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this scenario and can possibly cause 
> unintended consequences, as the clusters aren't really replication peers, nor 
> do we prefer them to be.
> Looking at the code, the tool attempts to get only the clusterKey, which is 
> essentially the ZooKeeper quorum URL:
>  
> {code:java}
> // VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration>
>     getPeerQuorumConfig(final Configuration conf, String peerId)
> ...
>   return Pair.newPair(peerConfig,
>       ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
> 
> // ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(
>     ReplicationPeerConfig peerConfig, Configuration baseConf)
>     throws ReplicationException {
>   Configuration otherConf;
>   try {
>     otherConf = HBaseConfiguration.createClusterConf(baseConf,
>         peerConfig.getClusterKey());
> {code}
>  
>  
> So I would like to propose updating the tool to accept the remote cluster 
> ZK quorum as an argument (e.g. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it 
> without depending on a replication peerId, similar to peerFSAddress. There are 
> certain advantages in doing so:
>  * Reduces the development/maintenance of a separate tool for the above scenario
>  * Allows the tool to be more useful for other scenarios as well, such as:
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA and the original tableA in the same/remote 
> cluster, in case of user error before restoring a snapshot to the original 
> table, to find the records that need to be added/are invalid/are missing, etc.
>  ** allowing backup operators who are not HBase admins (and who shouldn't be 
> adding the peerId) to run the tool, since currently only an HBase superuser can 
> add a peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21863) narrow down the double-assignment race window

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764105#comment-16764105
 ] 

Duo Zhang commented on HBASE-21863:
---

Please see my last several posts in HBASE-21862; there is no such 'double 
assign' window by design. Please focus on why we give up on a 
CallTimeoutException.

Thanks.

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764104#comment-16764104
 ] 

Duo Zhang commented on HBASE-21862:
---

And there is no timeout for opening a region either. You cannot do this, 
otherwise there will be a double assign. The intention here is that if the RS is 
alive, then it will eventually report the open result back; otherwise it must be 
dead, or at least it will abort itself. Then the SCP will take care of the 
reassign.

> region can be assigned to 2 servers due to a timed-out call
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Priority: Blocker
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)
> ... 4 more
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> 

[jira] [Commented] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call

2019-02-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764103#comment-16764103
 ] 

Duo Zhang commented on HBASE-21862:
---

Let me explain the design again so folks will not go in the wrong direction and 
make the code confusing.

The executeProcedures call just sends the procedures to the regionserver, without 
waiting for the procedures' results. As you can see in the code, we simply ignore 
the response of the executeProcedures call unless there is an exception.
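
As a rough illustration of that fire-and-forget dispatch (a sketch with assumed names such as RsStub and FireAndForgetDispatcher, not the actual RSProcedureDispatcher code), the caller only reacts to a dispatch failure and never inspects the response body:

{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustration only: hypothetical stand-ins, not the real HBase classes.
interface RsStub {
  CompletableFuture<byte[]> executeProcedures(List<byte[]> serializedProcs);
}

class FireAndForgetDispatcher {
  /** Sends the procedures and ignores the response body; only an exception matters. */
  void dispatch(RsStub rs, List<byte[]> procs, Runnable onDispatchError) {
    rs.executeProcedures(procs).whenComplete((response, error) -> {
      if (error != null) {
        onDispatchError.run();  // e.g. decide whether the dispatch should be retried
      }
      // A successful response carries nothing we need: the real result arrives
      // later as a separate report from the regionserver.
    });
  }
}
{code}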

On the RS side, we execute the procedures in a thread pool, and we report back to 
the master when the procedures are done. The report is retried forever, and if it 
still fails, the RS will abort itself.
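
A minimal sketch of that RS-side behavior, with hypothetical names (RsProcedureRunner, reportToMaster, abortRegionServer) standing in for the real code, might look like this:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: hypothetical names, not the real RS-side code.
class RsProcedureRunner {
  interface ReportFn { void report() throws Exception; }

  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  void execute(Runnable procedure, ReportFn reportToMaster, Runnable abortRegionServer) {
    pool.submit(() -> {
      procedure.run();
      while (true) {
        try {
          reportToMaster.report();      // tell the master the procedure is done
          return;
        } catch (Exception e) {
          if (isFatal(e)) {             // e.g. the master no longer recognizes this RS
            abortRegionServer.run();    // abort so the master can reassign safely
            return;
          }
          // transient failure: keep retrying, the report must not be lost
        }
      }
    });
  }

  private boolean isFatal(Exception e) {
    return false;  // simplified placeholder for the sketch
  }
}
{code}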

So on the master side, the logic is that we retry forever unless we know that 
the RS is dead, or some other condition holds (see the code in the shouldRetry 
method). And obviously the problem here is that we give up retrying on a 
CallTimeoutException, which is a no-no. We need to check the code in the 
shouldRetry method to see why this could happen.
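
To make the point about the bug concrete, here is a hedged sketch of that retry decision (a hypothetical shouldRetryRemoteCall, not the actual shouldRetry implementation): a call timeout has to be treated as retryable, since the outcome on the RS side is unknown.

{code:java}
import java.io.IOException;
import java.net.SocketTimeoutException;

// Illustration of the decision described above: only a definitive condition
// (for example a confirmed-dead server) may stop the retries; an ambiguous
// timeout must not, because the RS may still open the region.
class RemoteCallRetryPolicy {

  /** Hypothetical stand-in for org.apache.hadoop.hbase.ipc.CallTimeoutException. */
  static class CallTimeoutException extends IOException {}

  boolean shouldRetryRemoteCall(IOException error, boolean serverIsDead) {
    if (serverIsDead) {
      return false;  // safe to stop: the ServerCrashProcedure will reassign
    }
    if (error instanceof CallTimeoutException || error instanceof SocketTimeoutException) {
      return true;   // outcome unknown on the RS side; giving up risks a double assign
    }
    return true;     // default for this sketch: keep retrying
  }
}
{code}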
