[jira] [Updated] (HBASE-26548) Investigate mTLS in RPC layer

2021-12-08 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26548:
-
Description: 
Current authentication options are heavily based on SASL and Kerberos. For 
organizations that don't already deploy Kerberos or another token provider, this 
is a heavy lift. Another very common way of authenticating in the industry is 
mTLS, which makes use of SSL certificates and can solve both wire encryption 
and auth. For those already deploying trusted certificates in their infra, mTLS 
may be much easier to integrate.

It isn't necessarily easy to implement this, but I do think we could use the 
existing Netty SSL support in the NettyRpcClient and NettyRpcServer. I know 
it's easy to add SSL to non-blocking IO through a 
hadoop.rpc.socket.factory.class.default which returns SSLSockets, but that 
doesn't touch on certificate verification at all.

Much more investigation is needed, but filing this due to some interest 
encountered on Slack.

Slack thread: 
https://apache-hbase.slack.com/archives/C13K8NVAM/p1638980520110600
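
For illustration only, a minimal sketch of what mTLS on the Netty RPC path could look like. It assumes the shaded netty SSL classes and placeholder PEM paths; it is not the actual NettyRpcClient/NettyRpcServer wiring.

{code:java}
// Minimal sketch only, assuming the shaded netty package and placeholder PEM paths;
// not the actual NettyRpcClient/NettyRpcServer code.
import java.io.File;
import org.apache.hbase.thirdparty.io.netty.channel.socket.SocketChannel;
import org.apache.hbase.thirdparty.io.netty.handler.ssl.ClientAuth;
import org.apache.hbase.thirdparty.io.netty.handler.ssl.SslContext;
import org.apache.hbase.thirdparty.io.netty.handler.ssl.SslContextBuilder;

public class MutualTlsSketch {

  // Server side: present the server certificate and require a client certificate
  // signed by the trusted CA; the "mutual" part that plain SSLSockets don't enforce.
  static SslContext serverContext() throws Exception {
    return SslContextBuilder
        .forServer(new File("/path/to/server-cert.pem"), new File("/path/to/server-key.pem"))
        .trustManager(new File("/path/to/ca-cert.pem"))
        .clientAuth(ClientAuth.REQUIRE)
        .build();
  }

  // The SslHandler sits first in the channel pipeline, so every RPC byte is
  // encrypted and the peer certificate is verified during the TLS handshake.
  static void addTls(SocketChannel ch, SslContext ctx) {
    ch.pipeline().addFirst("ssl", ctx.newHandler(ch.alloc()));
  }
}
{code}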

  was:
Current authentication options are heavily based on SASL and Kerberos. For 
organizations that don't already deploy Kerberos or another token provider, this 
is a heavy lift. Another very common way of authenticating in the industry is 
mTLS, which makes use of SSL certificates and can solve both wire encryption 
and auth. For those already deploying trusted certificates in their infra, mTLS 
may be much easier to integrate.

It isn't necessarily easy to implement this, but I do think we could use the 
existing Netty SSL support in the NettyRpcClient and NettyRpcServer. I also 
know it's easy to add SSL to non-blocking IO through a 
hadoop.rpc.socket.factory.class.default which returns SSLSockets (we do this at 
my company), but that doesn't touch on certificate verification at all.

Much more investigation is needed, but filing this due to some interest 
encountered on Slack.


> Investigate mTLS in RPC layer
> -
>
> Key: HBASE-26548
> URL: https://issues.apache.org/jira/browse/HBASE-26548
> Project: HBase
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Priority: Major
>
> Current authentication options are heavily based on SASL and Kerberos. For 
> organizations that don't already deploy Kerberos or another token provider, 
> this is a heavy lift. Another very common way of authenticating in the 
> industry is mTLS, which makes use of SSL certificates and can solve both 
> wire encryption and auth. For those already deploying trusted certificates in 
> their infra, mTLS may be much easier to integrate.
> It isn't necessarily easy to implement this, but I do think we could use the 
> existing Netty SSL support in the NettyRpcClient and NettyRpcServer. I know 
> it's easy to add SSL to non-blocking IO through a 
> hadoop.rpc.socket.factory.class.default which returns SSLSockets, but that 
> doesn't touch on certificate verification at all.
> Much more investigation is needed, but filing this due to some interest 
> encountered on Slack.
> Slack thread: 
> https://apache-hbase.slack.com/archives/C13K8NVAM/p1638980520110600



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HBASE-25051) DIGEST based auth broken for MasterRegistry

2021-08-13 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17398980#comment-17398980
 ] 

Bharath Vissapragada edited comment on HBASE-25051 at 8/13/21, 11:11 PM:
-

I think one way to fix this problem is to "advertise" the cluster ID as a part 
of channel setup. This is an IPC protocol change. Currently the way connection 
setup works is 

1. client to server: socket connect()
2. client sends a connection preamble (validated on server and connection 
closed if malformed)
3. client optionally does a sasl handshake (if configured)

(3) is where cluster ID is needed (if token is configured as a part of DIGEST 
based auth). 

Now my proposal is to modify it as follows.

1. client to server: socket connect()
2. server responds with a 16-byte UUID that is read by the client (the client 
can determine the actual SASL mode from this clusterId by looking up the tokens)
3. client sends a connection preamble (validated on server and connection 
closed if malformed)
4. client optionally does a sasl handshake (using the sasl and token from 
step(2)).

A sample implementation using netty IPC is something like this 
https://github.com/bharathv/hbase/commit/95ff0d65828e8459d212e6173c00648b9c7b6814
 . The patch still needs to do the following,

1. Move providers.selectProvider(clusterId, ticket) .. to after step (2)
2. Implement an equivalent change in Blocking IPC.

One problem I can think of is compatibility between an old client and a new server 
(say during a rolling upgrade when the server is upgraded first). I can see if I 
can get it working by making the client ignore this piece of the response.
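
For illustration, a minimal sketch of what the extra step (2) could look like on the wire. The helper names here are hypothetical and are not taken from the linked patch.

{code:java}
// A minimal sketch of the proposed ordering only (step 2 above); the class and
// method names are illustrative, not actual HBase IPC code.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import java.util.UUID;

final class ClusterIdPreambleSketch {

  // Server side: right after accept(), advertise the 16-byte cluster UUID.
  static void advertiseClusterId(DataOutputStream out, UUID clusterId) throws Exception {
    out.writeLong(clusterId.getMostSignificantBits());
    out.writeLong(clusterId.getLeastSignificantBits());
    out.flush();
  }

  // Client side: read the UUID before sending the connection preamble, then use it
  // to select the SASL provider / look up DIGEST tokens for this cluster.
  static UUID readClusterId(Socket socket) throws Exception {
    DataInputStream in = new DataInputStream(socket.getInputStream());
    return new UUID(in.readLong(), in.readLong());
  }
}
{code}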

cc: [~zhangduo] WDYT? Do you think this breaks anything else too, or is there a 
better way?



was (Author: bharathv):
I think one way to fix this problem is to "advertise" the cluster ID as a part 
of channel setup. This is an IPC protocol change. Currently the way connection 
setup works is 

1. client to server: socket connect()
2. client sends a connection preamble (validated on server and connection 
closed if malformed)
3. client optionally does a sasl handshake (if configured)

(3) is where cluster ID is needed (if token is configured as a part of DIGEST 
based auth). 

Now my proposal is to modify it as follows.

1. client to server: socket connect()
2. server responds with a 16 byte UUID that is read by the client (client can 
get the actual SASL mode with this clusterId info and looking up the tokens)
2. client sends a connection preamble (validated on server and connection 
closed if malformed)
3. client optionally does a sasl handshake (using the sasl and token from 
step(2)).

A sample implementation using netty IPC is something like this 
https://github.com/bharathv/hbase/commit/95ff0d65828e8459d212e6173c00648b9c7b6814
 . The patch still needs to do the following,

1. Move providers.selectProvider(clusterId, ticket) .. to after step (2)
2. Implement an equivalent change in Blocking IPC.

One problem I can think of is compatibility between old client and new server 
(say during a rolling upgrade when server is upgraded first), I can see if I 
can get it working by making client ignore this piece of response. 

cc: [~zhangduo] WDYT. You think this breaks anything else too or is there a 
better way?


> DIGEST based auth broken for MasterRegistry
> ---
>
> Key: HBASE-25051
> URL: https://issues.apache.org/jira/browse/HBASE-25051
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client, security
>Affects Versions: 3.0.0-alpha-1, 2.3.0, 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Minor
>
> DIGEST-MD5 based sasl auth depends on cluster-ID to obtain tokens. With 
> master registry, we have a circular dependency here because master registry 
> needs an rpcClient to talk to masters (and to get cluster ID) and rpc-Client 
> needs a clusterId if DIGEST based auth is configured. Earlier, there was a ZK 
> client that has its own authentication mechanism to fetch the cluster ID.
> HBASE-23330, I think doesn't fully fix the problem. It depends on an active 
> connection to fetch delegation tokens for the MR job and that inherently 
> assumes that the active connection does not use a DIGEST auth.
> It is not clear to me how common it is to use DIGEST based auth in 
> connections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25051) DIGEST based auth broken for MasterRegistry

2021-08-13 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17398980#comment-17398980
 ] 

Bharath Vissapragada commented on HBASE-25051:
--

I think one way to fix this problem is to "advertise" the cluster ID as a part 
of channel setup. This is an IPC protocol change. Currently the way connection 
setup works is 

1. client to server: socket connect()
2. client sends a connection preamble (validated on server and connection 
closed if malformed)
3. client optionally does a sasl handshake (if configured)

(3) is where cluster ID is needed (if token is configured as a part of DIGEST 
based auth). 

Now my proposal is to modify it as follows.

1. client to server: socket connect()
2. server responds with a 16 byte UUID that is read by the client (client can 
get the actual SASL mode with this clusterId info and looking up the tokens)
2. client sends a connection preamble (validated on server and connection 
closed if malformed)
3. client optionally does a sasl handshake (using the sasl and token from 
step(2)).

A sample implementation using netty IPC is something like this 
https://github.com/bharathv/hbase/commit/95ff0d65828e8459d212e6173c00648b9c7b6814
 . The patch still needs to do the following,

1. Move providers.selectProvider(clusterId, ticket) .. to after step (2)
2. Implement an equivalent change in Blocking IPC.

One problem I can think of is compatibility between old client and new server 
(say during a rolling upgrade when server is upgraded first), I can see if I 
can get it working by making client ignore this piece of response. 

cc: [~zhangduo] WDYT. You think this breaks anything else too or is there a 
better way?


> DIGEST based auth broken for MasterRegistry
> ---
>
> Key: HBASE-25051
> URL: https://issues.apache.org/jira/browse/HBASE-25051
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client, security
>Affects Versions: 3.0.0-alpha-1, 2.3.0, 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Minor
>
> DIGEST-MD5 based sasl auth depends on cluster-ID to obtain tokens. With 
> master registry, we have a circular dependency here because master registry 
> needs an rpcClient to talk to masters (and to get cluster ID) and rpc-Client 
> needs a clusterId if DIGEST based auth is configured. Earlier, there was a ZK 
> client that has its own authentication mechanism to fetch the cluster ID.
> HBASE-23330, I think doesn't fully fix the problem. It depends on an active 
> connection to fetch delegation tokens for the MR job and that inherently 
> assumes that the active connection does not use a DIGEST auth.
> It is not clear to me how common it is to use DIGEST based auth in 
> connections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26195) Data is present in replicated cluster but not present in primary cluster.

2021-08-12 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26195:
-
Affects Version/s: 2.5.0
   3.0.0-alpha-1

> Data is present in replicated cluster but not present in primary cluster.
> -
>
> Key: HBASE-26195
> URL: https://issues.apache.org/jira/browse/HBASE-26195
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>
> We encountered a case where we are seeing some rows (via Phoenix) in the 
> replicated cluster but they are not present in the source/active cluster.
> Triaging further, we found memstore rollback logs in a few of the region servers.
> {noformat}
> 2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> {noformat}
> Looking more into the logs, we found that there were some HDFS-layer issues syncing 
> the WAL to HDFS.
> It was taking around 6 minutes to sync the WAL. Logs below:
> {noformat}
> 2021-07-28 14:19:30,511 WARN  [sync.0] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391210ms (threshold=3ms). File being written: 
> /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.1] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391148ms (threshold=3ms). File being written: 
> /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.2] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391147ms (threshold=3ms). File being written: 
> /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,591 INFO  [sync.0] wal.FSHLog - Slow sync cost: 391289 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 INFO  [sync.1] wal.FSHLog - Slow sync cost: 391227 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.1] wal.FSHLog - Requesting log roll 
> because we exceeded slow sync threshold; time=391227 ms, threshold=1 ms, 
> current pipeline: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> 

[jira] [Commented] (HBASE-26149) Further improvements on ConnectionRegistry implementations

2021-08-09 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396413#comment-17396413
 ] 

Bharath Vissapragada commented on HBASE-26149:
--

> This must have been in place always?

[~stack] Please see HBASE-25051's description. Earlier we relied on an 
authenticated ZK client to get the cluster ID. Now it's a chicken-and-egg 
problem if we are removing ZK from the picture.

> Further improvements on ConnectionRegistry implementations
> --
>
> Key: HBASE-26149
> URL: https://issues.apache.org/jira/browse/HBASE-26149
> Project: HBase
>  Issue Type: Umbrella
>  Components: Client
>Reporter: Duo Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26158) MiniZooKeeperCluster should not be IA.Public

2021-08-03 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392420#comment-17392420
 ] 

Bharath Vissapragada commented on HBASE-26158:
--

As Geoffrey mentioned, we use this class quite extensively as a shared ZK test 
instance across multiple mini clusters and other services. It is a very handy 
wrapper that is tightly integrated with the mini cluster and handles all the 
boilerplate code for us. We also don't have to worry about Curator version 
compatibility with HBase.
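
A rough illustration of the shared-test-ZK pattern described above. It assumes the MiniZooKeeperCluster API (startup(File) returning the client port) and a hypothetical temp directory, so treat it as a sketch rather than exact usage.

{code:java}
// Sketch of sharing one test ZK quorum across mini clusters and other services.
import java.io.File;
import org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster;

public class SharedTestZkSketch {
  public static void main(String[] args) throws Exception {
    MiniZooKeeperCluster zkCluster = new MiniZooKeeperCluster();
    // Hypothetical data directory; startup() is assumed to return the client port.
    int clientPort = zkCluster.startup(new File("/tmp/shared-zk-test-data"));
    try {
      // Multiple mini clusters and other services can point at the same quorum.
      String quorum = "localhost:" + clientPort;
      System.out.println("Shared test ZK quorum: " + quorum);
    } finally {
      zkCluster.shutdown();
    }
  }
}
{code}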

> MiniZooKeeperCluster should not be IA.Public
> 
>
> Key: HBASE-26158
> URL: https://issues.apache.org/jira/browse/HBASE-26158
> Project: HBase
>  Issue Type: Sub-task
>  Components: API, test
>Reporter: Duo Zhang
>Priority: Major
>
> End users do not need to test HBase when zookeeper is broken. And if users 
> want to start only a zookeeper cluster, they can just use curator-test, so I 
> do not think we should expose this class as IA.Public.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics

2021-08-02 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25469:
-
Affects Version/s: 2.3.7
   2.4.6
   3.0.0-alpha-1

> Create RIT servlet in HMaster to track more detailed RIT info not captured in 
> metrics
> -
>
> Key: HBASE-25469
> URL: https://issues.apache.org/jira/browse/HBASE-25469
> Project: HBase
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 3.0.0-alpha-1, 2.4.6, 2.3.7
>Reporter: Caroline Zhou
>Assignee: Caroline Zhou
>Priority: Minor
>  Labels: observability
> Attachments: Screen Shot 2021-07-27 at 10.34.45.png, Screen Shot 
> 2021-07-27 at 10.34.53.png
>
>
> In HBase 2.1+, there is a RIT jsp page that was added as part of HBASE-21410.
> There are some additional RIT details that would be helpful to have in one 
> place:
>  * RIT Start Time
>  * RIT Duration (ms)
>  * Server
>  * Procedure Type
> This info can be added to the table under the {{/rit.jsp}} page, and we can 
> also add a button on that page to view info as JSON, for easy parsing into 
> metrics, etc. This JSON dump can be served as a servlet.
> We may also consider different ways of grouping the JSON results, such as by 
> state, table, or server name, and/or adding counts of RIT by state or server 
> name.
> !Screen Shot 2021-07-27 at 10.34.45.png!
> !Screen Shot 2021-07-27 at 10.34.53.png!
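
For illustration only, a sketch of the kind of JSON servlet described above. The field names, the RitRecord type, and the RitSource supplier are hypothetical; this is not the servlet HBase actually ships.

{code:java}
import java.io.IOException;
import java.util.List;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RitJsonServletSketch extends HttpServlet {

  /** Hypothetical per-region record carrying the extra details listed above. */
  public static class RitRecord {
    String region;
    long ritStartTime;
    long ritDurationMs;
    String server;
    String procedureType;

    String toJson() {
      return String.format(
        "{\"region\":\"%s\",\"ritStartTime\":%d,\"ritDurationMs\":%d,\"server\":\"%s\",\"procedureType\":\"%s\"}",
        region, ritStartTime, ritDurationMs, server, procedureType);
    }
  }

  /** Hypothetical supplier, e.g. backed by the master's assignment state. */
  public interface RitSource {
    List<RitRecord> regionsInTransition();
  }

  private final RitSource source;

  public RitJsonServletSketch(RitSource source) {
    this.source = source;
  }

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    resp.setContentType("application/json");
    // Build a JSON array of RIT records for easy parsing into metrics.
    StringBuilder sb = new StringBuilder("[");
    List<RitRecord> rits = source.regionsInTransition();
    for (int i = 0; i < rits.size(); i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(rits.get(i).toJson());
    }
    sb.append(']');
    resp.getWriter().write(sb.toString());
  }
}
{code}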



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics

2021-08-02 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25469:
-
Labels: observability  (was: )

> Create RIT servlet in HMaster to track more detailed RIT info not captured in 
> metrics
> -
>
> Key: HBASE-25469
> URL: https://issues.apache.org/jira/browse/HBASE-25469
> Project: HBase
>  Issue Type: Improvement
>  Components: master
>Reporter: Caroline Zhou
>Assignee: Caroline Zhou
>Priority: Minor
>  Labels: observability
> Attachments: Screen Shot 2021-07-27 at 10.34.45.png, Screen Shot 
> 2021-07-27 at 10.34.53.png
>
>
> In HBase 2.1+, there is a RIT jsp page that was added as part of HBASE-21410.
> There are some additional RIT details that would be helpful to have in one 
> place:
>  * RIT Start Time
>  * RIT Duration (ms)
>  * Server
>  * Procedure Type
> This info can be added to the table under the {{/rit.jsp}} page, and we can 
> also add a button on that page to view info as JSON, for easy parsing into 
> metrics, etc. This JSON dump can be served as a servlet.
> We may also consider different ways of grouping the JSON results, such as by 
> state, table, or server name, and/or adding counts of RIT by state or server 
> name.
> !Screen Shot 2021-07-27 at 10.34.45.png!
> !Screen Shot 2021-07-27 at 10.34.53.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics

2021-08-02 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25469:
-
Component/s: master

> Create RIT servlet in HMaster to track more detailed RIT info not captured in 
> metrics
> -
>
> Key: HBASE-25469
> URL: https://issues.apache.org/jira/browse/HBASE-25469
> Project: HBase
>  Issue Type: Improvement
>  Components: master
>Reporter: Caroline Zhou
>Assignee: Caroline Zhou
>Priority: Minor
> Attachments: Screen Shot 2021-07-27 at 10.34.45.png, Screen Shot 
> 2021-07-27 at 10.34.53.png
>
>
> In HBase 2.1+, there is a RIT jsp page that was added as part of HBASE-21410.
> There are some additional RIT details that would be helpful to have in one 
> place:
>  * RIT Start Time
>  * RIT Duration (ms)
>  * Server
>  * Procedure Type
> This info can be added to the table under the {{/rit.jsp}} page, and we can 
> also add a button on that page to view info as JSON, for easy parsing into 
> metrics, etc. This JSON dump can be served as a servlet.
> We may also consider different ways of grouping the JSON results, such as by 
> state, table, or server name, and/or adding counts of RIT by state or server 
> name.
> !Screen Shot 2021-07-27 at 10.34.45.png!
> !Screen Shot 2021-07-27 at 10.34.53.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26109) Wrap up 1.7.1 release

2021-07-22 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385650#comment-17385650
 ] 

Bharath Vissapragada commented on HBASE-26109:
--

1.7.1 links are up; waiting for 1.7.0 to be deleted from all the mirrors. Then I 
can commit HBASE-26071 and send out a notice.

> Wrap up 1.7.1 release
> -
>
> Key: HBASE-26109
> URL: https://issues.apache.org/jira/browse/HBASE-26109
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.7.1
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: release
> Fix For: 1.7.1
>
>
> - Set branch-1 version to 1.7.2-SNAPSHOT
> - Update SVN release bits, add download links to 1.7.1
> - Remove 1.7.0 from the mirrors
> - Commit HBASE-26071
> in that order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26110) Add download links for 1.7.1

2021-07-22 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26110:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks for the reviews.

> Add download links for 1.7.1
> 
>
> Key: HBASE-26110
> URL: https://issues.apache.org/jira/browse/HBASE-26110
> Project: HBase
>  Issue Type: Sub-task
>  Components: website
>Affects Versions: 1.7.1
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26110) Add download links for 1.7.1

2021-07-21 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26110:
-
Status: Patch Available  (was: Open)

> Add download links for 1.7.1
> 
>
> Key: HBASE-26110
> URL: https://issues.apache.org/jira/browse/HBASE-26110
> Project: HBase
>  Issue Type: Sub-task
>  Components: website
>Affects Versions: 1.7.1
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26110) Add download links for 1.7.1

2021-07-21 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-26110:


 Summary: Add download links for 1.7.1
 Key: HBASE-26110
 URL: https://issues.apache.org/jira/browse/HBASE-26110
 Project: HBase
  Issue Type: Sub-task
  Components: website
Affects Versions: 1.7.1
Reporter: Bharath Vissapragada
Assignee: Bharath Vissapragada
 Fix For: 1.7.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26109) Wrap up 1.7.1 release

2021-07-21 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385117#comment-17385117
 ] 

Bharath Vissapragada commented on HBASE-26109:
--

Waiting for the mirrors to catch up so that I can make the doc change.

> Wrap up 1.7.1 release
> -
>
> Key: HBASE-26109
> URL: https://issues.apache.org/jira/browse/HBASE-26109
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.7.1
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: release
> Fix For: 1.7.1
>
>
> - Set branch-1 version to 1.7.2-SNAPSHOT
> - Update SVN release bits, add download links to 1.7.1
> - Remove 1.7.0 from the mirrors
> - Commit HBASE-26071
> in that order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26109) Wrap up 1.7.1 release

2021-07-21 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385089#comment-17385089
 ] 

Bharath Vissapragada commented on HBASE-26109:
--

Looks like I don't have access to the SVN repo to move content around. Working 
with Andrew offline to move the RC bits to the release location.

{noformat}
svn mv https://dist.apache.org/repos/dist/dev/hbase/1.7.1RC0 
https://dist.apache.org/repos/dist/release/hbase/1.7.1 -m"Release Apache HBase 
1.7.1"
svn: E175013: Access to '/repos/dist/!svn/rvr/48933/dev/hbase/1.7.1RC0' 
forbidden
{noformat}

> Wrap up 1.7.1 release
> -
>
> Key: HBASE-26109
> URL: https://issues.apache.org/jira/browse/HBASE-26109
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.7.1
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: release
> Fix For: 1.7.1
>
>
> - Set branch-1 version to 1.7.2-SNAPSHOT
> - Update SVN release bits, add download links to 1.7.1
> - Remove 1.7.0 from the mirrors
> - Commit HBASE-26071
> in that order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26109) Wrap up 1.7.1 release

2021-07-21 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26109:
-
Affects Version/s: 1.7.1

> Wrap up 1.7.1 release
> -
>
> Key: HBASE-26109
> URL: https://issues.apache.org/jira/browse/HBASE-26109
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.7.1
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: release
> Fix For: 1.7.1
>
>
> - Set branch-1 version to 1.7.2-SNAPSHOT
> - Update SVN release bits, add download links to 1.7.1
> - Remove 1.7.0 from the mirrors
> - Commit HBASE-26071
> in that order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26109) Wrap up 1.7.1 release

2021-07-21 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26109:
-
Labels: release  (was: )

> Wrap up 1.7.1 release
> -
>
> Key: HBASE-26109
> URL: https://issues.apache.org/jira/browse/HBASE-26109
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: release
> Fix For: 1.7.1
>
>
> - Set branch-1 version to 1.7.2-SNAPSHOT
> - Update SVN release bits, add download links to 1.7.1
> - Remove 1.7.0 from the mirrors
> - Commit HBASE-26071
> in that order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26109) Wrap up 1.7.1 release

2021-07-21 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-26109:


 Summary: Wrap up 1.7.1 release
 Key: HBASE-26109
 URL: https://issues.apache.org/jira/browse/HBASE-26109
 Project: HBase
  Issue Type: Task
  Components: documentation
Reporter: Bharath Vissapragada
Assignee: Bharath Vissapragada
 Fix For: 1.7.1


- Set branch-1 version to 1.7.2-SNAPSHOT
- Update SVN release bits, add download links to 1.7.1
- Remove 1.7.0 from the mirrors
- Commit HBASE-26071

in that order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-15 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada resolved HBASE-26075.
--
Fix Version/s: 1.7.1
   Resolution: Fixed

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
> Fix For: 1.7.1
>
>
> Recently we encountered a case where the size of the log queue was increasing to 
> around 300 in a few region servers in our production environment.
> There were 295 wals in the oldWALs directory for that region server and the 
> *first file* was a 0 length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
> replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
>  java.io.EOFException: hdfs:///hbase/oldWALs/ not a 
> SequenceFile
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: 
> hdfs:///hbase/oldWALs/ not a SequenceFile
> at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
> ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but the zero-length file was in 
> recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Couldn't get file length 
> information about log 
> hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] 
> regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ 
> was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is a special logic in ReplicationSourceWALReader thread to handle 0 
> length files but we only look in WALs directory and not in oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
> PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
> // Dump the log even if logQueue size is 1 if the source is from 
> recovered Source
> // since we don't add current log to recovered source queue so it is safe 
> to remove.
> if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>   (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {
>   Path head = queue.peek();
>   

[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-15 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381772#comment-17381772
 ] 

Bharath Vissapragada commented on HBASE-26075:
--

[~shahrs87] Can you please submit the PRs for master/branch-2 as a part of the 
child task? Marking this resolved for 1.7.1, as the release work is in progress.

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to 
> around 300 in a few region servers in our production environment.
> There were 295 wals in the oldWALs directory for that region server and the 
> *first file* was a 0 length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
> replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
>  java.io.EOFException: hdfs:///hbase/oldWALs/ not a 
> SequenceFile
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: 
> hdfs:///hbase/oldWALs/ not a SequenceFile
> at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
> ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but the zero-length file was in 
> recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Couldn't get file length 
> information about log 
> hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] 
> regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ 
> was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is a special logic in ReplicationSourceWALReader thread to handle 0 
> length files but we only look in WALs directory and not in oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
> PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
> // Dump the log even if logQueue size is 1 if the source is from 
> recovered Source
> // since we don't add current log to recovered source queue so it is safe 
> to remove.
> if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&

[jira] [Created] (HBASE-26093) Replication is stuck due to zero length wal file in oldWALs directory [master/branch-2]

2021-07-15 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-26093:


 Summary: Replication is stuck due to zero length wal file in 
oldWALs directory [master/branch-2]
 Key: HBASE-26093
 URL: https://issues.apache.org/jira/browse/HBASE-26093
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 3.0.0-alpha-2, 2.4.5
Reporter: Bharath Vissapragada
Assignee: Rushabh Shah


Clone of the parent issue for branch-2/master. Resolving the parent as the 
issue is fixed in 1.7.1 and the jira needs to be in a resolved state for 
release notes curation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26083) L1 miss metric is incorrect when using CombinedBlockCache

2021-07-15 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381766#comment-17381766
 ] 

Bharath Vissapragada commented on HBASE-26083:
--

I just tagged RC0 moments ago and triggered the release build, which is running 
as I type this. Let's include it in the next release?

> L1 miss metric is incorrect when using CombinedBlockCache
> -
>
> Key: HBASE-26083
> URL: https://issues.apache.org/jira/browse/HBASE-26083
> Project: HBase
>  Issue Type: Bug
>  Components: BlockCache
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Minor
> Fix For: 2.3.6, 3.0.0-alpha-2, 2.4.5
>
>
> In CombinedBlockCache getBlock method
> {code:java}
> public Cacheable getBlock(BlockCacheKey cacheKey, boolean caching,
>   boolean repeat, boolean updateCacheMetrics) {
> // TODO: is there a hole here, or just awkwardness since in the lruCache 
> getBlock
> // we end up calling l2Cache.getBlock.
> // We are not in a position to exactly look at LRU cache or BC as 
> BlockType may not be getting
> // passed always.
> return l1Cache.containsBlock(cacheKey)?
> l1Cache.getBlock(cacheKey, caching, repeat, updateCacheMetrics):
> l2Cache.getBlock(cacheKey, caching, repeat, updateCacheMetrics);
>   }
> {code}
> If L1 does not contain the cacheKey, getBlock is never called on L1, so the L1 
> miss metric is always 0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-15 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381451#comment-17381451
 ] 

Bharath Vissapragada commented on HBASE-26042:
--

Ya, scratch that. Now that I look closely, all the flushes are single-threaded. 
I think a concurrent flush is not possible in this specific thread stack, at least 
because this is done _before_ swapping the WAL writer reference (the consume path 
can be ignored). I'm still staring at the code (from the perspective of a 
single flush not being ack-ed + a channel-close scenario, and a race between 
completed on one channel and failed on another) but don't have any theory yet. 

[~stack] It's probably easy to prove/disprove this theory/direction if you have 
a heap dump from that state? Poke around the callback's channel ID vs the closed 
channel, and the AsyncFSOutput instance state of the new WAL?


> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing the same exception 
> in the above test (on Linux, because epoll is needed to repro), but all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is 
> stuck without a timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 

[jira] [Commented] (HBASE-25823) TestSlowLogAccessor.testHigherSlowLogs repeatable failure

2021-07-14 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380762#comment-17380762
 ] 

Bharath Vissapragada commented on HBASE-25823:
--

Still failing on branch-1 (2 of last 3 runs).

https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/147/
https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148/

Briefly looked at it; I think the time-based wait should be redone as something 
more deterministic, based on when the records are inserted into that table.
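
A rough sketch of that kind of deterministic wait: poll until the expected number of slow-log records is visible instead of sleeping for a fixed interval. The record-count supplier is a hypothetical stand-in for scanning the slow-log system table, and the Waiter/HBaseTestingUtility wait API is assumed.

{code:java}
import java.util.function.IntSupplier;
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.Waiter;

public class DeterministicSlowLogWait {
  static void waitForSlowLogRecords(HBaseTestingUtility util, int expected,
      IntSupplier currentCount) throws Exception {
    // Re-evaluates every 250 ms for up to 30 s; fails only if the records really
    // never show up, instead of racing a hard-coded sleep.
    util.waitFor(30_000, 250, new Waiter.Predicate<Exception>() {
      @Override
      public boolean evaluate() {
        return currentCount.getAsInt() >= expected;
      }
    });
  }
}
{code}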

> TestSlowLogAccessor.testHigherSlowLogs repeatable failure
> -
>
> Key: HBASE-25823
> URL: https://issues.apache.org/jira/browse/HBASE-25823
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.4.5
>
>
> {noformat}
>  [ERROR] TestSlowLogAccessor.testHigherSlowLogs:211 Waiting timed out after 
> [7,000] msec{noformat}
> Repeatable failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26075:
-
Release Note: Handles 0-length WAL files moved to the oldWALs directory so that 
they do not block the replication queue.

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to 
> around 300 in a few region servers in our production environment.
> There were 295 wals in the oldWALs directory for that region server and the 
> *first file* was a 0 length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
> replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
>  java.io.EOFException: hdfs:///hbase/oldWALs/ not a 
> SequenceFile
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: 
> hdfs:///hbase/oldWALs/ not a SequenceFile
> at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
> ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but the zero-length file was in 
> recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Couldn't get file length 
> information about log 
> hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] 
> regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ 
> was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is a special logic in ReplicationSourceWALReader thread to handle 0 
> length files but we only look in WALs directory and not in oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
> PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
> // Dump the log even if logQueue size is 1 if the source is from 
> recovered Source
> // since we don't add current log to recovered source queue so it is safe 
> to remove.
> if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>   (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {
>  

[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-09 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378324#comment-17378324
 ] 

Bharath Vissapragada commented on HBASE-26075:
--

[~shahrs87] Please submit a patch for master once you are freed up; merged this 
into branch-1 for now. Thanks.

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to 
> around 300 in a few region servers in our production environment.
> There were 295 wals in the oldWALs directory for that region server and the 
> *first file* was a 0 length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
> replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
>  java.io.EOFException: hdfs:///hbase/oldWALs/ not a 
> SequenceFile
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: 
> hdfs:///hbase/oldWALs/ not a SequenceFile
> at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
> ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but the zero-length file was in 
> recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Couldn't get file length 
> information about log 
> hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] 
> regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ 
> was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is a special logic in ReplicationSourceWALReader thread to handle 0 
> length files but we only look in WALs directory and not in oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
> PriorityBlockingQueue queue = logQueue.getQueue(walGroupId);
> // Dump the log even if logQueue size is 1 if the source is from 
> recovered Source
> // since we don't add current log to recovered source queue so it is safe 
> to remove.
> if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>   (source.isRecovered() || queue.size() > 1) 
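    // --- Sketch only, not the actual patch: the same EOF handling, extended to
    // also probe the oldWALs location once the file has been archived. The
    // fields used here (fs, logQueue, currentPosition, LOG, source, walGroupId)
    // and the helper findArchivedLog() are assumptions for illustration.
    private boolean handleEofException(Exception e, WALEntryBatch batch) {
      PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
      boolean eof = e instanceof EOFException || e.getCause() instanceof EOFException;
      // Recovered sources never re-queue the current log, so dropping it is safe.
      if (eof && (source.isRecovered() || queue.size() > 1)) {
        Path current = queue.peek();
        try {
          // Look under WALs first, then fall back to the archived copy in oldWALs.
          Path resolved = fs.exists(current) ? current : findArchivedLog(current);
          if (resolved != null && fs.getFileStatus(resolved).getLen() == 0) {
            LOG.warn("Forcing removal of 0 length log in queue: " + resolved);
            logQueue.remove(walGroupId);
            currentPosition = 0;
            return true;
          }
        } catch (IOException ioe) {
          LOG.warn("Couldn't get file length information about log " + current, ioe);
        }
      }
      return false;
    }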

[jira] [Updated] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26070:
-
Release Note: Table metadata created on a 1.7.0 cluster with the faulty 
serialization from HBASE-26021 is now automatically rewritten in the correct 
serialization format at active master bootstrap. Refer to the HBase upgrade 
documentation for 1.7.x for further details.

> Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
> --
>
> Key: HBASE-26070
> URL: https://issues.apache.org/jira/browse/HBASE-26070
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
>
> As discussed in the parent issue, 1.7.0 introduced an incompatible table 
> serialization change. We need tooling that auto converts these into 1.7.1 
> compatible table descriptors and table states.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26021:
-
Release Note: The incompatible table metadata serialization that was added in 
1.7.0 is undone by this change. The faulty serialization causes upgrade issues 
from pre-1.7.0 versions to 1.7.0/2.x. Release 1.7.0 will be withdrawn from the 
mirrors, and the suggested upgrade version is 1.7.1+.

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: compatibility, incompatibility, serialization, upgrade
> Fix For: 1.7.1
>
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up an HBase cluster using branch-1 and upgrade to 
> branch-2.4, we face an issue while parsing the namespace from the HDFS file info. 
> Instead of "*hbase:meta*" and "*hbase:namespace*", parsing with ProtobufUtil 
> seems to produce "*\n hbase meta*" and "*\n hbase namespace*".
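Those stray characters are consistent with the raw protobuf wire encoding of a 
TableName message (namespace plus qualifier) ending up where a plain namespace 
string is expected. A tiny self-contained illustration (the class name below is 
made up for the demo; this is not HBase code):

{code:java}
// Illustration only: the protobuf wire bytes of a TableName-style message with
// namespace "hbase" (field 1) and qualifier "meta" (field 2), printed as text,
// produce the control-character-laden string seen in the log.
public class TableNameWireDemo {
  public static void main(String[] args) {
    byte[] wire = {
        0x0A, 0x05, 'h', 'b', 'a', 's', 'e',  // tag of field 1 (0x0A) + length 5 + "hbase"
        0x12, 0x04, 'm', 'e', 't', 'a'        // tag of field 2 (0x12) + length 4 + "meta"
    };
    // Prints a leading newline followed by "^Ehbase^R^Dmeta", which is exactly
    // what TableName.isLegalNamespaceName rejects in the stack trace below.
    System.out.println(new String(wire, java.nio.charset.StandardCharsets.UTF_8));
  }
}
{code}

The failure as it shows up in the region server log: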
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at 

[jira] [Updated] (HBASE-25130) [branch-1] Masters in-memory serverHoldings map is not cleared during hbck repair

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25130:
-
Release Note:   (was: Fixed on branch-1.)

> [branch-1] Masters in-memory serverHoldings map is not cleared during hbck 
> repair
> -
>
> Key: HBASE-25130
> URL: https://issues.apache.org/jira/browse/HBASE-25130
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 1.7.0
>Reporter: Sandeep Guggilam
>Assignee: Victor Li
>Priority: Major
> Fix For: 1.7.1
>
>
> In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
> the RS followed by the offline RPC on the Master to offline all the overlap 
> regions that would be merged into a new region.
> However, the offline RPC does not remove the region from the serverHoldings map 
> unless the new state is MERGED/SPLIT 
> ([https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719]),
> but the new state in this case is OFFLINE.
> This is actually intended to match the META entries; the entry would be removed 
> later when the region comes online on a different server. However, in our case 
> the region would never come online on a new server, so the region info is never 
> cleared from the map, which is then used by the balancer and SCP for incorrect 
> reassignment.
> We might need to tackle this by removing the entries from the map when hbck 
> actually deletes the META entries for this region, which matches the in-memory 
> map's expectation with the META state.
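A rough paraphrase of the RegionStates logic referenced above (names approximated 
from the description, not copied from the branch-1 source), showing why a plain 
OFFLINE transition leaves the stale entry behind:

{code:java}
// Paraphrased sketch, not verbatim RegionStates code: only MERGED/SPLIT
// transitions clear the per-server map.
void regionOffline(HRegionInfo hri, State newState) {
  synchronized (this) {
    regionsInTransition.remove(hri.getEncodedName());
    ServerName oldServerName = regionAssignments.remove(hri);
    if (oldServerName != null
        && (newState == State.MERGED || newState == State.SPLIT)) {
      removeFromServerHoldings(oldServerName, hri);  // cleared only for these states
    }
    // For a plain OFFLINE (the hbck repair path) the stale serverHoldings entry
    // survives, and since the merged-away region never comes back online, the
    // balancer and SCP keep seeing it.
  }
}
{code}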



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25130) [branch-1] Masters in-memory serverHoldings map is not cleared during hbck repair

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25130:
-
Affects Version/s: 1.7.0

> [branch-1] Masters in-memory serverHoldings map is not cleared during hbck 
> repair
> -
>
> Key: HBASE-25130
> URL: https://issues.apache.org/jira/browse/HBASE-25130
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: Sandeep Guggilam
>Assignee: Victor Li
>Priority: Major
> Fix For: 1.7.1
>
>
> In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
> the RS followed by the offline RPC on the Master to offline all the overlap 
> regions that would be merged into a new region.
> However, the offline RPC does not remove the region from the serverHoldings map 
> unless the new state is MERGED/SPLIT 
> ([https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719]),
> but the new state in this case is OFFLINE.
> This is actually intended to match the META entries; the entry would be removed 
> later when the region comes online on a different server. However, in our case 
> the region would never come online on a new server, so the region info is never 
> cleared from the map, which is then used by the balancer and SCP for incorrect 
> reassignment.
> We might need to tackle this by removing the entries from the map when hbck 
> actually deletes the META entries for this region, which matches the in-memory 
> map's expectation with the META state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25130) [branch-1] Masters in-memory serverHoldings map is not cleared during hbck repair

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25130:
-
Component/s: hbck

> [branch-1] Masters in-memory serverHoldings map is not cleared during hbck 
> repair
> -
>
> Key: HBASE-25130
> URL: https://issues.apache.org/jira/browse/HBASE-25130
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 1.7.0
>Reporter: Sandeep Guggilam
>Assignee: Victor Li
>Priority: Major
> Fix For: 1.7.1
>
>
> In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
> the RS followed by the offline RPC on the Master to offline all the overlap 
> regions that would be merged into a new region.
> However, the offline RPC does not remove the region from the serverHoldings map 
> unless the new state is MERGED/SPLIT 
> ([https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719]),
> but the new state in this case is OFFLINE.
> This is actually intended to match the META entries; the entry would be removed 
> later when the region comes online on a different server. However, in our case 
> the region would never come online on a new server, so the region info is never 
> cleared from the map, which is then used by the balancer and SCP for incorrect 
> reassignment.
> We might need to tackle this by removing the entries from the map when hbck 
> actually deletes the META entries for this region, which matches the in-memory 
> map's expectation with the META state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-21674) Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from thrift1 server) to branch-1

2021-07-09 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378290#comment-17378290
 ] 

Bharath Vissapragada commented on HBASE-21674:
--

Copied the release note from the parent.

> Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from 
> thrift1 server) to branch-1
> 
>
> Key: HBASE-21674
> URL: https://issues.apache.org/jira/browse/HBASE-21674
> Project: HBase
>  Issue Type: Sub-task
>  Components: Thrift
>Reporter: Andrew Kyle Purtell
>Assignee: Yutong Xiao
>Priority: Major
> Fix For: 1.7.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-21674) Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from thrift1 server) to branch-1

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-21674:
-
Release Note: 

Before this issue, the thrift1 server and the thrift2 server were totally 
different servers. If a new feature was added to the thrift1 server, the thrift2 
server had to make the same change to support it (e.g. authorization). After 
this issue, the thrift2 server inherits from thrift1, so the thrift2 server now 
has all the features the thrift1 server has (e.g. HTTP support, which the 
thrift2 server didn't have before). The way to start the thrift1 or thrift2 
server remains the same after this issue.

> Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from 
> thrift1 server) to branch-1
> 
>
> Key: HBASE-21674
> URL: https://issues.apache.org/jira/browse/HBASE-21674
> Project: HBase
>  Issue Type: Sub-task
>  Components: Thrift
>Reporter: Andrew Kyle Purtell
>Assignee: Yutong Xiao
>Priority: Major
> Fix For: 1.7.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25984:
-
Release Note: Fixes a WAL lockup issue caused by premature reuse of sync 
futures by the WAL consumers. The lockup causes the WAL system to hang, 
resulting in blocked appends and syncs and thus preventing the RPC handlers from 
progressing. The only workaround without this fix is to force-abort the region 
server.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with a large write load, we noticed that the region 
> servers hang abruptly with filled-up handler queues and stuck MVCC, 
> indicating that appends/syncs are not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the ring buffer consumer thinks that there are still some outstanding syncs 
> that need to finish.
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, the SyncRunner threads are idle and just waiting for work, 
> implying that there are no pending SyncFutures that need to be run.
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to the word limit in the description).
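As a toy model of the class of race described here (this is not the FSHLog 
SyncFuture implementation, just a minimal illustration of how reusing a sync 
future too early can strand a waiter):

{code:java}
// Minimal model, not HBase code: if a future is recycled for a new txid after
// the producer has already signalled the old txid, the new waiter blocks
// forever because no further done() call will arrive for it.
class ReusableSyncFuture {
  private long txid;
  private long doneTxid = -1;

  synchronized ReusableSyncFuture reset(long newTxid) {  // handler reuses the future
    this.txid = newTxid;
    this.doneTxid = -1;
    return this;
  }

  synchronized void done(long completedTxid) {           // sync runner signals completion
    this.doneTxid = completedTxid;
    notifyAll();
  }

  synchronized long get() throws InterruptedException {  // handler waits for its sync
    while (doneTxid < txid) {
      wait();  // hangs if done() already ran for the previous incarnation
    }
    return doneTxid;
  }
}
{code}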



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1

2021-07-08 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26074:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on 
> branch-1
> --
>
> Key: HBASE-26074
> URL: https://issues.apache.org/jira/browse/HBASE-26074
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: regression
> Fix For: 1.7.1
>
>
> In prep-ing for 1.7.1, I noticed that this test has been consistently failing 
> on branch-1 forever. 
> {noformat}
> Regression
> health checks / yetus jdk8 hadoop2 checks / 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego
> Failing for the past 1 build (Since Failed#145 )
> Took 0.79 sec.
> Error Message
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>  at java.net.SocketInputStream.read(SocketInputStream.java:210)
>  at java.net.SocketInputStream.read(SocketInputStream.java:141)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>  at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>  at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>  at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>  at 
> org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
>  at 
> org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.lang.Thread.run(Thread.java:748)
> Stacktrace
> java.lang.AssertionError: 
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>   at java.net.SocketInputStream.read(SocketInputStream.java:210)
>   at java.net.SocketInputStream.read(SocketInputStream.java:141)
>   at 

[jira] [Updated] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1

2021-07-08 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26074:
-
Status: Patch Available  (was: Open)

> testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on 
> branch-1
> --
>
> Key: HBASE-26074
> URL: https://issues.apache.org/jira/browse/HBASE-26074
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: regression
> Fix For: 1.7.1
>
>
> In prep-ing for 1.7.1, I noticed that this test has been consistently failing 
> on branch-1 forever. 
> {noformat}
> Regression
> health checks / yetus jdk8 hadoop2 checks / 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego
> Failing for the past 1 build (Since Failed#145 )
> Took 0.79 sec.
> Error Message
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>  at java.net.SocketInputStream.read(SocketInputStream.java:210)
>  at java.net.SocketInputStream.read(SocketInputStream.java:141)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>  at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>  at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>  at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>  at 
> org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
>  at 
> org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.lang.Thread.run(Thread.java:748)
> Stacktrace
> java.lang.AssertionError: 
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>   at java.net.SocketInputStream.read(SocketInputStream.java:210)
>   at java.net.SocketInputStream.read(SocketInputStream.java:141)
>   at 

[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377533#comment-17377533
 ] 

Bharath Vissapragada commented on HBASE-26075:
--

Hoping to get an RC out before the end of this week (unless there are any 
unforeseen delays in the process). Any chance this can be committed before 
then? I can help review the patch.

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing 
> to around 300 on a few region servers in our production environment.
> There were 295 WALs in the oldWALs directory for that region server, and the 
> *first file* was a zero-length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
> replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
>  java.io.EOFException: hdfs:///hbase/oldWALs/ not a 
> SequenceFile
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: 
> hdfs:///hbase/oldWALs/ not a SequenceFile
> at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
> ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but the zero-length file was in 
> recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Couldn't get file length 
> information about log 
> hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] 
> regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ 
> was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is special logic in the ReplicationSourceWALReader thread to handle 
> zero-length files, but it only looks in the WALs directory and not in the oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
> PriorityBlockingQueue queue = logQueue.getQueue(walGroupId);
> // Dump the log even if logQueue size is 1 if the source is from 
> recovered Source
> // since we don't add current log to recovered source queue so it is safe 
> to remove.
> if ((e instanceof EOFException || e.getCause() 

[jira] [Commented] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1

2021-07-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377523#comment-17377523
 ] 

Bharath Vissapragada commented on HBASE-26074:
--

I figured out the issue: the difference appears to be in the SSL libraries 
shipped by OpenJDK (or the Oracle JDKs commonly used by developers) and the 
Azul JDKs used in the Yetus test runs. If we look carefully at the bottom of the 
stack on the server side doing the SSL handshake, we can see the difference.

OpenJDK's SSL library has special handling that recognizes the incoming data as 
a plaintext HTTP connection.

{noformat}
2021-07-08 10:25:33,541 WARN  [984520532@qtp-1977083615-0] log.Slf4jLog(89): 
EXCEPTION 
javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
at 
sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:710)
at sun.security.ssl.InputRecord.read(InputRecord.java:527)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at 
org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2021-07-08 10:25:33,542 WARN  [1227853087@qtp-1977083615-2] log.Slf4jLog(89): 
EXCEPTION 
javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
at 
sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:710)
at sun.security.ssl.InputRecord.read(InputRecord.java:527)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at 
org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2021-07-08 10:25:33,543 INFO  [Time-limited test] log.Slf4jLog(67): Stopped 
SslSocketConnectorSecure@localhost:0
{noformat}

The Azul JDK, on the other hand, uses a different version of the SSL library 
that treats it as a generic SSL failure and resets the connection.

{noformat}
javax.net.ssl.SSLException: Unsupported or unrecognized SSL message
at 
sun.security.ssl.SSLSocketInputRecord.handleUnknownRecord(SSLSocketInputRecord.java:440)
at 
sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:175)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1197)
at 
sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1106)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:398)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:370)
at 
org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2021-07-08 13:56:33,322 WARN  [189676137@qtp-1691391859-0] log.Slf4jLog(89): 
EXCEPTION 
javax.net.ssl.SSLException: Unsupported or unrecognized SSL message
at 
sun.security.ssl.SSLSocketInputRecord.handleUnknownRecord(SSLSocketInputRecord.java:440)
at 
sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:175)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1197)
at 
sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1106)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:398)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:370)
at 
org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2021-07-08 13:56:33,323 INFO  [Time-limited test] log.Slf4jLog(67): Stopped 
SslSocketConnectorSecure@localhost:0
{noformat}

I downloaded the Azul JDK from 
https://www.azul.com/downloads/?version=java-8-lts=macos=x86-64-bit=jdk=true
 (the tests seem to use `zulu8.48.0.51-ca-jdk8.0.26`) and I can now reproduce the 
same error on my MacBook. Let me create a patch to generically assert the 
error.
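Roughly what "generically assert the error" could look like (a sketch only, not 
the committed patch; sendPlainHttpToHttpsPort() is a made-up stand-in for the 
existing test call that probes the HTTPS endpoint with a plain HTTP client):

{code:java}
// Sketch: accept either way a JDK's SSL stack reports a plaintext HTTP request
// hitting an HTTPS port -- OpenJDK surfaces an EOF, Azul Zulu resets the socket.
try {
  sendPlainHttpToHttpsPort();  // hypothetical stand-in for the existing call
  fail("A plain HTTP request against the HTTPS port should not succeed");
} catch (IOException e) {
  String msg = String.valueOf(e.getMessage());
  assertTrue("Unexpected exception: " + e,
      msg.contains("Unexpected end of file from server")  // OpenJDK / Oracle JDK
          || msg.contains("Connection reset"));           // Azul Zulu
}
{code}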

[~reidchan] FYI (long pending branch-1 unit test issue).

> testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on 
> branch-1
> 

[jira] [Commented] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1

2021-07-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377499#comment-17377499
 ] 

Bharath Vissapragada commented on HBASE-26074:
--

Runs OK locally (on a Mac); looks like the issue only shows up on CI.

> testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on 
> branch-1
> --
>
> Key: HBASE-26074
> URL: https://issues.apache.org/jira/browse/HBASE-26074
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: regression
> Fix For: 1.7.1
>
>
> In prep-ing for 1.7.1, I noticed that this test has been consistently failing 
> on branch-1 forever. 
> {noformat}
> Regression
> health checks / yetus jdk8 hadoop2 checks / 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego
> Failing for the past 1 build (Since Failed#145 )
> Took 0.79 sec.
> Error Message
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>  at java.net.SocketInputStream.read(SocketInputStream.java:210)
>  at java.net.SocketInputStream.read(SocketInputStream.java:141)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>  at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>  at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>  at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>  at 
> org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
>  at 
> org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.lang.Thread.run(Thread.java:748)
> Stacktrace
> java.lang.AssertionError: 
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>   at java.net.SocketInputStream.read(SocketInputStream.java:210)
>   at 

[jira] [Updated] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1

2021-07-08 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26074:
-
Labels: regression  (was: )

> testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on 
> branch-1
> --
>
> Key: HBASE-26074
> URL: https://issues.apache.org/jira/browse/HBASE-26074
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: regression
> Fix For: 1.7.1
>
>
> In prep-ing for 1.7.1, I noticed that this test has been consistently failing 
> on branch-1 forever. 
> {noformat}
> Regression
> health checks / yetus jdk8 hadoop2 checks / 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego
> Failing for the past 1 build (Since Failed#145 )
> Took 0.79 sec.
> Error Message
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>  at java.net.SocketInputStream.read(SocketInputStream.java:210)
>  at java.net.SocketInputStream.read(SocketInputStream.java:141)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>  at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>  at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>  at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>  at 
> org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
>  at 
> org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.lang.Thread.run(Thread.java:748)
> Stacktrace
> java.lang.AssertionError: 
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>   at java.net.SocketInputStream.read(SocketInputStream.java:210)
>   at java.net.SocketInputStream.read(SocketInputStream.java:141)
>   at 

[jira] [Commented] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1

2021-07-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377491#comment-17377491
 ] 

Bharath Vissapragada commented on HBASE-26074:
--

Sample jenkins job (that will expire soon anyway): 
https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/145/

> testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on 
> branch-1
> --
>
> Key: HBASE-26074
> URL: https://issues.apache.org/jira/browse/HBASE-26074
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
>
> In prep-ing for 1.7.1, I noticed that this test has been consistently failing 
> on branch-1 forever. 
> {noformat}
> Regression
> health checks / yetus jdk8 hadoop2 checks / 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego
> Failing for the past 1 build (Since Failed#145 )
> Took 0.79 sec.
> Error Message
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>  at java.net.SocketInputStream.read(SocketInputStream.java:210)
>  at java.net.SocketInputStream.read(SocketInputStream.java:141)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>  at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>  at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>  at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
>  at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
>  at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>  at 
> org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
>  at 
> org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227)
>  at 
> org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123)
>  at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325)
>  at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277)
>  at 
> org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.lang.Thread.run(Thread.java:748)
> Stacktrace
> java.lang.AssertionError: 
> Expected to find 'Unexpected end of file from server' but got unexpected 
> exception:java.net.SocketException: Connection reset
>   at java.net.SocketInputStream.read(SocketInputStream.java:210)
>   at 

[jira] [Created] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1

2021-07-08 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-26074:


 Summary: testLogLevelByHttps/testLogLevelByHttpsWithSpnego 
consistently failing on branch-1
 Key: HBASE-26074
 URL: https://issues.apache.org/jira/browse/HBASE-26074
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 1.7.0
Reporter: Bharath Vissapragada
Assignee: Bharath Vissapragada
 Fix For: 1.7.1


In prep-ing for 1.7.1, I noticed that this test has been consistently failing 
on branch-1 forever. 

{noformat}
Regression
health checks / yetus jdk8 hadoop2 checks / 
org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego

Failing for the past 1 build (Since Failed#145 )
Took 0.79 sec.
Error Message
Expected to find 'Unexpected end of file from server' but got unexpected 
exception:java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(SocketInputStream.java:210)
 at java.net.SocketInputStream.read(SocketInputStream.java:141)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
 at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
 at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
 at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
 at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
 at 
org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
 at 
org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347)
 at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268)
 at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284)
 at org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227)
 at 
org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123)
 at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107)
 at 
org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349)
 at 
org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68)
 at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325)
 at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
 at 
org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322)
 at 
org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277)
 at 
org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
 at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.lang.Thread.run(Thread.java:748)
Stacktrace
java.lang.AssertionError: 
Expected to find 'Unexpected end of file from server' but got unexpected 
exception:java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702)
at 

[jira] [Updated] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables

2021-07-07 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26070:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks [~reidchan].

> Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
> --
>
> Key: HBASE-26070
> URL: https://issues.apache.org/jira/browse/HBASE-26070
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
>
> As discussed in the parent issue, 1.7.0 introduced an incompatible table 
> serialization change. We need tooling that auto converts these into 1.7.1 
> compatible table descriptors and table states.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376818#comment-17376818
 ] 

Bharath Vissapragada commented on HBASE-26042:
--

bq.  I think there is some racy code in FanOutOneBlockAsyncDFSOutput especially 
around when a new writer is marked 'BROKEN' and a flush is called resulting in 
some waitingAckQueue members not being cleaned up correctly. Just a theory at 
this point, but probably easy to poke around with a heap dump or some related 
logging around these code paths.

[~stack] Attached a unit test repro patch (which includes comments that explain 
the race). I followed the theory of broken callback cleanup in 
FanOutOneBlockAsyncDFSOutput that I alluded to above. With the patch, you can 
run TestFanOutOneBlockAsyncDFSOutput#testRecover under a debugger and set a 
breakpoint at latch#await() to inspect the state. WDYT?
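
To make the suspected race concrete, below is a minimal, self-contained sketch. It 
is not HBase code: the class name, the waitingAckQueue field, and the methods are 
stand-ins, and the real FanOutOneBlockAsyncDFSOutput logic is more involved. It only 
illustrates how a sync future enqueued concurrently with the "mark BROKEN and fail 
everything queued" step can end up never completed, leaving its caller parked inside 
CompletableFuture.get() (a bounded get(timeout) is used here only so the demo 
terminates).

{code:java}
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical names; only the shape of the suspected race is meaningful.
public class LostAckSketch {

  // Stand-in for the queue of pending sync/flush futures (cf. waitingAckQueue).
  private static final Queue<CompletableFuture<Void>> waitingAckQueue =
      new ConcurrentLinkedQueue<>();

  // Flush side: enqueue a future and wait for its ack.
  static void flushAndWait(long timeoutMillis) throws InterruptedException {
    CompletableFuture<Void> ack = new CompletableFuture<>();
    // Race window: the error path may already have drained the queue by now.
    waitingAckQueue.add(ack);
    try {
      ack.get(timeoutMillis, TimeUnit.MILLISECONDS); // production code waits with no timeout
      System.out.println("ack received");
    } catch (TimeoutException e) {
      System.out.println("stuck: the ack future was never completed");
    } catch (ExecutionException e) {
      System.out.println("failed fast: " + e.getCause());
    }
  }

  // Error side: mark the stream broken and fail everything *currently* queued.
  static void markBrokenAndFailAll() {
    CompletableFuture<Void> pending;
    while ((pending = waitingAckQueue.poll()) != null) {
      pending.completeExceptionally(new java.io.IOException("stream broken"));
    }
    // Anything enqueued after this point is silently orphaned.
  }

  public static void main(String[] args) throws InterruptedException {
    markBrokenAndFailAll(); // the error path wins the race ...
    flushAndWait(500);      // ... so this flush's future is orphaned and the wait times out
  }
}
{code}

If the real code loses a callback this way, it would match the attached jstacks, 
where threads sit parked in CompletableFuture.get() with nothing left to complete 
the future.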

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, 

[jira] [Comment Edited] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376818#comment-17376818
 ] 

Bharath Vissapragada edited comment on HBASE-26042 at 7/7/21, 8:21 PM:
---

bq.  I think there is some racy code in FanOutOneBlockAsyncDFSOutput especially 
around when a new writer is marked 'BROKEN' and a flush is called resulting in 
some waitingAckQueue members not being cleaned up correctly. Just a theory at 
this point, but probably easy to poke around with a heap dump or some related 
logging around these code paths.

[~stack] Attached a unit test repro patch (on branch-2.3) that includes comments 
explaining the race. I followed the theory of broken callback cleanup in 
FanOutOneBlockAsyncDFSOutput that I alluded to above. With the patch, you can 
run TestFanOutOneBlockAsyncDFSOutput#testRecover under a debugger and set a 
breakpoint at latch#await() to inspect the state. WDYT?


was (Author: bharathv):
bq.  I think there is some racy code in FanOutOneBlockAsyncDFSOutput especially 
around when a new writer is marked 'BROKEN' and a flush is called resulting in 
some waitingAckQueue members not being cleaned up correctly. Just a theory at 
this point, but probably easy to poke around with a heap dump or some related 
logging around these code paths.

[~stack] Attached a unit test repro patch (which includes comments that explain 
the race). I followed the theory of broken cb cleanup in 
FanOutOneBlockAsyncDFSOutput that I alluded to above. With the patch, you can 
run TestFanOutOneBlockAsyncDFSOutput#testRecover under a debugger and set break 
point at latch#await() to inspect the state. WDYT.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  

[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26042:
-
Attachment: HBASE-26042-test-repro.patch

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc., 
> unable to add to the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 
> (Compiled frame)
>  

[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-07 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376713#comment-17376713
 ] 

Bharath Vissapragada commented on HBASE-25761:
--

No worries, looking forward to the meeting notes. 

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat but then Francis noticed on a re-read of the BigTable 
> paper, that this is how they describe they do 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-07 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376688#comment-17376688
 ] 

Bharath Vissapragada commented on HBASE-25761:
--

[~stack] No sir, my appointment is 3pm-5pm Pacific, hence was suggesting 6pm. I 
can either do today 6pm (which doesn't work for Duo) or any time tomorrow 
(Thursday evening pacific / Friday morning China). 

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat but then Francis noticed on a re-read of the BigTable 
> paper, that this is how they describe they do 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-07 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376663#comment-17376663
 ] 

Bharath Vissapragada commented on HBASE-25761:
--

bq. Shout if it works for you and I'll update the dev list. Thanks.

Unlikely that I'll be home before 6pm with the traffic. I'll join once I'm back 
but please continue with the original schedule. I'll catch up using the meeting 
notes. Looking forward to the progress in this work and happy to contribute to 
it.

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat but then Francis noticed on a re-read of the BigTable 
> paper, that this is how they describe they do 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-06 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376138#comment-17376138
 ] 

Bharath Vissapragada commented on HBASE-25761:
--

If 6pm Pacific (9am China) works for everyone, prefer that. If not please 
continue with the original schedule, I'll join when I'm back. Thanks.

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat but then Francis noticed on a re-read of the BigTable 
> paper, that this is how they describe they do 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-06 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376121#comment-17376121
 ] 

Bharath Vissapragada commented on HBASE-25761:
--

[~stack] I think the at-mention was for me (looks like an incorrect tag, sorry 
if it's not). Either way, I have an appointment until 5pm Pacific and then I 
need to drive back, so I may be able to join at 6pm if folks are still around. 
I can catch up later via some meeting notes/summary; I don't want the meeting 
to be moved because of me. Thanks (looking forward to the meeting outcome _smile_).

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat but then Francis noticed on a re-read of the BigTable 
> paper, that this is how they describe they do 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-07-06 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376119#comment-17376119
 ] 

Bharath Vissapragada commented on HBASE-26021:
--

[PR|https://github.com/apache/hbase/pull/3463] for 1.7.0 table fixups, any 
feedback appreciated.

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: compatibility, incompatibility, serialization, upgrade
> Fix For: 1.7.1
>
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. 
> Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil 
> seems to be producing "*\n hbase meta*" and "*\n hbase namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> 

[jira] [Updated] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables

2021-07-06 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26070:
-
Status: Patch Available  (was: Open)

> Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
> --
>
> Key: HBASE-26070
> URL: https://issues.apache.org/jira/browse/HBASE-26070
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 1.7.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
>
> As discussed in the parent issue, 1.7.0 introduced an incompatible table 
> serialization change. We need tooling that auto converts these into 1.7.1 
> compatible table descriptors and table states.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-07-06 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375827#comment-17375827
 ] 

Bharath Vissapragada commented on HBASE-26021:
--

Created subtasks for the documentation and serialization fixups. I think it is 
unlikely that anyone upgraded to 1.7.0 with existing tables, because they'd have 
run into this problem immediately. However, I agree that there could be users who 
installed 1.7.0 from scratch or upgraded an empty cluster to 1.7.0 and then created 
tables with the faulty serialization. Let me see if there is an easy way to fix 
them up at runtime (with some caveats).
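
As a side note on why the logged names look like ^Ehbase^R^Dmeta with an "Illegal 
character at 0": those bytes are exactly what a protobuf length-delimited two-field 
message looks like when it is treated as a raw string. The sketch below hand-rolls 
that encoding so it needs no HBase or protobuf dependency; the assumption that the 
namespace is field 1 and the qualifier is field 2 is inferred from the logged bytes, 
not taken from the .proto file.

{code:java}
import java.nio.charset.StandardCharsets;

// Illustration only: hand-rolled protobuf-style encoding of a two-field message
// (field 1 = namespace, field 2 = qualifier), both length-delimited.
public class TableNameBytesSketch {

  static byte[] encode(String namespace, String qualifier) {
    byte[] ns = namespace.getBytes(StandardCharsets.UTF_8);
    byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[2 + ns.length + 2 + q.length];
    int i = 0;
    out[i++] = 0x0A;             // field 1, wire type 2 -> this tag byte is '\n'
    out[i++] = (byte) ns.length; // 0x05 for "hbase", rendered as ^E
    for (byte b : ns) out[i++] = b;
    out[i++] = 0x12;             // field 2, wire type 2 -> rendered as ^R
    out[i++] = (byte) q.length;  // 0x04 for "meta", rendered as ^D
    for (byte b : q) out[i++] = b;
    return out;
  }

  public static void main(String[] args) {
    // Read back as a plain string this is "\n\u0005hbase\u0012\u0004meta": the leading
    // '\n' is what the namespace check rejects as "Illegal character at 0".
    System.out.println(new String(encode("hbase", "meta"), StandardCharsets.UTF_8));
  }
}
{code}

That is consistent with the symptom in the description: the field that should carry 
just the namespace (or qualifier) bytes appears to carry a whole serialized TableName 
message instead.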

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: compatibility, incompatibility, serialization, upgrade
> Fix For: 1.7.1
>
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. 
> Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil 
> seems to be producing "*\n hbase meta*" and "*\n hbase namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at 

[jira] [Created] (HBASE-26071) Document HBASE-26070 and upgrade considerations for 1.7.0/1.7.1

2021-07-06 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-26071:


 Summary: Document HBASE-26070 and upgrade considerations for 
1.7.0/1.7.1
 Key: HBASE-26071
 URL: https://issues.apache.org/jira/browse/HBASE-26071
 Project: HBase
  Issue Type: Sub-task
  Components: documentation
Affects Versions: 1.7.0
Reporter: Bharath Vissapragada
Assignee: Bharath Vissapragada
 Fix For: 1.7.1


Document the incompatible serialization problem and upgrade paths for users 
that are on 1.7.0 and prior releases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables

2021-07-06 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-26070:


 Summary: Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 
compatible tables
 Key: HBASE-26070
 URL: https://issues.apache.org/jira/browse/HBASE-26070
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 1.7.0
Reporter: Bharath Vissapragada
Assignee: Bharath Vissapragada
 Fix For: 1.7.1


As discussed in the parent issue, 1.7.0 introduced an incompatible table 
serialization change. We need tooling that auto converts these into 1.7.1 
compatible table descriptors and table states.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-01 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372926#comment-17372926
 ] 

Bharath Vissapragada commented on HBASE-26042:
--

Thanks for the jstacks. I think consume is not scheduled any further because the 
broken sync is not clearing up the ring buffer. To me, the most suspicious stack 
is the following hung flush thread.

{noformat}
Thread 9464: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
 - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, line=1707 (Compiled frame)
 - java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) @bci=119, line=3323 (Compiled frame)
 - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, line=1742 (Compiled frame)
 - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) @bci=16, line=189 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) @bci=9, line=202 (Compiled frame)
{noformat}

Given the state of this thread, I think the new writer instance (from the roll) 
is also broken for some reason and some of the callbacks (one of which is the 
future from the above thread) are not cleaned up correctly. I think there is 
some racy code in FanOutOneBlockAsyncDFSOutput especially around when a new 
writer is marked 'BROKEN' and a flush is called resulting in some 
waitingAckQueue members not being cleaned up correctly. Just a theory at this 
point, but probably easy to poke around with a heap dump or some related 
logging around these code paths.
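
The jstack pattern itself is simple to reproduce in isolation. The sketch below is 
illustrative only (the interface and method names are made up, not the real 
AsyncProtobufLogWriter API): writing the header returns a future that completes when 
the output pipeline acks, and the caller waits on it with no timeout, so an output 
that never acks parks the log-roller thread indefinitely.

{code:java}
import java.util.concurrent.CompletableFuture;

public class HeaderWriteSketch {

  // Stand-in for an async output whose write future completes on pipeline ack.
  interface AsyncOutput {
    CompletableFuture<Long> write(byte[] bytes);
  }

  static long writeMagicAndHeader(AsyncOutput out, byte[] magic, byte[] header)
      throws Exception {
    CompletableFuture<Long> lenFuture = out.write(concat(magic, header));
    // No timeout here: if the freshly created output is already broken and never
    // acks, this is exactly the parked CompletableFuture.get() frame in the jstack.
    return lenFuture.get();
  }

  static byte[] concat(byte[] a, byte[] b) {
    byte[] r = new byte[a.length + b.length];
    System.arraycopy(a, 0, r, 0, a.length);
    System.arraycopy(b, 0, r, a.length, b.length);
    return r;
  }

  public static void main(String[] args) throws Exception {
    // An output that acks immediately, so this demo terminates; the hang is the same
    // call made against a future that nothing ever completes.
    AsyncOutput healthy = bytes -> CompletableFuture.completedFuture((long) bytes.length);
    System.out.println(writeMagicAndHeader(healthy, new byte[] {'W', 'A', 'L'}, new byte[8]));
  }
}
{code}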


> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-06-30 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372080#comment-17372080
 ] 

Bharath Vissapragada commented on HBASE-26042:
--

Any chance you can attach a full jstack? Thanks.

[~vjasani] [~dmanning] [~apurtell] FYI.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc., 
> unable to add to the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 

[jira] [Updated] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-06-29 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26021:
-
Component/s: master

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. 
> Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil 
> seems to be producing "*\n hbase meta*" and "*\n hbase namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> 

[jira] [Updated] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-06-29 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26021:
-
Labels: compatibility incompatibility serialization upgrade  (was: )

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Assignee: Bharath Vissapragada
>Priority: Major
>  Labels: compatibility, incompatibility, serialization, upgrade
> Fix For: 1.7.1
>
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. 
> Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil 
> seems to be producing "*\n hbase meta*" and "*\n hbase namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> 

[jira] [Resolved] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-06-29 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada resolved HBASE-26021.
--
Fix Version/s: 1.7.1
   Resolution: Fixed

Merged the change, thanks for the reviews. I spoke with [~vjasani] offline; he 
mentioned he will confirm whether this patch works for him in his local test 
setup.

I volunteer to do a 1.7.0.1 release if the patch works for Viraj (unless there 
are any objections).

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 1.7.1
>
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. 
> Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil 
> seems to be producing "*\n hbase meta*" and "*\n hbase namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> 

[jira] [Commented] (HBASE-26029) It is not reliable to use nodeDeleted event to track region server's death

2021-06-28 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371044#comment-17371044
 ] 

Bharath Vissapragada commented on HBASE-26029:
--

Hooking this up with SCP is neat; it should clean up all the dead server 
management code from ReplicationSourceManager. +1.

> It is not reliable to use nodeDeleted event to track region server's death
> --
>
> Key: HBASE-26029
> URL: https://issues.apache.org/jira/browse/HBASE-26029
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, Zookeeper
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
>
> When implementing HBASE-26011, [~sunxin] pointed out an interesting scenario: 
> if a region server goes up and down between two sync requests, then we cannot 
> know about the death of that region server.
> https://github.com/apache/hbase/pull/3405#discussion_r656720923
> This is a valid point, and when thinking of a solution, I noticed that the 
> current zk implementation has the same problem. Notice that a watcher on zk 
> can only be triggered once, so after zk triggers the watcher, and before you 
> set a new watcher, it is possible that a region server goes up and down, and 
> you will miss the nodeDeleted event for this region server.
> I think the general approach here, which could work for both the master based 
> and zk based replication trackers, is that we should not rely on the tracker 
> to tell you which region server is dead. Instead, we just provide the list of 
> live region servers, and the upper layer should compare this list with the 
> expected list (for replication, the list can be obtained by listing 
> replicators) to detect the dead region servers.
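A minimal sketch of that reconciliation approach, with hypothetical class and 
method names (the real tracker and ReplicationSourceManager code will differ):

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: rather than reacting to one-shot nodeDeleted events,
// diff the expected replicator list against the currently live region servers.
public final class DeadServerDiff {

  /**
   * @param expectedReplicators servers that still own replication queues
   *                            (for replication, obtained by listing replicators)
   * @param liveRegionServers   servers the tracker currently reports as live
   * @return servers whose queues need to be claimed because they are not live
   */
  public static Set<String> findDeadServers(List<String> expectedReplicators,
      List<String> liveRegionServers) {
    Set<String> dead = new HashSet<>(expectedReplicators);
    dead.removeAll(liveRegionServers);
    return dead;
  }
}
{code}

Because the diff is recomputed from the full live list on every check, a server 
that went up and down between two checks still shows up as dead, which is the 
point of not depending on individual watcher events.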



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-06-28 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371034#comment-17371034
 ] 

Bharath Vissapragada commented on HBASE-26021:
--

https://github.com/apache/hbase/pull/3435 I did a quick revert in the interest 
of time ([~vjasani] hope it's ok). It was a bit involved due to the other 
changes piled up on top of it. I put it up for review to see how Jenkins reacts 
to it. I think it could use some additional test coverage for the new interface 
methods.

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Priority: Major
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up an HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing an issue while parsing the namespace from the HDFS 
> fileinfo. Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using 
> ProtobufUtil seems to be producing "*\n hbase meta*" and "*\n hbase namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> 

[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization

2021-06-27 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370424#comment-17370424
 ] 

Bharath Vissapragada commented on HBASE-26021:
--

Sorry, late to the party; I was OOO this week and couldn't access Jira, so I'm 
catching up on the latest discussion. Agree that 1.7.0 is broken and we should 
spin up 1.7.0.1 ([~reidchan] I can also help create the release if you are 
busy, let me know; there is another critical fix, HBASE-25984, that I recently 
committed to branch-1 and that is worthy of inclusion).

Coming to the fix, isn't the patch incomplete? If we just serialize HTD instead 
of TD (= HTD + table state), there is a loss of information, right? We are just 
not seeing it in tests because we cache the table state in TableStateManager, 
so a sequence like disable table, stop HBase, start HBase should result in an 
enabled table? Let's properly revert this patch instead? (We may need to add 
some special code to handle serialized TDs for those who are already on 
1.7.0?)

> HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Priority: Major
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up an HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing an issue while parsing the namespace from the HDFS 
> fileinfo. Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using 
> ProtobufUtil seems to be producing "*\n hbase meta*" and "*\n hbase namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
>

[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue

2021-06-23 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368469#comment-17368469
 ] 

Bharath Vissapragada commented on HBASE-26021:
--

[~vjasani] Thanks for digging. I suspected that commit could cause some edge 
cases like this. The commit in its exact form is not needed for the work IIRC 
(it diverged the code quite a bit, so it had to be backported), so we can undo 
the parts of the serialization that are incompatible with branch-1. Did you 
happen to dig into what exactly is off in the serialization piece?
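For what it's worth, the garbled namespace strings in the log 
({{^Ehbase^R^Dmeta}}) look like the raw protobuf wire bytes of an embedded 
TableName-style message being validated as a plain namespace string. Below is a 
minimal sketch of that framing, assuming length-delimited fields 1 and 2 for 
namespace and qualifier; it only illustrates the symptom and is not a confirmed 
root-cause analysis:

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Sketch of standard protobuf length-delimited framing for a message with
// namespace = "hbase" (field 1) and qualifier = "meta" (field 2), and of what
// happens when those embedded bytes are read back as a plain string.
public final class WireFramingSketch {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(0x0A);                                      // field 1, wire type 2 (prints as '\n')
    out.write(5);                                         // length of "hbase"
    out.write("hbase".getBytes(StandardCharsets.UTF_8));
    out.write(0x12);                                      // field 2, wire type 2
    out.write(4);                                         // length of "meta"
    out.write("meta".getBytes(StandardCharsets.UTF_8));

    byte[] wire = out.toByteArray();
    // Dropping only the leading tag byte and treating the rest as a string
    // yields "\u0005hbase\u0012\u0004meta", i.e. the ^Ehbase^R^Dmeta seen in
    // the IllegalArgumentException from the issue description.
    System.out.println(new String(wire, 1, wire.length - 1, StandardCharsets.UTF_8));
  }
}
{code}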

> HBase 1.7 to 2.4 upgrade issue
> --
>
> Key: HBASE-26021
> URL: https://issues.apache.org/jira/browse/HBASE-26021
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.7.0, 2.4.4
>Reporter: Viraj Jasani
>Priority: Major
> Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot 
> 2021-06-22 at 12.54.30 PM.png
>
>
> As of today, if we bring up an HBase cluster using branch-1 and upgrade to 
> branch-2.4, we are facing an issue while parsing the namespace from the HDFS 
> fileinfo. Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using 
> ProtobufUtil seems to be producing "*\n hbase:\n meta*" and "*\n hbase:\n namespace*"
> {code:java}
> 2021-06-22 00:05:56,611 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> regionserver.RSRpcServices: Open hbase:meta,,1.1588230740
> 2021-06-22 00:05:56,648 INFO  
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> regionserver.RSRpcServices: Open 
> hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a.
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R   namespace
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597)
> at 
> org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482)
> at 
> org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2021-06-22 00:05:56,759 ERROR 
> [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] 
> ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: Illegal character <
> > at 0. Namespaces may only contain 'alphanumeric characters' from any 
> > language and digits:
> ^Ehbase^R^Dmeta
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246)
> at 
> org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220)
> at org.apache.hadoop.hbase.TableName.(TableName.java:348)
> at 
> org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385)
> at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292)
>   

[jira] [Updated] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-18 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25998:
-
Fix Version/s: 2.4.5
   2.3.6
   2.5.0
   3.0.0-alpha-1
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.5
>
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-18 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25984:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-18 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25984:
-
Fix Version/s: 2.4.5
   1.7.1
   2.3.6
   2.5.0
   3.0.0-alpha-1

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-17 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365060#comment-17365060
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

Backport PRs are WIP (the PR-Jira auto-link isn't working; it will catch up 
soon).

https://github.com/apache/hbase/pull/3392
https://github.com/apache/hbase/pull/3393
https://github.com/apache/hbase/pull/3394

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-17 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365057#comment-17365057
 ] 

Bharath Vissapragada commented on HBASE-25998:
--

FSHLog doesn't show much improvement in WALPE with the patch, so I believe that 
is reflected in the YCSB runs too. Unfortunately I'm not able to deploy a 
branch-2 cluster right now (without much effort) to get the async WAL numbers. 
I will update here once I have a cluster and some data.

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-16 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364520#comment-17364520
 ] 

Bharath Vissapragada edited comment on HBASE-25998 at 6/16/21, 9:21 PM:


Thanks [~apurtell] for trying out the patch (and review).

One interesting behavior here is that this big throughput difference is only 
obvious for Async WAL implementation, not clear to me why, perhaps there is a 
lot more contention in that implementation for some reason. I repeated the same 
set of tests in branch-1/master based FSHLog and the patch only performs 
slightly better (few single digit % points). This behavior was also confirmed 
in the YCSB runs on branch-1 (on a 3 node containerized EC2 cluster).

Without patch: branch-1/FSHLog (10M ingest only)
{noformat}
[OVERALL], RunTime(ms), 199938
[OVERALL], Throughput(ops/sec), 50015.50480649001
[TOTAL_GCS_PS_Scavenge], Count, 293
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1222
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.611189468735308
[TOTAL_GCS_PS_MarkSweep], Count, 1
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 34
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.017005271634206603
[TOTAL_GCs], Count, 294
[TOTAL_GC_TIME], Time(ms), 1256
[TOTAL_GC_TIME_%], Time(%), 0.6281947403695145
[CLEANUP], Operations, 512
[CLEANUP], AverageLatency(us), 41.0234375
[CLEANUP], MinLatency(us), 0
[CLEANUP], MaxLatency(us), 18527
[CLEANUP], 95thPercentileLatency(us), 13
[CLEANUP], 99thPercentileLatency(us), 37
[INSERT], Operations, 1000
[INSERT], AverageLatency(us), 5085.9494093
[INSERT], MinLatency(us), 1499
[INSERT], MaxLatency(us), 220927
[INSERT], 95thPercentileLatency(us), 6511
[INSERT], 99thPercentileLatency(us), 16655
[INSERT], Return=OK, 1000
{noformat}
With patch: branch-1/FSHLog (10M ingest only)
{noformat}
[OVERALL], RunTime(ms), 195064
[OVERALL], Throughput(ops/sec), 51265.2257720543
[TOTAL_GCS_PS_Scavenge], Count, 284
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1184
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.6069802731411229
[TOTAL_GCS_PS_MarkSweep], Count, 1
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 33
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.01691752450477792
[TOTAL_GCs], Count, 285
[TOTAL_GC_TIME], Time(ms), 1217
[TOTAL_GC_TIME_%], Time(%), 0.6238977976459008
[CLEANUP], Operations, 512
[CLEANUP], AverageLatency(us), 45.783203125
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 20591
[CLEANUP], 95thPercentileLatency(us), 14
[CLEANUP], 99thPercentileLatency(us), 37
[INSERT], Operations, 1000
[INSERT], AverageLatency(us), 4958.6662675
[INSERT], MinLatency(us), 1380
[INSERT], MaxLatency(us), 295935
[INSERT], 95thPercentileLatency(us), 6335
[INSERT], 99thPercentileLatency(us), 19071
[INSERT], Return=OK, 1000
{noformat}
Unfortunately, the tooling I have does not support branch-2/master (yet), so I 
can't repeat this YCSB run for the Async WAL implementation, but if the WALPE 
runs are any indication, there should be a good enough throughput improvement.


was (Author: bharathv):
Thanks [~apurtell] for trying out the patch (and review).

One interesting behavior here is that this big throughput difference is only 
obvious for Async WAL implementation, not clear to me why, perhaps there is a 
lot more contention in that implementation for some reason. I repeated the same 
set of tests in branch-1/master based FSHLog and the patch only performs 
slightly better (few single digit % points). This behavior was also confirmed 
in the YCSB runs on branch-1 (on a 3 node containerized EC2 cluster).

Without patch: branch-1/FSHLog (10M ingest only)
{noformat}
[OVERALL], RunTime(ms), 199938
[OVERALL], Throughput(ops/sec), 50015.50480649001
[TOTAL_GCS_PS_Scavenge], Count, 293
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1222
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.611189468735308
[TOTAL_GCS_PS_MarkSweep], Count, 1
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 34
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.017005271634206603
[TOTAL_GCs], Count, 294
[TOTAL_GC_TIME], Time(ms), 1256
[TOTAL_GC_TIME_%], Time(%), 0.6281947403695145
[CLEANUP], Operations, 512
[CLEANUP], AverageLatency(us), 41.0234375
[CLEANUP], MinLatency(us), 0
[CLEANUP], MaxLatency(us), 18527
[CLEANUP], 95thPercentileLatency(us), 13
[CLEANUP], 99thPercentileLatency(us), 37
[INSERT], Operations, 1000
[INSERT], AverageLatency(us), 5085.9494093
[INSERT], MinLatency(us), 1499
[INSERT], MaxLatency(us), 220927
[INSERT], 95thPercentileLatency(us), 6511
[INSERT], 99thPercentileLatency(us), 16655
[INSERT], Return=OK, 1000
{noformat}
With patch: branch-1/FSHLog (10M ingest only)
{noformat}
[OVERALL], RunTime(ms), 195064
[OVERALL], Throughput(ops/sec), 51265.2257720543
[TOTAL_GCS_PS_Scavenge], Count, 284
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1184
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.6069802731411229
[TOTAL_GCS_PS_MarkSweep], Count, 1
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 33
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 

[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-16 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364520#comment-17364520
 ] 

Bharath Vissapragada commented on HBASE-25998:
--

Thanks [~apurtell] for trying out the patch (and review).

One interesting behavior here is that this big throughput difference is only 
obvious for Async WAL implementation, not clear to me why, perhaps there is a 
lot more contention in that implementation for some reason. I repeated the same 
set of tests in branch-1/master based FSHLog and the patch only performs 
slightly better (few single digit % points). This behavior was also confirmed 
in the YCSB runs on branch-1 (on a 3 node containerized EC2 cluster).

Without patch: branch-1/FSHLog (10M ingest only)
{noformat}
[OVERALL], RunTime(ms), 199938
[OVERALL], Throughput(ops/sec), 50015.50480649001
[TOTAL_GCS_PS_Scavenge], Count, 293
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1222
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.611189468735308
[TOTAL_GCS_PS_MarkSweep], Count, 1
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 34
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.017005271634206603
[TOTAL_GCs], Count, 294
[TOTAL_GC_TIME], Time(ms), 1256
[TOTAL_GC_TIME_%], Time(%), 0.6281947403695145
[CLEANUP], Operations, 512
[CLEANUP], AverageLatency(us), 41.0234375
[CLEANUP], MinLatency(us), 0
[CLEANUP], MaxLatency(us), 18527
[CLEANUP], 95thPercentileLatency(us), 13
[CLEANUP], 99thPercentileLatency(us), 37
[INSERT], Operations, 1000
[INSERT], AverageLatency(us), 5085.9494093
[INSERT], MinLatency(us), 1499
[INSERT], MaxLatency(us), 220927
[INSERT], 95thPercentileLatency(us), 6511
[INSERT], 99thPercentileLatency(us), 16655
[INSERT], Return=OK, 1000
{noformat}
With patch: branch-1/FSHLog (10M ingest only)
{noformat}
[OVERALL], RunTime(ms), 195064
[OVERALL], Throughput(ops/sec), 51265.2257720543
[TOTAL_GCS_PS_Scavenge], Count, 284
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1184
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.6069802731411229
[TOTAL_GCS_PS_MarkSweep], Count, 1
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 33
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.01691752450477792
[TOTAL_GCs], Count, 285
[TOTAL_GC_TIME], Time(ms), 1217
[TOTAL_GC_TIME_%], Time(%), 0.6238977976459008
[CLEANUP], Operations, 512
[CLEANUP], AverageLatency(us), 45.783203125
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 20591
[CLEANUP], 95thPercentileLatency(us), 14
[CLEANUP], 99thPercentileLatency(us), 37
[INSERT], Operations, 1000
[INSERT], AverageLatency(us), 4958.6662675
[INSERT], MinLatency(us), 1380
[INSERT], MaxLatency(us), 295935
[INSERT], 95thPercentileLatency(us), 6335
[INSERT], 99thPercentileLatency(us), 19071
[INSERT], Return=OK, 1000
{noformat}
Unfortunately, the tooling I have does not support branch-2/master (yet), so I 
can't repeat this YCSB run for the Async WAL implementation, but if the WALPE 
runs are any indication, there should be a good enough throughput improvement.

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-13 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362670#comment-17362670
 ] 

Bharath Vissapragada commented on HBASE-25998:
--

[~stack] Thanks for taking a look. Test runs seem fine; I'm trying to do an 
end-to-end throughput test on a cluster and will report the results here.

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-11 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25998:
-
Status: Patch Available  (was: Open)

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-11 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361973#comment-17361973
 ] 

Bharath Vissapragada commented on HBASE-25998:
--

Redid the experiments with JDK 11 (to account for any recent monitor 
performance enhancements) and I see similar numbers. Also, the numbers above 
are for {{-t 256}}, which implies heavy contention. It seems like the patch 
performs well under heavy load and the gap narrows with fewer threads (which I 
guess is expected), but even with very low concurrency the patch seems to 
outperform the current state.

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-11 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361867#comment-17361867
 ] 

Bharath Vissapragada commented on HBASE-25998:
--

[~zhangduo] [~apurtell] [~stack] this might be of interest to you (draft patch 
up for review); the results seem too good to be true. If you don't mind, please 
try the patch locally in your environment (I just want to eliminate any noise 
from my end). PTAL.

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-11 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361864#comment-17361864
 ] 

Bharath Vissapragada commented on HBASE-25998:
--

{noformat}
java -version
java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
{noformat}


For default WAL provider (async WAL) 

Without Patch

{noformat}
-- Histograms --
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.latencyHistogram.nanos
 count = 10271257
   min = 2672827
   max = 67700701
  mean = 4084532.41
stddev = 6244597.80
median = 3403047.00
  75% <= 3525394.00
  95% <= 3849268.00
  98% <= 4319378.00
  99% <= 61134500.00
99.9% <= 67195663.00
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncCountHistogram.countPerSync
 count = 100888
   min = 52
   max = 103
  mean = 101.91
stddev = 2.09
median = 102.00
  75% <= 102.00
  95% <= 102.00
  98% <= 102.00
  99% <= 103.00
99.9% <= 103.00
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncHistogram.nanos-between-syncs
 count = 100889
   min = 119051
   max = 62778058
  mean = 1601305.10
stddev = 3626948.72
median = 1361530.00
  75% <= 1407052.00
  95% <= 1523418.00
  98% <= 1765310.00
  99% <= 2839178.00
99.9% <= 62778058.00

-- Meters --
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.appendMeter.bytes
 count = 5721241096
 mean rate = 37890589.06 events/second
 1-minute rate = 36390169.75 events/second
 5-minute rate = 33524039.88 events/second
15-minute rate = 31915066.49 events/second
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncMeter.syncs
 count = 100889
 mean rate = 668.16 events/second
 1-minute rate = 641.77 events/second
 5-minute rate = 590.37 events/second
15-minute rate = 561.67 events/second
{noformat}

With patch:

{noformat}
-- Histograms --
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.latencyHistogram.nanos
 count = 12927042
   min = 943723
   max = 60827209
  mean = 1865217.32
stddev = 5384907.53
median = 1323691.00
  75% <= 1443195.00
  95% <= 1765866.00
  98% <= 1921920.00
  99% <= 3144643.00
99.9% <= 60827209.00
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncCountHistogram.countPerSync
 count = 126797
   min = 52
   max = 104
  mean = 101.87
stddev = 2.54
median = 102.00
  75% <= 102.00
  95% <= 102.00
  98% <= 103.00
  99% <= 103.00
99.9% <= 103.00
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncHistogram.nanos-between-syncs
 count = 126798
   min = 122666
   max = 60703608
  mean = 711847.31
stddev = 3174375.63
median = 519092.00
  75% <= 570240.00
  95% <= 695175.00
  98% <= 754972.00
  99% <= 791139.00
99.9% <= 59975393.00

-- Meters --
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.appendMeter.bytes
 count = 7200681555
 mean rate = 79170095.16 events/second
 1-minute rate = 75109969.27 events/second
 5-minute rate = 66505621.40 events/second
15-minute rate = 63719949.74 events/second
org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncMeter.syncs
 count = 126800
 mean rate = 1394.11 events/second
 1-minute rate = 1322.31 events/second
 5-minute rate = 1169.99 events/second
15-minute rate = 1120.69 events/second
{noformat}



> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, 

[jira] [Updated] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-11 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25998:
-
Attachment: monitor-overhead-2.png
monitor-overhead-1.png

> Revisit synchronization in SyncFuture
> -
>
> Key: HBASE-25998
> URL: https://issues.apache.org/jira/browse/HBASE-25998
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: monitor-overhead-1.png, monitor-overhead-2.png
>
>
> While working on HBASE-25984, I noticed some weird frames in the flame graphs 
> around monitor entry exit consuming a lot of CPU cycles (see attached 
> images). Noticed that the synchronization there is too coarse grained and 
> sometimes unnecessary. I did a simple patch that switched to a reentrant lock 
> based synchronization with condition variable rather than a busy wait and 
> that showed 70-80% increased throughput in WAL PE. Seems too good to be 
> true.. (more details in the comments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25998) Revisit synchronization in SyncFuture

2021-06-11 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-25998:


 Summary: Revisit synchronization in SyncFuture
 Key: HBASE-25998
 URL: https://issues.apache.org/jira/browse/HBASE-25998
 Project: HBase
  Issue Type: Improvement
  Components: Performance, regionserver, wal
Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
Reporter: Bharath Vissapragada
Assignee: Bharath Vissapragada


While working on HBASE-25984, I noticed some weird frames in the flame graphs 
around monitor entry exit consuming a lot of CPU cycles (see attached images). 
Noticed that the synchronization there is too coarse grained and sometimes 
unnecessary. I did a simple patch that switched to a reentrant lock based 
synchronization with condition variable rather than a busy wait and that showed 
70-80% increased throughput in WAL PE. Seems too good to be true.. (more 
details in the comments).
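
A minimal sketch of the pattern described above, i.e. blocking on a 
ReentrantLock Condition instead of a coarse synchronized busy wait (illustrative 
only; the names are simplified and this is not the actual SyncFuture code):

{code:java}
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a sync-future-like latch that blocks on a Condition instead of
// spinning in a synchronized wait/notify loop.
public final class LockBasedSyncFuture {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition doneCondition = lock.newCondition();
  private long doneTxid = -1;   // txid this future completed at; -1 while pending
  private Throwable throwable;  // set if the sync failed

  public void done(long txid, Throwable t) {
    lock.lock();
    try {
      doneTxid = txid;
      throwable = t;
      doneCondition.signalAll();  // wake all waiters exactly once
    } finally {
      lock.unlock();
    }
  }

  public long get(long timeoutNs) throws InterruptedException, TimeoutException {
    lock.lock();
    try {
      long remainingNs = timeoutNs;
      while (doneTxid < 0) {
        if (remainingNs <= 0) {
          throw new TimeoutException("sync future not completed in time");
        }
        remainingNs = doneCondition.awaitNanos(remainingNs);
      }
      if (throwable != null) {
        throw new RuntimeException(throwable);
      }
      return doneTxid;
    } finally {
      lock.unlock();
    }
  }
}
{code}

The point of the pattern is that waiters park on the condition until done() 
signals, instead of repeatedly re-entering an object monitor, which is where the 
flame graphs showed the monitor entry/exit overhead.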



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.

2021-06-10 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361144#comment-17361144
 ] 

Bharath Vissapragada commented on HBASE-25924:
--

[~shahrs87] I think it was committed to and then reverted from 2.3; it would 
still be nice to have a working patch in that branch (if you have spare cycles) 
:-).

> Seeing a spike in uncleanlyClosedWALs metric.
> -
>
> Key: HBASE-25924
> URL: https://issues.apache.org/jira/browse/HBASE-25924
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4, 1.7.1
>
>
> Getting the following log line in all of our production clusters when 
> WALEntryStream is dequeuing a WAL file.
> {noformat}
>  2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - 
> Reached the end of WAL file hdfs://. It was not closed 
> cleanly, so we did not parse 8 bytes of data. This is normally ok.
> {noformat}
> The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4 bytes) + 
> "LAWP" (4 bytes) = 8 bytes).
> While dequeueing the WAL file from WALEntryStream, we reset the reader here.
> [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221]
> {code:java}
>   private void tryAdvanceEntry() throws IOException {
> if (checkReader()) {
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) { // no more entries in this log file - see 
> if log was rolled
> if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled
>   // Before dequeueing, we should always get one more attempt at 
> reading.
>   // This is in case more entries came in after we opened the reader,
>   // and a new log was enqueued while we were reading. See HBASE-6758
>   resetReader(); ---> HERE
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) {
> if (checkAllBytesParsed()) { // now we're certain we're done with 
> this log file
>   dequeueCurrentLog();
>   if (openNextLog()) {
> readNextEntryAndSetPosition();
>   }
> }
>   }
> } // no other logs, we've simply hit the end of the current open log. 
> Do nothing
>   }
> }
> // do nothing if we don't have a WAL Reader (e.g. if there's no logs in 
> queue)
>   }
> {code}
> In resetReader, we call the following methods, WALEntryStream#resetReader  
> >  ProtobufLogReader#reset ---> ProtobufLogReader#initInternal.
> In ProtobufLogReader#initInternal, we try to create the whole reader object 
> from scratch to see if any new data has been written.
> We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength.
> We calculate whether trailer is present or not depending on fileLength.
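
As a rough illustration of the trailer check described above (on-disk layout assumed 
from the description: the last 8 bytes of a cleanly closed WAL are a 4-byte 
serialized-trailer size followed by the 4-byte "LAWP" magic), a stale fileLength 
makes this check seek to the wrong offset and miss the trailer:
{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;

// Sketch only: seeks to (fileLength - 8) and checks for the trailer suffix.
// If fileLength is the stale value cached while the WAL was still being written,
// the magic is not found and the file looks "uncleanly closed".
final class WalTrailerCheck {
  private static final byte[] MAGIC = new byte[] { 'L', 'A', 'W', 'P' };
  private static final int TRAILER_SUFFIX = 4 /* size int */ + 4 /* magic */;

  static boolean hasTrailer(FSDataInputStream in, long fileLength) throws IOException {
    if (fileLength < TRAILER_SUFFIX) {
      return false;
    }
    in.seek(fileLength - TRAILER_SUFFIX);
    int trailerSize = in.readInt();          // serialized trailer size
    byte[] magic = new byte[MAGIC.length];
    in.readFully(magic);                     // "LAWP" on a clean close
    return trailerSize >= 0 && Arrays.equals(magic, MAGIC);
  }
}
{code}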



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-09 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360383#comment-17360383
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

After reading the AsyncFSWAL implementation, I think the overwrites of the 
futures are possible but do not cause a deadlock because of the way the safe 
point is attained. I uploaded a draft patch that removes the usage of 
ThreadLocals for both WAL implementations. It only matters for the FSHLog 
implementation, but in general it seems risky to use ThreadLocals that are prone 
to overwrites. [~zhangduo] Any thoughts?

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359575#comment-17359575
 ] 

Bharath Vissapragada edited comment on HBASE-25984 at 6/8/21, 9:39 PM:
---

Obtained a heap dump and poked around quite a bit and observed that in-memory 
state of the {{RingBufferEventHandler}} is corrupted. After analyzing it and 
eyeballing the code (long story short), came up with the following theory...

This is the current implementation of SyncFuture#reset()
{noformat}
1  synchronized SyncFuture reset(final long sequence, Span span) {
2if (t != null && t != Thread.currentThread()) throw new 
IllegalStateException();
3t = Thread.currentThread();
4if (!isDone()) throw new IllegalStateException("" + sequence + " " + 
Thread.currentThread());
5this.doneSequence = NOT_DONE;
6this.ringBufferSequence = sequence;
7this.span = span;
8this.throwable = null;
9return this;
10 }
{noformat}
We can see there are guards against overwriting unfinished sync futures, with 
checks on ‘current thread’ (L2) and !isDone() (L4). These are not tight checks; 
consider the following sequence of interleaved actions that can result in a 
deadlock:
 # RPC handler#1 → WAL sync - SyncFuture#1 (from ThreadLocalCache) → Add to 
RingBuffer - state: NOT_DONE
 # RingBufferConsumer#onEvent() → SyncFuture#1 dispatch → old state: NOT_DONE, 
new state: DONE (via releaseSyncFuture() in SyncRunner.run())
 # RPCHandler#1 (reused for another region operation) → ThreadLocal 
SyncFuture#1 NOT_DONE (via reset()) (*this goes through because isDone() now 
returns true and L2 always returns true because we reuse the ThreadLocal 
instance*).
 #  RingBufferConsumer#onEvent() - attainSafePoint() -> works on the 
overwritten sync future state (from step 3)
 #   DEADLOCK === (as the sync future remains in the state forever)

The problem here is that once a SyncFuture is marked DONE, it is eligible for 
reuse. Even though ring buffer still has references to it and is checking on it 
to attain a safe point, in the background another handler can just overwrite it 
resulting in ring buffer operating on a wrong future and deadlocking the 
system. Very subtle bug. I'm able to reproduce this in a carefully crafted 
unit test (attached). I have a fix that solves this problem and will create a PR 
for review soon.


was (Author: bharathv):
Obtained a heap dump and poked around quite a bit and observed that in-memory 
state is corrupted. After analyzing it and eyeballing the code (long story 
short), came up with the following theory...

This is the current implementation of SyncFuture#reset()
{noformat}
1  synchronized SyncFuture reset(final long sequence, Span span) {
2if (t != null && t != Thread.currentThread()) throw new 
IllegalStateException();
3t = Thread.currentThread();
4if (!isDone()) throw new IllegalStateException("" + sequence + " " + 
Thread.currentThread());
5this.doneSequence = NOT_DONE;
6this.ringBufferSequence = sequence;
7this.span = span;
8this.throwable = null;
9return this;
10 }
{noformat}
We can see there are guards against overwriting unfinished sync futures, with 
checks on ‘current thread’ (L2) and !isDone() (L4). These are not tight checks; 
consider the following sequence of interleaved actions that can result in a 
deadlock:
 # RPC handler#1 → WAL sync - SyncFuture#1 (from ThreadLocalCache) → Add to 
RingBuffer - state: NOT_DONE
 # RingBufferConsumer#onEvent() → SyncFuture#1 dispatch → old state: NOT_DONE, 
new state: DONE (via releaseSyncFuture() in SyncRunner.run())
 # RPCHandler#1 (reused for another region operation) → ThreadLocal 
SyncFuture#1 NOT_DONE (via reset()) (*this goes through because isDone() now 
returns true and L2 always returns true because we reuse the ThreadLocal 
instance*).
 #  RingBufferConsumer#onEvent() - attainSafePoint() -> works on the 
overwritten sync future state (from step 3)
 #   DEADLOCK === (as the sync future remains in the state forever)


The problem here is that once a SyncFuture is marked DONE, it is eligible for 
reuse. Even though ring buffer still has references to it and is checking on it 
to attain a safe point, in the background another handler can just overwrite it 
resulting in ring buffer operating on a wrong future and deadlocking the 
system. Very subtle bug. I'm able to reproduce this in a carefully crafted 
unit test (attached). I have a fix that solves this problem and will create a PR 
for review soon.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>   

[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25984:
-
Attachment: HBASE-25984-unit-test.patch

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359579#comment-17359579
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

This has been a recurring problem for us under load: it deadlocks a bunch of 
region servers and severely drops the ingestion throughput until our probes 
trigger a force abort of the RS pods, after which the throughput recovers. I 
cannot share the heap dump for obvious reasons, but can share any other details 
that reviewers might be interested in.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359575#comment-17359575
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

Obtained a heap dump and poked around quite a bit and observed that in-memory 
state is corrupted. After analyzing it and eyeballing the code (long story 
short), came up with the following theory...

This is the current implementation of SyncFuture#reset()
{noformat}
1  synchronized SyncFuture reset(final long sequence, Span span) {
2if (t != null && t != Thread.currentThread()) throw new 
IllegalStateException();
3t = Thread.currentThread();
4if (!isDone()) throw new IllegalStateException("" + sequence + " " + 
Thread.currentThread());
5this.doneSequence = NOT_DONE;
6this.ringBufferSequence = sequence;
7this.span = span;
8this.throwable = null;
9return this;
10 }
{noformat}
We can see there are guards against overwriting unfinished sync futures, with 
checks on ‘current thread’ (L2) and !isDone() (L4). These are not tight checks; 
consider the following sequence of interleaved actions that can result in a 
deadlock:
 # RPC handler#1 → WAL sync - SyncFuture#1 (from ThreadLocalCache) → Add to 
RingBuffer - state: NOT_DONE
 # RingBufferConsumer#onEvent() → SyncFuture#1 dispatch → old state: NOT_DONE, 
new state: DONE (via releaseSyncFuture() in SyncRunner.run())
 # RPCHandler#1 (reused for another region operation) → ThreadLocal 
SyncFuture#1 NOT_DONE (via reset()) (*this goes through because isDone() now 
returns true and L2 always returns true because we reuse the ThreadLocal 
instance*).
 #  RingBufferConsumer#onEvent() - attainSafePoint() -> works on the 
overwritten sync future state (from step 3)
 #   DEADLOCK === (as the sync future remains in the state forever)


The problem here is that once a SyncFuture is marked DONE, it is eligible for 
reuse. Even though ring buffer still has references to it and is checking on it 
to attain a safe point, in the background another handler can just overwrite it 
resulting in ring buffer operating on a wrong future and deadlocking the 
system. Very subtle bug. I'm able to reproduce this in a carefully crafted 
unit test (attached). I have a fix that solves this problem and will create a PR 
for review soon.
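
A minimal sketch of the direction hinted at here (dropping the per-thread cache so 
a handler cannot reset a future the ring buffer still references); class and method 
names are illustrative only, not the actual patch:
{code:java}
// Sketch only: contrasts the risky ThreadLocal reuse with per-request allocation.
final class SyncFutureAllocation {

  // Problematic pattern (simplified): the same SyncFuture instance is handed out
  // every time this handler thread syncs, even if the ring buffer consumer is
  // still tracking the previous use of that instance.
  private static final ThreadLocal<SimpleSyncFuture> CACHED =
      ThreadLocal.withInitial(SimpleSyncFuture::new);

  static SimpleSyncFuture reusedFuture(long sequence) {
    return CACHED.get().reset(sequence);   // can clobber an in-flight future
  }

  // Safer direction: a fresh future per sync request, so a later reset() on the
  // same handler thread can never overwrite one that is still outstanding.
  static SimpleSyncFuture freshFuture(long sequence) {
    return new SimpleSyncFuture().reset(sequence);
  }

  // Minimal stand-in for the real SyncFuture, just enough for the sketch.
  static final class SimpleSyncFuture {
    private volatile long sequence;
    private volatile boolean done = true;

    SimpleSyncFuture reset(long sequence) {
      this.sequence = sequence;
      this.done = false;
      return this;
    }

    void markDone() {
      this.done = true;
    }
  }
}
{code}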

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> 

[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25984:
-
Labels: deadlock hang  (was: )

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-25984:


 Summary: FSHLog WAL lockup with sync future reuse [RS deadlock]
 Key: HBASE-25984
 URL: https://issues.apache.org/jira/browse/HBASE-25984
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
Reporter: Bharath Vissapragada
Assignee: Bharath Vissapragada


We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
we noticed the WAL system gets locked up due to a subtle bug involving racy 
code with sync future reuse. This bug applies to all FSHLog implementations 
across branches.

Symptoms:

On heavily loaded clusters with large write load we noticed that the region 
servers are hanging abruptly with filled up handler queues and stuck MVCC 
indicating appends/syncs not making any progress.

{noformat}
 WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=172383686, 
writePoint=172383690, regionName=1ce4003ab60120057734ffe367667dca}
 WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=171504376, 
writePoint=171504381, regionName=7c441d7243f9f504194dae6bf2622631}
{noformat}

All the handlers are stuck waiting for the sync futures and timing out.

{noformat}
 java.lang.Object.wait(Native Method)
org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)

org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
.
{noformat}

Log rolling is stuck because it was unable to attain a safe point

{noformat}
   java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
 org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
{noformat}

and the Ring buffer consumer thinks that there are some outstanding syncs that 
need to finish..

{noformat}
  
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)

org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)

org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
{noformat}

On the other hand, SyncRunner threads are idle and just waiting for work 
implying that there are no pending SyncFutures that need to be run

{noformat}
   sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)

org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
java.lang.Thread.run(Thread.java:748)
{noformat}

Overall the WAL system is deadlocked and could make no progress until it was 
aborted. I got to the bottom of this issue and have a patch that can fix it 
(more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2021-06-07 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358951#comment-17358951
 ] 

Bharath Vissapragada commented on HBASE-24440:
--

Thanks for the detailed results.

> These implementations were microbenchmarked to identify the superior option, 
> which seems to be BoundedIncrementYieldAdvancingClock.

Kind of surprised. I would have guessed IncrementAdvancingClock theoretically 
(since there is no overhead to maintain boundedness); still trying to wrap my 
head around the subtle details in the patch.

> We are tracking whether or not a row has already been committed to in the 
> current clock tick. Or, we are tracking per region a set of rows per clock 
> tick. Why not track that explicitly?

This is an interesting idea; I'm trying to think about it a bit more. IIUC the idea 
here is to maximize the commit throughput as much as possible in a single tick 
by picking from the available pool of disjoint row keys that can share the same 
tick. Let me take a look at the patch.

> Prevent temporal misordering on timescales smaller than one clock tick
> --
>
> Key: HBASE-24440
> URL: https://issues.apache.org/jira/browse/HBASE-24440
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> When mutations are sent to the servers without a timestamp explicitly 
> assigned by the client the server will substitute the current wall clock 
> time. There are edge cases where it is at least theoretically possible for 
> more than one mutation to be committed to a given row within the same clock 
> tick. When this happens we have to track and preserve the ordering of these 
> mutations in some other way besides the timestamp component of the key. Let 
> me bypass most discussion here by noting that whether we do this or not, we 
> do not pass such ordering information in the cross cluster replication 
> protocol. We also have interesting edge cases regarding key type precedence 
> when mutations arrive "simultaneously": we sort deletes ahead of puts. This, 
> especially in the presence of replication, can lead to visible anomalies for 
> clients able to interact with both source and sink. 
> There is a simple solution that removes the possibility that these edge cases 
> can occur: 
> We can detect, when we are about to commit a mutation to a row, if we have 
> already committed a mutation to this same row in the current clock tick. 
> Occurrences of this condition will be rare. We are already tracking current 
> time. We have to know this in order to assign the timestamp. Where this 
> becomes interesting is how we might track the last commit time per row. 
> Making the detection of this case efficient for the normal code path is the 
> bulk of the challenge. One option is to keep track of the last locked time 
> for row locks. (Todo: How would we track and garbage collect this efficiently 
> and correctly. Not the ideal option.) We might also do this tracking somehow 
> via the memstore. (At least in this case the lifetime and distribution of in 
> memory row state, including the proposed timestamps, would align.) Assuming 
> we can efficiently know if we are about to commit twice to the same row 
> within a single clock tick, we would simply sleep/yield the current thread 
> until the clock ticks over, and then proceed. 
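
A minimal sketch of the yield-until-the-next-tick idea from the description above, 
assuming the caller already holds the row lock so per-row access is serialized; the 
tracking structure and names are illustrative only:
{code:java}
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: remembers the clock tick of the last commit per row and, if a
// second commit to the same row arrives within the same millisecond tick,
// yields until the clock ticks over before assigning the timestamp.
final class SameTickGuard {
  private final ConcurrentHashMap<String, Long> lastCommitTick = new ConcurrentHashMap<>();

  // Assumes the row lock is held by the caller, so accesses for a given row are
  // serialized; expiry/GC of old entries is omitted in this sketch.
  long acquireTimestampFor(String rowKey) {
    long now = System.currentTimeMillis();
    Long last = lastCommitTick.get(rowKey);
    while (last != null && last.longValue() == now) {
      Thread.yield();                       // rare case: wait for the next tick
      now = System.currentTimeMillis();
    }
    lastCommitTick.put(rowKey, now);
    return now;
  }
}
{code}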



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.

2021-06-01 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada resolved HBASE-25924.
--
Resolution: Fixed

> Seeing a spike in uncleanlyClosedWALs metric.
> -
>
> Key: HBASE-25924
> URL: https://issues.apache.org/jira/browse/HBASE-25924
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4, 1.7.1
>
>
> Getting the following log line in all of our production clusters when 
> WALEntryStream is dequeuing WAL file.
> {noformat}
>  2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - 
> Reached the end of WAL file hdfs://. It was not closed 
> cleanly, so we did not parse 8 bytes of data. This is normally ok.
> {noformat}
> The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + 
> "LAWP" (4 bytes) = 8 bytes)
> While dequeue'ing the WAL file from WALEntryStream, we reset the reader here.
> [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221]
> {code:java}
>   private void tryAdvanceEntry() throws IOException {
> if (checkReader()) {
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) { // no more entries in this log file - see 
> if log was rolled
> if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled
>   // Before dequeueing, we should always get one more attempt at 
> reading.
>   // This is in case more entries came in after we opened the reader,
>   // and a new log was enqueued while we were reading. See HBASE-6758
>   resetReader(); ---> HERE
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) {
> if (checkAllBytesParsed()) { // now we're certain we're done with 
> this log file
>   dequeueCurrentLog();
>   if (openNextLog()) {
> readNextEntryAndSetPosition();
>   }
> }
>   }
> } // no other logs, we've simply hit the end of the current open log. 
> Do nothing
>   }
> }
> // do nothing if we don't have a WAL Reader (e.g. if there's no logs in 
> queue)
>   }
> {code}
> In resetReader, we call the following methods, WALEntryStream#resetReader ---> 
> ProtobufLogReader#reset ---> ProtobufLogReader#initInternal.
> In ProtobufLogReader#initInternal, we try to create the whole reader object 
> from scratch to see if any new data has been written.
> We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength.
> We calculate whether trailer is present or not depending on fileLength.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.

2021-06-01 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25932:
-
Component/s: wal
 metrics

> TestWALEntryStream#testCleanClosedWALs test is failing.
> ---
>
> Key: HBASE-25932
> URL: https://issues.apache.org/jira/browse/HBASE-25932
> Project: HBase
>  Issue Type: Bug
>  Components: metrics, Replication, wal
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
> Attachments: HBASE-25932-test-approach.patch
>
>
> We are seeing the following test failure. 
> TestWALEntryStream#testCleanClosedWALs
>  This test was added in HBASE-25924. I don't think the test failure has 
> anything to do with the patch in HBASE-25924.
>  Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In 
> all the branches, we were not parsing the wal trailer when we close the wal 
> reader inside ReplicationSourceWALReader thread. The root cause was when we 
> add active WAL to ReplicationSourceWALReader, we cache the file size when the 
> wal was being actively written and once the wal was closed and replicated and 
> removed from WALEntryStream, we did reset the ProtobufLogReader object but we 
> didn't update the length of the wal and that was causing EOF errors since it 
> can't find the WALTrailer with the stale wal file length.
> The fix applied nicely to branch-1 since we use the FSHLog implementation, which 
> closes the WAL file synchronously.
> But in branch-2 and master, we use _AsyncFSWAL_ implementation and the 
> closing of wal file is done asynchronously (as the name suggests). This is 
> causing the test to fail. Below is the test.
> {code:java}
>   @Test
>   public void testCleanClosedWALs() throws Exception {
> try (WALEntryStream entryStream = new WALEntryStream(
>   logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
>   assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
>   appendToLogAndSync();
>   assertNotNull(entryStream.next());
>   log.rollWriter();  ===> This does an asynchronous close of wal.
>   appendToLogAndSync();
>   assertNotNull(entryStream.next());
>   assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> }
>   }
> {code}
> In the above code, when we roll writer, we don't close the old wal file 
> immediately so the ReplicationReader thread is not able to get the updated 
> wal file size and that is throwing EOF errors.
>  If I add a sleep of few milliseconds (1 ms in my local env) between 
> rollWriter and appendToLogAndSync statement then the test passes but this is 
> *not* a proper fix since we are working around the race between 
> ReplicationSourceWALReaderThread and closing of WAL file.
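
One way to avoid the arbitrary sleep mentioned above is to poll the condition with a 
bounded timeout; a generic helper along these lines (HBase's own test Waiter 
utilities could also serve) makes the test deterministic instead of racing the 
asynchronous close. Names here are illustrative only:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Sketch only: poll a condition until it holds or a deadline passes, instead of
// sleeping for a fixed interval and hoping the asynchronous WAL close finished.
final class TestWaitUtil {
  static void waitFor(BooleanSupplier condition, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() > deadline) {
        throw new AssertionError("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(10);   // short poll; total wait bounded by the deadline above
    }
  }
}
{code}
In the test this would mean waiting, after log.rollWriter(), for the rolled writer to 
be fully closed (or for the reader to observe the updated file length) before 
appending again and asserting on the metric.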



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.

2021-06-01 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25924:
-
Fix Version/s: (was: 2.3.6)

> Seeing a spike in uncleanlyClosedWALs metric.
> -
>
> Key: HBASE-25924
> URL: https://issues.apache.org/jira/browse/HBASE-25924
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4, 1.7.1
>
>
> Getting the following log line in all of our production clusters when 
> WALEntryStream is dequeuing WAL file.
> {noformat}
>  2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - 
> Reached the end of WAL file hdfs://. It was not closed 
> cleanly, so we did not parse 8 bytes of data. This is normally ok.
> {noformat}
> The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + 
> "LAWP" (4 bytes) = 8 bytes)
> While dequeue'ing the WAL file from WALEntryStream, we reset the reader here.
> [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221]
> {code:java}
>   private void tryAdvanceEntry() throws IOException {
> if (checkReader()) {
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) { // no more entries in this log file - see 
> if log was rolled
> if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled
>   // Before dequeueing, we should always get one more attempt at 
> reading.
>   // This is in case more entries came in after we opened the reader,
>   // and a new log was enqueued while we were reading. See HBASE-6758
>   resetReader(); ---> HERE
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) {
> if (checkAllBytesParsed()) { // now we're certain we're done with 
> this log file
>   dequeueCurrentLog();
>   if (openNextLog()) {
> readNextEntryAndSetPosition();
>   }
> }
>   }
> } // no other logs, we've simply hit the end of the current open log. 
> Do nothing
>   }
> }
> // do nothing if we don't have a WAL Reader (e.g. if there's no logs in 
> queue)
>   }
> {code}
> In resetReader, we call the following methods, WALEntryStream#resetReader ---> 
> ProtobufLogReader#reset ---> ProtobufLogReader#initInternal.
> In ProtobufLogReader#initInternal, we try to create the whole reader object 
> from scratch to see if any new data has been written.
> We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength.
> We calculate whether trailer is present or not depending on fileLength.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.

2021-06-01 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355221#comment-17355221
 ] 

Bharath Vissapragada commented on HBASE-25924:
--

Thanks, opened HBASE-25957 as a subtask. 

> Seeing a spike in uncleanlyClosedWALs metric.
> -
>
> Key: HBASE-25924
> URL: https://issues.apache.org/jira/browse/HBASE-25924
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4, 1.7.1
>
>
> Getting the following log line in all of our production clusters when 
> WALEntryStream is dequeuing WAL file.
> {noformat}
>  2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - 
> Reached the end of WAL file hdfs://. It was not closed 
> cleanly, so we did not parse 8 bytes of data. This is normally ok.
> {noformat}
> The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + 
> "LAWP" (4 bytes) = 8 bytes)
> While dequeue'ing the WAL file from WALEntryStream, we reset the reader here.
> [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221]
> {code:java}
>   private void tryAdvanceEntry() throws IOException {
> if (checkReader()) {
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) { // no more entries in this log file - see 
> if log was rolled
> if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled
>   // Before dequeueing, we should always get one more attempt at 
> reading.
>   // This is in case more entries came in after we opened the reader,
>   // and a new log was enqueued while we were reading. See HBASE-6758
>   resetReader(); ---> HERE
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) {
> if (checkAllBytesParsed()) { // now we're certain we're done with 
> this log file
>   dequeueCurrentLog();
>   if (openNextLog()) {
> readNextEntryAndSetPosition();
>   }
> }
>   }
> } // no other logs, we've simply hit the end of the current open log. 
> Do nothing
>   }
> }
> // do nothing if we don't have a WAL Reader (e.g. if there's no logs in 
> queue)
>   }
> {code}
> In resetReader, we call the following methods, WALEntryStream#resetReader ---> 
> ProtobufLogReader#reset ---> ProtobufLogReader#initInternal.
> In ProtobufLogReader#initInternal, we try to create the whole reader object 
> from scratch to see if any new data has been written.
> We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength.
> We calculate whether trailer is present or not depending on fileLength.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25957) Back port HBASE-25924 and its related patches to branch-2.3

2021-06-01 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355219#comment-17355219
 ] 

Bharath Vissapragada commented on HBASE-25957:
--

[~rushabh.shah] FYI, if you have any free cycles.

> Back port HBASE-25924 and its related patches to branch-2.3
> ---
>
> Key: HBASE-25957
> URL: https://issues.apache.org/jira/browse/HBASE-25957
> Project: HBase
>  Issue Type: Sub-task
>  Components: metrics, Replication, wal
>Affects Versions: 2.3.6
>Reporter: Bharath Vissapragada
>Priority: Major
>
> The branch has diverged a bit and the patches are not applying cleanly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25957) Back port HBASE-25924 and its related patches to branch-2.3

2021-06-01 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25957:
-
Component/s: wal
 Replication
 metrics

> Back port HBASE-25924 and its related patches to branch-2.3
> ---
>
> Key: HBASE-25957
> URL: https://issues.apache.org/jira/browse/HBASE-25957
> Project: HBase
>  Issue Type: Sub-task
>  Components: metrics, Replication, wal
>Affects Versions: 2.3.6
>Reporter: Bharath Vissapragada
>Priority: Major
>
> The branch has diverged a bit and the patches are not applying cleanly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25957) Back port HBASE-25924 and its related patches to branch-2.3

2021-06-01 Thread Bharath Vissapragada (Jira)
Bharath Vissapragada created HBASE-25957:


 Summary: Back port HBASE-25924 and its related patches to 
branch-2.3
 Key: HBASE-25957
 URL: https://issues.apache.org/jira/browse/HBASE-25957
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.3.6
Reporter: Bharath Vissapragada


The branch has diverged a bit and the patches are not applying cleanly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.

2021-06-01 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355178#comment-17355178
 ] 

Bharath Vissapragada commented on HBASE-25924:
--

HBASE-25932 is now committed to master/branch-2/branch2.4.

[~apurtell] What's the general guidance on back-porting to branch-2.3? That 
branch has diverged quite a bit and this patch doesn't apply cleanly. 

> Seeing a spike in uncleanlyClosedWALs metric.
> -
>
> Key: HBASE-25924
> URL: https://issues.apache.org/jira/browse/HBASE-25924
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4, 1.7.1
>
>
> Getting the following log line in all of our production clusters when 
> WALEntryStream is dequeuing WAL file.
> {noformat}
>  2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - 
> Reached the end of WAL file hdfs://. It was not closed 
> cleanly, so we did not parse 8 bytes of data. This is normally ok.
> {noformat}
> The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + 
> "LAWP" (4 bytes) = 8 bytes)
> While dequeue'ing the WAL file from WALEntryStream, we reset the reader here.
> [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221]
> {code:java}
>   private void tryAdvanceEntry() throws IOException {
> if (checkReader()) {
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) { // no more entries in this log file - see 
> if log was rolled
> if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled
>   // Before dequeueing, we should always get one more attempt at 
> reading.
>   // This is in case more entries came in after we opened the reader,
>   // and a new log was enqueued while we were reading. See HBASE-6758
>   resetReader(); ---> HERE
>   readNextEntryAndSetPosition();
>   if (currentEntry == null) {
> if (checkAllBytesParsed()) { // now we're certain we're done with 
> this log file
>   dequeueCurrentLog();
>   if (openNextLog()) {
> readNextEntryAndSetPosition();
>   }
> }
>   }
> } // no other logs, we've simply hit the end of the current open log. 
> Do nothing
>   }
> }
> // do nothing if we don't have a WAL Reader (e.g. if there's no logs in 
> queue)
>   }
> {code}
> In resetReader, we call the following methods, WALEntryStream#resetReader ---> 
> ProtobufLogReader#reset ---> ProtobufLogReader#initInternal.
> In ProtobufLogReader#initInternal, we try to create the whole reader object 
> from scratch to see if any new data has been written.
> We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength.
> We calculate whether trailer is present or not depending on fileLength.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.

2021-06-01 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25932:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> TestWALEntryStream#testCleanClosedWALs test is failing.
> ---
>
> Key: HBASE-25932
> URL: https://issues.apache.org/jira/browse/HBASE-25932
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
> Attachments: HBASE-25932-test-approach.patch
>
>
> We are seeing the following test failure. 
> TestWALEntryStream#testCleanClosedWALs
>  This test was added in HBASE-25924. I don't think the test failure has 
> anything to do with the patch in HBASE-25924.
>  Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In 
> all the branches, we were not parsing the wal trailer when we close the wal 
> reader inside ReplicationSourceWALReader thread. The root cause was when we 
> add active WAL to ReplicationSourceWALReader, we cache the file size when the 
> wal was being actively written and once the wal was closed and replicated and 
> removed from WALEntryStream, we did reset the ProtobufLogReader object but we 
> didn't update the length of the wal and that was causing EOF errors since it 
> can't find the WALTrailer with the stale wal file length.
> The fix applied nicely to branch-1 since we use the FSHLog implementation, which 
> closes the WAL file synchronously.
> But in branch-2 and master, we use _AsyncFSWAL_ implementation and the 
> closing of wal file is done asynchronously (as the name suggests). This is 
> causing the test to fail. Below is the test.
> {code:java}
>   @Test
>   public void testCleanClosedWALs() throws Exception {
> try (WALEntryStream entryStream = new WALEntryStream(
>   logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
>   assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
>   appendToLogAndSync();
>   assertNotNull(entryStream.next());
>   log.rollWriter();  ===> This does an asynchronous close of wal.
>   appendToLogAndSync();
>   assertNotNull(entryStream.next());
>   assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> }
>   }
> {code}
> In the above code, when we roll writer, we don't close the old wal file 
> immediately so the ReplicationReader thread is not able to get the updated 
> wal file size and that is throwing EOF errors.
>  If I add a sleep of few milliseconds (1 ms in my local env) between 
> rollWriter and appendToLogAndSync statement then the test passes but this is 
> *not* a proper fix since we are working around the race between 
> ReplicationSourceWALReaderThread and closing of WAL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.

2021-06-01 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25932:
-
Fix Version/s: 2.4.4
   2.5.0
   3.0.0-alpha-1

> TestWALEntryStream#testCleanClosedWALs test is failing.
> ---
>
> Key: HBASE-25932
> URL: https://issues.apache.org/jira/browse/HBASE-25932
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Bharath Vissapragada
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
> Attachments: HBASE-25932-test-approach.patch
>
>
> We are seeing the following test failure. 
> TestWALEntryStream#testCleanClosedWALs
>  This test was added in HBASE-25924. I don't think the test failure has 
> anything to do with the patch in HBASE-25924.
>  Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In 
> all the branches, we were not parsing the wal trailer when we close the wal 
> reader inside ReplicationSourceWALReader thread. The root cause was when we 
> add active WAL to ReplicationSourceWALReader, we cache the file size when the 
> wal was being actively written and once the wal was closed and replicated and 
> removed from WALEntryStream, we did reset the ProtobufLogReader object but we 
> didn't update the length of the wal and that was causing EOF errors since it 
> can't find the WALTrailer with the stale wal file length.
> The fix applied nicely to branch-1 since we use the FSHLog implementation, which 
> closes the WAL file synchronously.
> But in branch-2 and master, we use _AsyncFSWAL_ implementation and the 
> closing of wal file is done asynchronously (as the name suggests). This is 
> causing the test to fail. Below is the test.
> {code:java}
>   @Test
>   public void testCleanClosedWALs() throws Exception {
> try (WALEntryStream entryStream = new WALEntryStream(
>   logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
>   assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
>   appendToLogAndSync();
>   assertNotNull(entryStream.next());
>   log.rollWriter();  ===> This does an asynchronous close of wal.
>   appendToLogAndSync();
>   assertNotNull(entryStream.next());
>   assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> }
>   }
> {code}
> In the above code, when we roll writer, we don't close the old wal file 
> immediately so the ReplicationReader thread is not able to get the updated 
> wal file size and that is throwing EOF errors.
>  If I add a sleep of few milliseconds (1 ms in my local env) between 
> rollWriter and appendToLogAndSync statement then the test passes but this is 
> *not* a proper fix since we are working around the race between 
> ReplicationSourceWALReaderThread and closing of WAL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25942) Get rid of null regioninfo in wrapped connection exceptions

2021-05-30 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25942:
-
Fix Version/s: 2.4.4
   2.5.0
   3.0.0-alpha-1

> Get rid of null regioninfo in wrapped connection exceptions
> ---
>
> Key: HBASE-25942
> URL: https://issues.apache.org/jira/browse/HBASE-25942
> Project: HBase
>  Issue Type: Improvement
>  Components: logging
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.4.4
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Trivial
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
>
> I was debugging something and noticed this,
> {noformat}
> org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: 
> java.io.IOException: Call to address=localhost:59142null failed on local 
> exception: java.io.IOException: Connection reset by peer
> {noformat}
> `null` there is the regioninfo; because this call has nothing set on it, it was 
> confusing where that null was coming from. (It was added recently via HBASE-25735.)
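
A minimal sketch of the cosmetic fix implied here (only append the region detail 
when it is actually set, so the message reads "address=host:port" rather than 
"address=host:portnull"); method and parameter names are illustrative only:
{code:java}
// Sketch only: build the call detail string without printing "null" when no
// region info was attached to the call.
final class CallDescriptions {
  static String describe(String address, String regionInfo) {
    StringBuilder sb = new StringBuilder("address=").append(address);
    if (regionInfo != null) {
      sb.append(", region=").append(regionInfo);
    }
    return sb.toString();
  }
}
{code}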



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25942) Get rid of null regioninfo in wrapped connection exceptions

2021-05-30 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-25942:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Get rid of null regioninfo in wrapped connection exceptions
> ---
>
> Key: HBASE-25942
> URL: https://issues.apache.org/jira/browse/HBASE-25942
> Project: HBase
>  Issue Type: Improvement
>  Components: logging
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.4.4
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Trivial
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
>
> I was debugging something and noticed this,
> {noformat}
> org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: 
> java.io.IOException: Call to address=localhost:59142null failed on local 
> exception: java.io.IOException: Connection reset by peer
> {noformat}
> `null` there is the regioninfo; because this call has nothing set on it, it was 
> confusing where that null was coming from. (It was added recently via HBASE-25735.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

