[jira] [Updated] (HBASE-26548) Investigate mTLS in RPC layer
[ https://issues.apache.org/jira/browse/HBASE-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharath Vissapragada updated HBASE-26548:
-----------------------------------------
Description:
Current authentication options are heavily based on SASL and Kerberos. For organizations that don't already deploy Kerberos or other token providers, this is a heavy lift. Another very common way of authenticating in the industry is mTLS, which makes use of SSL certificates and can solve both wire encryption and auth. For those already deploying trusted certificates in their infra, mTLS may be much easier to integrate.

It isn't necessarily easy to implement this, but I do think we could use the existing Netty SSL support in the NettyRpcClient and NettyRpcServer. I know it's easy to add SSL to non-blocking IO through a hadoop.rpc.socket.factory.class.default which returns SSLSockets, but that doesn't touch on certificate verification at all.

Much more investigation is needed, but logging this due to some interest encountered on Slack.

Slack thread: https://apache-hbase.slack.com/archives/C13K8NVAM/p1638980520110600

was:
Current authentication options are heavily based on SASL and Kerberos. For organizations that don't already deploy Kerberos or other token providers, this is a heavy lift. Another very common way of authenticating in the industry is mTLS, which makes use of SSL certificates and can solve both wire encryption and auth. For those already deploying trusted certificates in their infra, mTLS may be much easier to integrate.

It isn't necessarily easy to implement this, but I do think we could use the existing Netty SSL support in the NettyRpcClient and NettyRpcServer. I also know it's easy to add SSL to non-blocking IO through a hadoop.rpc.socket.factory.class.default which returns SSLSockets (we do this at my company), but that doesn't touch on certificate verification at all.

Much more investigation is needed, but logging this due to some interest encountered on Slack.

> Investigate mTLS in RPC layer
> -----------------------------
>
>                 Key: HBASE-26548
>                 URL: https://issues.apache.org/jira/browse/HBASE-26548
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> Current authentication options are heavily based on SASL and Kerberos. For
> organizations that don't already deploy Kerberos or other token providers,
> this is a heavy lift. Another very common way of authenticating in the
> industry is mTLS, which makes use of SSL certificates and can solve both
> wire encryption and auth. For those already deploying trusted certificates in
> their infra, mTLS may be much easier to integrate.
> It isn't necessarily easy to implement this, but I do think we could use
> existing Netty SSL support in the NettyRpcClient and NettyRpcServer. I know
> it's easy to add SSL to non-blocking IO through a
> hadoop.rpc.socket.factory.class.default which returns SSLSockets, but that
> doesn't touch on certificate verification at all.
> Much more investigation is needed, but logging this due to some interest
> encountered on Slack.
> Slack thread:
> https://apache-hbase.slack.com/archives/C13K8NVAM/p1638980520110600

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
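To make the Netty angle above concrete, here is a minimal sketch of how mutual TLS could be wired into a Netty pipeline, assuming the server certificate, key, and CA bundle are provisioned on disk. The class name, constructor parameters, and handler placement are illustrative assumptions, not the design HBase adopted; note also that HBase relocates Netty under org.apache.hbase.thirdparty, so real code would use the shaded imports.

{code:java}
import java.io.File;

import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.ssl.ClientAuth;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;

/**
 * Hypothetical server-side initializer: requires a client certificate signed
 * by the configured CA, which is what makes the connection "mutual" TLS.
 */
public class MtlsServerInitializer extends ChannelInitializer<SocketChannel> {
  private final SslContext sslCtx;

  public MtlsServerInitializer(File serverCert, File serverKey, File caCert) throws Exception {
    this.sslCtx = SslContextBuilder.forServer(serverCert, serverKey)
        .trustManager(caCert)            // CA bundle used to verify client certificates
        .clientAuth(ClientAuth.REQUIRE)  // reject clients that do not present a certificate
        .build();
  }

  @Override
  protected void initChannel(SocketChannel ch) {
    // The SslHandler must sit in front of the existing RPC codec handlers.
    ch.pipeline().addFirst("ssl", sslCtx.newHandler(ch.alloc()));
  }
}
{code}

A client-side context would be built the same way with SslContextBuilder.forClient().keyManager(clientCert, clientKey).trustManager(caCert), so that both sides present and verify certificates.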
[jira] [Comment Edited] (HBASE-25051) DIGEST based auth broken for MasterRegistry
[ https://issues.apache.org/jira/browse/HBASE-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17398980#comment-17398980 ]

Bharath Vissapragada edited comment on HBASE-25051 at 8/13/21, 11:11 PM:
-------------------------------------------------------------------------
I think one way to fix this problem is to "advertise" the cluster ID as part of channel setup. This is an IPC protocol change. Currently, connection setup works as follows:

1. client to server: socket connect()
2. client sends a connection preamble (validated on the server; the connection is closed if malformed)
3. client optionally does a SASL handshake (if configured)

(3) is where the cluster ID is needed (if a token is configured as part of DIGEST based auth). My proposal is to modify it as follows:

1. client to server: socket connect()
2. server responds with a 16 byte UUID that is read by the client (with this clusterId the client can look up its tokens and pick the actual SASL mode)
3. client sends a connection preamble (validated on the server; the connection is closed if malformed)
4. client optionally does a SASL handshake (using the SASL mode and token from step (2))

A sample implementation using Netty IPC is something like this: https://github.com/bharathv/hbase/commit/95ff0d65828e8459d212e6173c00648b9c7b6814 . The patch still needs to do the following:

1. Move providers.selectProvider(clusterId, ticket) .. to after step (2)
2. Implement an equivalent change in blocking IPC.

One problem I can think of is compatibility between an old client and a new server (say during a rolling upgrade when the server is upgraded first); I can see if I can get it working by making the client ignore this piece of the response.

cc: [~zhangduo] WDYT? Do you think this breaks anything else, or is there a better way?

was (Author: bharathv):
I think one way to fix this problem is to "advertise" the cluster ID as part of channel setup. This is an IPC protocol change. Currently, connection setup works as follows:

1. client to server: socket connect()
2. client sends a connection preamble (validated on the server; the connection is closed if malformed)
3. client optionally does a SASL handshake (if configured)

(3) is where the cluster ID is needed (if a token is configured as part of DIGEST based auth). My proposal is to modify it as follows:

1. client to server: socket connect()
2. server responds with a 16 byte UUID that is read by the client (with this clusterId the client can look up its tokens and pick the actual SASL mode)
2. client sends a connection preamble (validated on the server; the connection is closed if malformed)
3. client optionally does a SASL handshake (using the SASL mode and token from step (2))

A sample implementation using Netty IPC is something like this: https://github.com/bharathv/hbase/commit/95ff0d65828e8459d212e6173c00648b9c7b6814 . The patch still needs to do the following:

1. Move providers.selectProvider(clusterId, ticket) .. to after step (2)
2. Implement an equivalent change in blocking IPC.

One problem I can think of is compatibility between an old client and a new server (say during a rolling upgrade when the server is upgraded first); I can see if I can get it working by making the client ignore this piece of the response.

cc: [~zhangduo] WDYT? Do you think this breaks anything else, or is there a better way?
> DIGEST based auth broken for MasterRegistry > --- > > Key: HBASE-25051 > URL: https://issues.apache.org/jira/browse/HBASE-25051 > Project: HBase > Issue Type: Sub-task > Components: Client, security >Affects Versions: 3.0.0-alpha-1, 2.3.0, 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Minor > > DIGEST-MD5 based sasl auth depends on cluster-ID to obtain tokens. With > master registry, we have a circular dependency here because master registry > needs an rpcClient to talk to masters (and to get cluster ID) and rpc-Client > needs a clusterId if DIGEST based auth is configured. Earlier, there was a ZK > client that has its own authentication mechanism to fetch the cluster ID. > HBASE-23330, I think doesn't fully fix the problem. It depends on an active > connection to fetch delegation tokens for the MR job and that inherently > assumes that the active connection does not use a DIGEST auth. > It is not clear to me how common it is to use DIGEST based auth in > connections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
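To illustrate step (2) of the proposal in the comment above, here is a rough sketch of the cluster-ID exchange, assuming the server simply writes the raw 16 UUID bytes immediately after accepting the connection and the client reads them before sending its preamble. The class and method names are hypothetical; the linked commit is the actual prototype.

{code:java}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.UUID;

/** Illustrative only: 16-byte cluster ID exchanged before the connection preamble. */
public final class ClusterIdPreamble {
  private ClusterIdPreamble() {}

  /** Server side: write the cluster ID as two big-endian longs (16 bytes total). */
  public static void writeClusterId(DataOutputStream out, UUID clusterId) throws IOException {
    out.writeLong(clusterId.getMostSignificantBits());
    out.writeLong(clusterId.getLeastSignificantBits());
    out.flush();
  }

  /** Client side: read the 16 bytes back before choosing a SASL provider. */
  public static UUID readClusterId(DataInputStream in) throws IOException {
    long msb = in.readLong();
    long lsb = in.readLong();
    return new UUID(msb, lsb);
  }
}
{code}

With the UUID in hand, the client could run the equivalent of providers.selectProvider(clusterId.toString(), ticket) before starting the SASL handshake, which is item (1) of the remaining work listed in the comment.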
[jira] [Commented] (HBASE-25051) DIGEST based auth broken for MasterRegistry
[ https://issues.apache.org/jira/browse/HBASE-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17398980#comment-17398980 ] Bharath Vissapragada commented on HBASE-25051: -- I think one way to fix this problem is to "advertise" the cluster ID as a part of channel setup. This is an IPC protocol change. Currently the way connection setup works is 1. client to server: socket connect() 2. client sends a connection preamble (validated on server and connection closed if malformed) 3. client optionally does a sasl handshake (if configured) (3) is where cluster ID is needed (if token is configured as a part of DIGEST based auth). Now my proposal is to modify it as follows. 1. client to server: socket connect() 2. server responds with a 16 byte UUID that is read by the client (client can get the actual SASL mode with this clusterId info and looking up the tokens) 2. client sends a connection preamble (validated on server and connection closed if malformed) 3. client optionally does a sasl handshake (using the sasl and token from step(2)). A sample implementation using netty IPC is something like this https://github.com/bharathv/hbase/commit/95ff0d65828e8459d212e6173c00648b9c7b6814 . The patch still needs to do the following, 1. Move providers.selectProvider(clusterId, ticket) .. to after step (2) 2. Implement an equivalent change in Blocking IPC. One problem I can think of is compatibility between old client and new server (say during a rolling upgrade when server is upgraded first), I can see if I can get it working by making client ignore this piece of response. cc: [~zhangduo] WDYT. You think this breaks anything else too or is there a better way? > DIGEST based auth broken for MasterRegistry > --- > > Key: HBASE-25051 > URL: https://issues.apache.org/jira/browse/HBASE-25051 > Project: HBase > Issue Type: Sub-task > Components: Client, security >Affects Versions: 3.0.0-alpha-1, 2.3.0, 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Minor > > DIGEST-MD5 based sasl auth depends on cluster-ID to obtain tokens. With > master registry, we have a circular dependency here because master registry > needs an rpcClient to talk to masters (and to get cluster ID) and rpc-Client > needs a clusterId if DIGEST based auth is configured. Earlier, there was a ZK > client that has its own authentication mechanism to fetch the cluster ID. > HBASE-23330, I think doesn't fully fix the problem. It depends on an active > connection to fetch delegation tokens for the MR job and that inherently > assumes that the active connection does not use a DIGEST auth. > It is not clear to me how common it is to use DIGEST based auth in > connections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26195) Data is present in replicated cluster but not present in primary cluster.
[ https://issues.apache.org/jira/browse/HBASE-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26195: - Affects Version/s: 2.5.0 3.0.0-alpha-1 > Data is present in replicated cluster but not present in primary cluster. > - > > Key: HBASE-26195 > URL: https://issues.apache.org/jira/browse/HBASE-26195 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > > We encountered a case where we are seeing some rows (via Phoenix) in > replicated cluster but they are not present in source/active cluster. > Triaging further we found memstore rollback logs in few of the region servers. > {noformat} > 2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > {noformat} > Looking more into logs, found that there were some hdfs layer issues sync'ing > wal to hdfs. > It was taking around 6 mins to sync wal. Logs below > {noformat} > 2021-07-28 14:19:30,511 WARN [sync.0] hdfs.DataStreamer - Slow > waitForAckedSeqno took 391210ms (threshold=3ms). File being written: > /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, > block: BP-958889176--1567030695029:blk_1689647875_616028364, Write > pipeline datanodes: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. > 2021-07-28 14:19:30,589 WARN [sync.1] hdfs.DataStreamer - Slow > waitForAckedSeqno took 391148ms (threshold=3ms). File being written: > /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, > block: BP-958889176--1567030695029:blk_1689647875_616028364, Write > pipeline datanodes: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. > 2021-07-28 14:19:30,589 WARN [sync.2] hdfs.DataStreamer - Slow > waitForAckedSeqno took 391147ms (threshold=3ms). 
File being written: > /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, > block: BP-958889176--1567030695029:blk_1689647875_616028364, Write > pipeline datanodes: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. > 2021-07-28 14:19:30,591 INFO [sync.0] wal.FSHLog - Slow sync cost: 391289 > ms, current pipeline: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] > 2021-07-28 14:19:30,591 INFO [sync.1] wal.FSHLog - Slow sync cost: 391227 > ms, current pipeline: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] > 2021-07-28 14:19:30,591 WARN [sync.1] wal.FSHLog - Requesting log roll > because we exceeded slow sync threshold; time=391227 ms, threshold=1 ms, > current pipeline: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > >
[jira] [Commented] (HBASE-26149) Further improvements on ConnectionRegistry implementations
[ https://issues.apache.org/jira/browse/HBASE-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396413#comment-17396413 ]

Bharath Vissapragada commented on HBASE-26149:
-----------------------------------------------
> This must have been in place always?

[~stack] Please see HBASE-25051's description. Earlier we relied on an authenticated ZK client to get the cluster ID. Now it's a chicken-and-egg problem if we are removing ZK from the picture.

> Further improvements on ConnectionRegistry implementations
> ----------------------------------------------------------
>
>                 Key: HBASE-26149
>                 URL: https://issues.apache.org/jira/browse/HBASE-26149
>             Project: HBase
>          Issue Type: Umbrella
>          Components: Client
>            Reporter: Duo Zhang
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HBASE-26158) MiniZooKeeperCluster should not be IA.Public
[ https://issues.apache.org/jira/browse/HBASE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392420#comment-17392420 ]

Bharath Vissapragada commented on HBASE-26158:
-----------------------------------------------
As Geoffrey mentioned, we use this class quite extensively as a shared ZK test instance across multiple mini clusters and other services. It is a very handy wrapper that is tightly integrated with the mini cluster and handles all the boilerplate code for us. We also don't have to worry about Curator version compatibility with HBase.

> MiniZooKeeperCluster should not be IA.Public
> --------------------------------------------
>
>                 Key: HBASE-26158
>                 URL: https://issues.apache.org/jira/browse/HBASE-26158
>             Project: HBase
>          Issue Type: Sub-task
>          Components: API, test
>            Reporter: Duo Zhang
>            Priority: Major
>
> End users do not need to test HBase when zookeeper is broken. And if users
> want to start only a zookeeper cluster, they can just use curator-test, so I
> do not think we should expose this class as IA.Public.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics
[ https://issues.apache.org/jira/browse/HBASE-25469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25469: - Affects Version/s: 2.3.7 2.4.6 3.0.0-alpha-1 > Create RIT servlet in HMaster to track more detailed RIT info not captured in > metrics > - > > Key: HBASE-25469 > URL: https://issues.apache.org/jira/browse/HBASE-25469 > Project: HBase > Issue Type: Improvement > Components: master >Affects Versions: 3.0.0-alpha-1, 2.4.6, 2.3.7 >Reporter: Caroline Zhou >Assignee: Caroline Zhou >Priority: Minor > Labels: observability > Attachments: Screen Shot 2021-07-27 at 10.34.45.png, Screen Shot > 2021-07-27 at 10.34.53.png > > > In HBase 2.1+, there is a RIT jsp page that was added as part of HBASE-21410. > There are some additional RIT details that would be helpful to have in one > place: > * RIT Start Time > * RIT Duration (ms) > * Server > * Procedure Type > This info can be added to the table under the {{/rit.jsp}} page, and we can > also add a button on that page to view info as JSON, for easy parsing into > metrics, etc. This JSON dump can be served as a servlet. > We may also consider different ways of grouping the JSON results, such as by > state, table, or server name, and/or adding counts of RIT by state or server > name. > !Screen Shot 2021-07-27 at 10.34.45.png! > !Screen Shot 2021-07-27 at 10.34.53.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics
[ https://issues.apache.org/jira/browse/HBASE-25469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25469: - Labels: observability (was: ) > Create RIT servlet in HMaster to track more detailed RIT info not captured in > metrics > - > > Key: HBASE-25469 > URL: https://issues.apache.org/jira/browse/HBASE-25469 > Project: HBase > Issue Type: Improvement > Components: master >Reporter: Caroline Zhou >Assignee: Caroline Zhou >Priority: Minor > Labels: observability > Attachments: Screen Shot 2021-07-27 at 10.34.45.png, Screen Shot > 2021-07-27 at 10.34.53.png > > > In HBase 2.1+, there is a RIT jsp page that was added as part of HBASE-21410. > There are some additional RIT details that would be helpful to have in one > place: > * RIT Start Time > * RIT Duration (ms) > * Server > * Procedure Type > This info can be added to the table under the {{/rit.jsp}} page, and we can > also add a button on that page to view info as JSON, for easy parsing into > metrics, etc. This JSON dump can be served as a servlet. > We may also consider different ways of grouping the JSON results, such as by > state, table, or server name, and/or adding counts of RIT by state or server > name. > !Screen Shot 2021-07-27 at 10.34.45.png! > !Screen Shot 2021-07-27 at 10.34.53.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics
[ https://issues.apache.org/jira/browse/HBASE-25469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25469: - Component/s: master > Create RIT servlet in HMaster to track more detailed RIT info not captured in > metrics > - > > Key: HBASE-25469 > URL: https://issues.apache.org/jira/browse/HBASE-25469 > Project: HBase > Issue Type: Improvement > Components: master >Reporter: Caroline Zhou >Assignee: Caroline Zhou >Priority: Minor > Attachments: Screen Shot 2021-07-27 at 10.34.45.png, Screen Shot > 2021-07-27 at 10.34.53.png > > > In HBase 2.1+, there is a RIT jsp page that was added as part of HBASE-21410. > There are some additional RIT details that would be helpful to have in one > place: > * RIT Start Time > * RIT Duration (ms) > * Server > * Procedure Type > This info can be added to the table under the {{/rit.jsp}} page, and we can > also add a button on that page to view info as JSON, for easy parsing into > metrics, etc. This JSON dump can be served as a servlet. > We may also consider different ways of grouping the JSON results, such as by > state, table, or server name, and/or adding counts of RIT by state or server > name. > !Screen Shot 2021-07-27 at 10.34.45.png! > !Screen Shot 2021-07-27 at 10.34.53.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26109) Wrap up 1.7.1 release
[ https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385650#comment-17385650 ] Bharath Vissapragada commented on HBASE-26109: -- 1.7.1 links are up, waiting for 1.7.0 to be deleted in all the mirrors. Then I can commit HBASE-26071 and send out a notice. > Wrap up 1.7.1 release > - > > Key: HBASE-26109 > URL: https://issues.apache.org/jira/browse/HBASE-26109 > Project: HBase > Issue Type: Task > Components: documentation >Affects Versions: 1.7.1 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: release > Fix For: 1.7.1 > > > - Set branch-1 version to 1.7.2-SNAPSHOT > - Update SVN release bits, add download links to 1.7.1 > - Remove 1.7.0 from the mirrors > - Commit HBASE-26071 > in that order. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26110) Add download links for 1.7.1
[ https://issues.apache.org/jira/browse/HBASE-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26110: - Resolution: Fixed Status: Resolved (was: Patch Available) Thanks for the reviews. > Add download links for 1.7.1 > > > Key: HBASE-26110 > URL: https://issues.apache.org/jira/browse/HBASE-26110 > Project: HBase > Issue Type: Sub-task > Components: website >Affects Versions: 1.7.1 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 1.7.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26110) Add download links for 1.7.1
[ https://issues.apache.org/jira/browse/HBASE-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26110: - Status: Patch Available (was: Open) > Add download links for 1.7.1 > > > Key: HBASE-26110 > URL: https://issues.apache.org/jira/browse/HBASE-26110 > Project: HBase > Issue Type: Sub-task > Components: website >Affects Versions: 1.7.1 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 1.7.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26110) Add download links for 1.7.1
Bharath Vissapragada created HBASE-26110: Summary: Add download links for 1.7.1 Key: HBASE-26110 URL: https://issues.apache.org/jira/browse/HBASE-26110 Project: HBase Issue Type: Sub-task Components: website Affects Versions: 1.7.1 Reporter: Bharath Vissapragada Assignee: Bharath Vissapragada Fix For: 1.7.1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26109) Wrap up 1.7.1 release
[ https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385117#comment-17385117 ] Bharath Vissapragada commented on HBASE-26109: -- Waiting for the mirrors to catch up so that I can make the doc change. > Wrap up 1.7.1 release > - > > Key: HBASE-26109 > URL: https://issues.apache.org/jira/browse/HBASE-26109 > Project: HBase > Issue Type: Task > Components: documentation >Affects Versions: 1.7.1 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: release > Fix For: 1.7.1 > > > - Set branch-1 version to 1.7.2-SNAPSHOT > - Update SVN release bits, add download links to 1.7.1 > - Remove 1.7.0 from the mirrors > - Commit HBASE-26071 > in that order. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26109) Wrap up 1.7.1 release
[ https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385089#comment-17385089 ] Bharath Vissapragada commented on HBASE-26109: -- Looks like I don't have access to the svn repo to move content around. Working with Andrew offline to move the RC bits to release. {noformat} svn mv https://dist.apache.org/repos/dist/dev/hbase/1.7.1RC0 https://dist.apache.org/repos/dist/release/hbase/1.7.1 -m"Release Apache HBase 1.7.1" svn: E175013: Access to '/repos/dist/!svn/rvr/48933/dev/hbase/1.7.1RC0' forbidden {noformat} > Wrap up 1.7.1 release > - > > Key: HBASE-26109 > URL: https://issues.apache.org/jira/browse/HBASE-26109 > Project: HBase > Issue Type: Task > Components: documentation >Affects Versions: 1.7.1 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: release > Fix For: 1.7.1 > > > - Set branch-1 version to 1.7.2-SNAPSHOT > - Update SVN release bits, add download links to 1.7.1 > - Remove 1.7.0 from the mirrors > - Commit HBASE-26071 > in that order. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26109) Wrap up 1.7.1 release
[ https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26109: - Affects Version/s: 1.7.1 > Wrap up 1.7.1 release > - > > Key: HBASE-26109 > URL: https://issues.apache.org/jira/browse/HBASE-26109 > Project: HBase > Issue Type: Task > Components: documentation >Affects Versions: 1.7.1 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: release > Fix For: 1.7.1 > > > - Set branch-1 version to 1.7.2-SNAPSHOT > - Update SVN release bits, add download links to 1.7.1 > - Remove 1.7.0 from the mirrors > - Commit HBASE-26071 > in that order. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26109) Wrap up 1.7.1 release
[ https://issues.apache.org/jira/browse/HBASE-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26109: - Labels: release (was: ) > Wrap up 1.7.1 release > - > > Key: HBASE-26109 > URL: https://issues.apache.org/jira/browse/HBASE-26109 > Project: HBase > Issue Type: Task > Components: documentation >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: release > Fix For: 1.7.1 > > > - Set branch-1 version to 1.7.2-SNAPSHOT > - Update SVN release bits, add download links to 1.7.1 > - Remove 1.7.0 from the mirrors > - Commit HBASE-26071 > in that order. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26109) Wrap up 1.7.1 release
Bharath Vissapragada created HBASE-26109: Summary: Wrap up 1.7.1 release Key: HBASE-26109 URL: https://issues.apache.org/jira/browse/HBASE-26109 Project: HBase Issue Type: Task Components: documentation Reporter: Bharath Vissapragada Assignee: Bharath Vissapragada Fix For: 1.7.1 - Set branch-1 version to 1.7.2-SNAPSHOT - Update SVN release bits, add download links to 1.7.1 - Remove 1.7.0 from the mirrors - Commit HBASE-26071 in that order. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada resolved HBASE-26075. -- Fix Version/s: 1.7.1 Resolution: Fixed > Replication is stuck due to zero length wal file in oldWALs directory. > -- > > Key: HBASE-26075 > URL: https://issues.apache.org/jira/browse/HBASE-26075 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Critical > Fix For: 1.7.1 > > > Recently we encountered a case where size of log queue was increasing to > around 300 in few region servers in our production environment. > There were 295 wals in the oldWALs directory for that region server and the > *first file* was a 0 length file. > Replication was throwing the following error. > {noformat} > 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Failed to read stream of > replication entries > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: > java.io.EOFException: hdfs:///hbase/oldWALs/ not a > SequenceFile > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156) > Caused by: java.io.EOFException: > hdfs:///hbase/oldWALs/ not a SequenceFile > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934) > at > org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177) > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110) > ... 1 more > {noformat} > We fixed similar error via HBASE-25536 but the zero length file was in > recovered sources. > There were more logs after the above stack trace. 
> {noformat} > 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Couldn't get file length > information about log > hdfs:///hbase/WALs/ > 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] > regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ > was moved to hdfs:///hbase/oldWALs/ > {noformat} > There is a special logic in ReplicationSourceWALReader thread to handle 0 > length files but we only look in WALs directory and not in oldWALs directory. > {code} > private boolean handleEofException(Exception e, WALEntryBatch batch) { > PriorityBlockingQueue queue = logQueue.getQueue(walGroupId); > // Dump the log even if logQueue size is 1 if the source is from > recovered Source > // since we don't add current log to recovered source queue so it is safe > to remove. > if ((e instanceof EOFException || e.getCause() instanceof EOFException) && > (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) { > Path head = queue.peek(); >
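For illustration of the gap described in the issue above (the zero-length handling only looks in the WALs directory, not oldWALs), here is a hedged sketch of a check that considers both locations, using only the standard Hadoop FileSystem API. The helper name and the way the archived path is derived are assumptions for the sketch; the real fix lives inside ReplicationSourceWALReader/WALEntryStream.

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative helper: is this WAL an empty (0-length) file, wherever it now lives? */
public final class ZeroLengthWalCheck {
  private ZeroLengthWalCheck() {}

  public static boolean isZeroLength(FileSystem fs, Path walPath, Path oldWalsDir)
      throws IOException {
    // Look at the original location first, then the archived location, since the
    // log may already have been moved to oldWALs by the time the reader hits EOF.
    Path[] candidates = { walPath, new Path(oldWalsDir, walPath.getName()) };
    for (Path p : candidates) {
      if (fs.exists(p)) {
        FileStatus status = fs.getFileStatus(p);
        return status.getLen() == 0;  // empty file: safe for the caller to drop from the queue
      }
    }
    return false;  // found in neither place; let the caller handle the missing file
  }
}
{code}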
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381772#comment-17381772 ] Bharath Vissapragada commented on HBASE-26075: -- [~shahrs87] Can you please submit the PRs for master/branch-2 as a part of the child task? marking this resolved for 1.7.1 and the release work is in progress. > Replication is stuck due to zero length wal file in oldWALs directory. > -- > > Key: HBASE-26075 > URL: https://issues.apache.org/jira/browse/HBASE-26075 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Critical > > Recently we encountered a case where size of log queue was increasing to > around 300 in few region servers in our production environment. > There were 295 wals in the oldWALs directory for that region server and the > *first file* was a 0 length file. > Replication was throwing the following error. > {noformat} > 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Failed to read stream of > replication entries > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: > java.io.EOFException: hdfs:///hbase/oldWALs/ not a > SequenceFile > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156) > Caused by: java.io.EOFException: > hdfs:///hbase/oldWALs/ not a SequenceFile > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934) > at > org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177) > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110) > ... 
1 more > {noformat} > We fixed similar error via HBASE-25536 but the zero length file was in > recovered sources. > There were more logs after the above stack trace. > {noformat} > 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Couldn't get file length > information about log > hdfs:///hbase/WALs/ > 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] > regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ > was moved to hdfs:///hbase/oldWALs/ > {noformat} > There is a special logic in ReplicationSourceWALReader thread to handle 0 > length files but we only look in WALs directory and not in oldWALs directory. > {code} > private boolean handleEofException(Exception e, WALEntryBatch batch) { > PriorityBlockingQueue queue = logQueue.getQueue(walGroupId); > // Dump the log even if logQueue size is 1 if the source is from > recovered Source > // since we don't add current log to recovered source queue so it is safe > to remove. > if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
[jira] [Created] (HBASE-26093) Replication is stuck due to zero length wal file in oldWALs directory [master/branch-2]
Bharath Vissapragada created HBASE-26093: Summary: Replication is stuck due to zero length wal file in oldWALs directory [master/branch-2] Key: HBASE-26093 URL: https://issues.apache.org/jira/browse/HBASE-26093 Project: HBase Issue Type: Sub-task Affects Versions: 3.0.0-alpha-2, 2.4.5 Reporter: Bharath Vissapragada Assignee: Rushabh Shah Clone of the parent issue for branch-2/master. Resolving the parent as the issue is fixed in 1.7.1 and the jira needs to be in a resolved state for release notes curation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26083) L1 miss metric is incorrect when using CombinedBlockCache
[ https://issues.apache.org/jira/browse/HBASE-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381766#comment-17381766 ]

Bharath Vissapragada commented on HBASE-26083:
-----------------------------------------------
I just tagged RC0 moments ago and triggered the release build, which is running as I type this. Let's include this in the next release?

> L1 miss metric is incorrect when using CombinedBlockCache
> ---------------------------------------------------------
>
>                 Key: HBASE-26083
>                 URL: https://issues.apache.org/jira/browse/HBASE-26083
>             Project: HBase
>          Issue Type: Bug
>          Components: BlockCache
>            Reporter: Yutong Xiao
>            Assignee: Yutong Xiao
>            Priority: Minor
>             Fix For: 2.3.6, 3.0.0-alpha-2, 2.4.5
>
>
> In the CombinedBlockCache getBlock method:
> {code:java}
> public Cacheable getBlock(BlockCacheKey cacheKey, boolean caching,
>     boolean repeat, boolean updateCacheMetrics) {
>   // TODO: is there a hole here, or just awkwardness since in the lruCache getBlock
>   // we end up calling l2Cache.getBlock.
>   // We are not in a position to exactly look at LRU cache or BC as BlockType may not be
>   // getting passed always.
>   return l1Cache.containsBlock(cacheKey)?
>       l1Cache.getBlock(cacheKey, caching, repeat, updateCacheMetrics):
>       l2Cache.getBlock(cacheKey, caching, repeat, updateCacheMetrics);
> }
> {code}
> If L1 does not contain the cacheKey, getBlock is never called on L1, so the L1
> miss metric is always 0.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
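As a sketch of the shape of a fix for the method quoted above, the version below always asks the L1 cache first so that its internal hit/miss counters get updated, and only falls through to L2 on a miss. This assumes l1Cache.getBlock() returns null on a miss and records the miss itself when updateCacheMetrics is true; it is not necessarily the patch that was merged for HBASE-26083 (types are the HBase Cacheable/BlockCacheKey from org.apache.hadoop.hbase.io.hfile).

{code:java}
public Cacheable getBlock(BlockCacheKey cacheKey, boolean caching,
    boolean repeat, boolean updateCacheMetrics) {
  // Ask L1 first; a null return means L1 recorded a miss in its own metrics.
  Cacheable block = l1Cache.getBlock(cacheKey, caching, repeat, updateCacheMetrics);
  if (block != null) {
    return block;  // L1 hit, counted by l1Cache itself
  }
  // The L1 miss has now been counted; consult the L2 (bucket) cache.
  return l2Cache.getBlock(cacheKey, caching, repeat, updateCacheMetrics);
}
{code}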
[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381451#comment-17381451 ] Bharath Vissapragada commented on HBASE-26042: -- Ya scratch that. Now that I look closely, all the flushes are single threaded. Think a concurrent flush is not possible in this specific thread stack at least because this is done _before_ swapping the WAL writer reference (consume path can be ignored). I'm still staring at the code (from the perspective of a single flush not being ack-ed + channel close scenario and race between completed on one channel and failed on another) but don't have any theory yet. [~stack] Its probably easy to prove/disprove this theory/direction if you have a heap dump from that state? Poke around the call back its channel ID vs closed channel and the AsyncFSOutput instance state of the new WAL? > WAL lockup on 'sync failed' > org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer > > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5 >Reporter: Michael Stack >Priority: Major > Attachments: HBASE-26042-test-repro.patch, js1, js2 > > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > Whats supposed to happen when other side goes away like this is that we will > roll the WAL – go set up a new one. You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. 
The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], > org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) > @bci=9, line=202 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, > org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, > long) @bci=107, line=170 (Compiled frame) > - > org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, > org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, > org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) > @bci=61, line=113 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=22, line=651 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=2, line=128 (Compiled frame) > - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) > @bci=101, line=797
[jira] [Commented] (HBASE-25823) TestSlowLogAccessor.testHigherSlowLogs repeatable failure
[ https://issues.apache.org/jira/browse/HBASE-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380762#comment-17380762 ]

Bharath Vissapragada commented on HBASE-25823:
-----------------------------------------------
Still failing on branch-1 (2 of the last 3 runs):

https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/147/
https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148/

I briefly looked at it; I think the time-based wait should be redone as something more deterministic, based on when the records are actually inserted into that table.

> TestSlowLogAccessor.testHigherSlowLogs repeatable failure
> ---------------------------------------------------------
>
>                 Key: HBASE-25823
>                 URL: https://issues.apache.org/jira/browse/HBASE-25823
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.3
>            Reporter: Andrew Kyle Purtell
>            Assignee: Viraj Jasani
>            Priority: Major
>             Fix For: 2.4.5
>
>
> {noformat}
> [ERROR] TestSlowLogAccessor.testHigherSlowLogs:211 Waiting timed out after [7,000] msec
> {noformat}
> Repeatable failure.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
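A hedged sketch of the more deterministic wait suggested above: poll until the slow-log system table actually holds the expected number of records instead of sleeping for a fixed interval. The table name, helper shape, and timeout are assumptions for illustration; the real test would use its own accessors.

{code:java}
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public final class SlowLogWaitSketch {
  private SlowLogWaitSketch() {}

  /** Blocks until the slow-log table has at least {@code expected} rows, or the wait times out. */
  public static void waitForSlowLogRecords(HBaseTestingUtility util, int expected)
      throws Exception {
    TableName slowLogTable = TableName.valueOf("hbase:slowlog");  // assumed table name
    util.waitFor(30_000, () -> {
      try (Table table = util.getConnection().getTable(slowLogTable);
           ResultScanner scanner = table.getScanner(new Scan())) {
        int count = 0;
        for (Result ignored : scanner) {
          count++;
        }
        return count >= expected;  // condition is tied to actual inserts, not wall-clock time
      }
    });
  }
}
{code}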
[jira] [Updated] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26075: - Release Note: Handles 0 length WAL files moved to oldWALs directory so that they do not block the replication queue. > Replication is stuck due to zero length wal file in oldWALs directory. > -- > > Key: HBASE-26075 > URL: https://issues.apache.org/jira/browse/HBASE-26075 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Critical > > Recently we encountered a case where size of log queue was increasing to > around 300 in few region servers in our production environment. > There were 295 wals in the oldWALs directory for that region server and the > *first file* was a 0 length file. > Replication was throwing the following error. > {noformat} > 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Failed to read stream of > replication entries > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: > java.io.EOFException: hdfs:///hbase/oldWALs/ not a > SequenceFile > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156) > Caused by: java.io.EOFException: > hdfs:///hbase/oldWALs/ not a SequenceFile > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934) > at > org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177) > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110) > ... 1 more > {noformat} > We fixed similar error via HBASE-25536 but the zero length file was in > recovered sources. 
> There were more logs after the above stack trace. > {noformat} > 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Couldn't get file length > information about log > hdfs:///hbase/WALs/ > 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] > regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ > was moved to hdfs:///hbase/oldWALs/ > {noformat} > There is a special logic in ReplicationSourceWALReader thread to handle 0 > length files but we only look in WALs directory and not in oldWALs directory. > {code} > private boolean handleEofException(Exception e, WALEntryBatch batch) { > PriorityBlockingQueue queue = logQueue.getQueue(walGroupId); > // Dump the log even if logQueue size is 1 if the source is from > recovered Source > // since we don't add current log to recovered source queue so it is safe > to remove. > if ((e instanceof EOFException || e.getCause() instanceof EOFException) && > (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) { >
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378324#comment-17378324 ] Bharath Vissapragada commented on HBASE-26075: -- [~shahrs87] Please submit a patch for master once you are freed up, merged this in branch-1 for now. Thanks. > Replication is stuck due to zero length wal file in oldWALs directory. > -- > > Key: HBASE-26075 > URL: https://issues.apache.org/jira/browse/HBASE-26075 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Critical > > Recently we encountered a case where size of log queue was increasing to > around 300 in few region servers in our production environment. > There were 295 wals in the oldWALs directory for that region server and the > *first file* was a 0 length file. > Replication was throwing the following error. > {noformat} > 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Failed to read stream of > replication entries > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: > java.io.EOFException: hdfs:///hbase/oldWALs/ not a > SequenceFile > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156) > Caused by: java.io.EOFException: > hdfs:///hbase/oldWALs/ not a SequenceFile > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934) > at > org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177) > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110) > ... 1 more > {noformat} > We fixed similar error via HBASE-25536 but the zero length file was in > recovered sources. 
> There were more logs after the above stack trace. > {noformat} > 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Couldn't get file length > information about log > hdfs:///hbase/WALs/ > 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] > regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ > was moved to hdfs:///hbase/oldWALs/ > {noformat} > There is a special logic in ReplicationSourceWALReader thread to handle 0 > length files but we only look in WALs directory and not in oldWALs directory. > {code} > private boolean handleEofException(Exception e, WALEntryBatch batch) { > PriorityBlockingQueue queue = logQueue.getQueue(walGroupId); > // Dump the log even if logQueue size is 1 if the source is from > recovered Source > // since we don't add current log to recovered source queue so it is safe > to remove. > if ((e instanceof EOFException || e.getCause() instanceof EOFException) && > (source.isRecovered() || queue.size() > 1)
[jira] [Updated] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
[ https://issues.apache.org/jira/browse/HBASE-26070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharath Vissapragada updated HBASE-26070:
------------------------------------------
Release Note: Table metadata created on a 1.7.0 cluster with the faulty serialization from HBASE-26021 is now automatically rewritten in the correct serialization format at active master bootstrap. Refer to the HBase upgrade documentation for 1.7.x for further details.

> Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
> --------------------------------------------------------------------------
>
>                 Key: HBASE-26070
>                 URL: https://issues.apache.org/jira/browse/HBASE-26070
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 1.7.0
>            Reporter: Bharath Vissapragada
>            Assignee: Bharath Vissapragada
>            Priority: Major
>             Fix For: 1.7.1
>
>
> As discussed in the parent issue, 1.7.0 introduced an incompatible table
> serialization change. We need tooling that auto-converts these into 1.7.1
> compatible table descriptors and table states.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26021: - Release Note: Incompatible table metadata serialization that was added to 1.7.0 is undone by this change. This faulty serialization causes upgrade issues from pre-1.7.0 versions to 1.7.0/2.x. Release 1.7.0 will be withdrawn from mirrors and the suggested upgrade version is 1.7.1+. > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug > Components: master >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Assignee: Bharath Vissapragada >Priority: Major > Labels: compatibility, incompatibility, serialization, upgrade > Fix For: 1.7.1 > > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. 
Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at
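A note on the odd characters in the errors above: the "\n", ^E, ^R and ^D bytes are consistent with protobuf wire-format framing (a field tag followed by a length) from a serialized TableName message being handed to code that expects the plain-text name. The snippet below is only an illustration of that byte layout and is not HBase code.
{code:java}
import java.nio.charset.StandardCharsets;

// Illustration only: length-delimited protobuf wire bytes carrying "hbase" and "meta".
// 0x0A (printed as "\n") is a field tag, 0x05 (^E) the length of "hbase",
// 0x12 (^R) the next field tag and 0x04 (^D) the length of "meta", which matches the
// "^Ehbase^R^Dmeta" text rejected by TableName.isLegalNamespaceName in the stack trace.
public class WireBytesDemo {
  public static void main(String[] args) {
    byte[] wire = {
      0x0A, 0x05, 'h', 'b', 'a', 's', 'e',
      0x12, 0x04, 'm', 'e', 't', 'a'
    };
    System.out.println(new String(wire, StandardCharsets.UTF_8));
  }
}
{code}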
[jira] [Updated] (HBASE-25130) [branch-1] Masters in-memory serverHoldings map is not cleared during hbck repair
[ https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25130: - Release Note: (was: Fixed on branch-1.) > [branch-1] Masters in-memory serverHoldings map is not cleared during hbck > repair > - > > Key: HBASE-25130 > URL: https://issues.apache.org/jira/browse/HBASE-25130 > Project: HBase > Issue Type: Bug > Components: hbck >Affects Versions: 1.7.0 >Reporter: Sandeep Guggilam >Assignee: Victor Li >Priority: Major > Fix For: 1.7.1 > > > In case of repairing overlaps, hbck essentially calls the > closeRegion RPC on the RS followed by the offline RPC on the Master to offline all the > overlap regions that would be merged into a new region. > However, the offline RPC doesn't remove the region from the > serverHoldings map unless the new state is MERGED/SPLIT > ([https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719]), > but the new state in this case is OFFLINE. > This is actually intended to match the META entries: the entry > would be removed later when the region comes online on a different server. > However, in our case the region would never come online on a new server, > so the region info is never cleared from the map that the balancer > and SCP consult, leading to incorrect reassignment. > We might need to tackle this by removing the entries from the > map when hbck actually deletes the meta entries for > this region, which matches the in-memory map's expectation with the > META state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25130) [branch-1] Masters in-memory serverHoldings map is not cleared during hbck repair
[ https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25130: - Affects Version/s: 1.7.0 > [branch-1] Masters in-memory serverHoldings map is not cleared during hbck > repair > - > > Key: HBASE-25130 > URL: https://issues.apache.org/jira/browse/HBASE-25130 > Project: HBase > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: Sandeep Guggilam >Assignee: Victor Li >Priority: Major > Fix For: 1.7.1 > > > In case of repairing overlaps, hbck essentially calls the > closeRegion RPC on the RS followed by the offline RPC on the Master to offline all the > overlap regions that would be merged into a new region. > However, the offline RPC doesn't remove the region from the > serverHoldings map unless the new state is MERGED/SPLIT > ([https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719]), > but the new state in this case is OFFLINE. > This is actually intended to match the META entries: the entry > would be removed later when the region comes online on a different server. > However, in our case the region would never come online on a new server, > so the region info is never cleared from the map that the balancer > and SCP consult, leading to incorrect reassignment. > We might need to tackle this by removing the entries from the > map when hbck actually deletes the meta entries for > this region, which matches the in-memory map's expectation with the > META state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25130) [branch-1] Masters in-memory serverHoldings map is not cleared during hbck repair
[ https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25130: - Component/s: hbck > [branch-1] Masters in-memory serverHoldings map is not cleared during hbck > repair > - > > Key: HBASE-25130 > URL: https://issues.apache.org/jira/browse/HBASE-25130 > Project: HBase > Issue Type: Bug > Components: hbck >Affects Versions: 1.7.0 >Reporter: Sandeep Guggilam >Assignee: Victor Li >Priority: Major > Fix For: 1.7.1 > > > In case of repairing overlaps, hbck essentially calls the > closeRegion RPC on the RS followed by the offline RPC on the Master to offline all the > overlap regions that would be merged into a new region. > However, the offline RPC doesn't remove the region from the > serverHoldings map unless the new state is MERGED/SPLIT > ([https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719]), > but the new state in this case is OFFLINE. > This is actually intended to match the META entries: the entry > would be removed later when the region comes online on a different server. > However, in our case the region would never come online on a new server, > so the region info is never cleared from the map that the balancer > and SCP consult, leading to incorrect reassignment. > We might need to tackle this by removing the entries from the > map when hbck actually deletes the meta entries for > this region, which matches the in-memory map's expectation with the > META state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
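To make the direction proposed in the description concrete, the sketch below shows, with purely hypothetical names, what clearing the in-memory bookkeeping when hbck deletes the META entries could look like; it is not the actual branch-1 patch.
{code:java}
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: when hbck deletes the META rows for regions it merged away,
// also drop those regions from the master's in-memory server-to-regions map so the
// balancer and SCP stop considering them for assignment.
final class ServerHoldingsCleanup {
  static void onMetaEntriesDeleted(Map<String, Set<String>> serverHoldings,
      Iterable<String> deletedRegionNames) {
    for (String region : deletedRegionNames) {
      for (Set<String> regionsOnServer : serverHoldings.values()) {
        regionsOnServer.remove(region); // keep the in-memory view consistent with META
      }
    }
  }
}
{code}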
[jira] [Commented] (HBASE-21674) Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from thrift1 server) to branch-1
[ https://issues.apache.org/jira/browse/HBASE-21674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378290#comment-17378290 ] Bharath Vissapragada commented on HBASE-21674: -- Copied the release note from the parent. > Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from > thrift1 server) to branch-1 > > > Key: HBASE-21674 > URL: https://issues.apache.org/jira/browse/HBASE-21674 > Project: HBase > Issue Type: Sub-task > Components: Thrift >Reporter: Andrew Kyle Purtell >Assignee: Yutong Xiao >Priority: Major > Fix For: 1.7.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-21674) Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from thrift1 server) to branch-1
[ https://issues.apache.org/jira/browse/HBASE-21674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-21674: - Release Note: Before this issue, the thrift1 server and thrift2 server were completely separate servers. If a new feature was added to the thrift1 server, the thrift2 server had to make the same change to support it (e.g. authorization). After this issue, the thrift2 server is inherited from the thrift1 server, so the thrift2 server now has all the features the thrift1 server has (e.g. HTTP support, which the thrift2 server didn't have before). The way to start the thrift1 or thrift2 server remains the same after this issue. > Port HBASE-21652 (Refactor ThriftServer making thrift2 server inherited from > thrift1 server) to branch-1 > > > Key: HBASE-21674 > URL: https://issues.apache.org/jira/browse/HBASE-21674 > Project: HBase > Issue Type: Sub-task > Components: Thrift >Reporter: Andrew Kyle Purtell >Assignee: Yutong Xiao >Priority: Major > Fix For: 1.7.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
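The release note describes a structural refactor rather than a new API. A toy sketch of the idea, using simplified names rather than the real HBase classes: shared infrastructure (such as the HTTP transport) lives in the thrift1 base server, and the thrift2 server only overrides the pieces that differ.
{code:java}
// Toy illustration only; these are not the real HBase class names.
class ThriftOneServer {
  void startHttpTransport() {
    // shared HTTP support, implemented once in the thrift1 server
  }

  Object createHandler() {
    return new Object(); // thrift1 request handler
  }

  final void run() {
    startHttpTransport();
    Object handler = createHandler();
    // ... bind the handler to the transport and serve ...
  }
}

class ThriftTwoServer extends ThriftOneServer {
  @Override
  Object createHandler() {
    return new Object(); // thrift2 request handler; everything else is inherited
  }
}
{code}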
[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25984: - Release Note: Fixes a WAL lockup issue due to premature reuse of the sync futures by the WAL consumers. The lockup causes the WAL system to hang resulting in blocked appends and syncs thus holding up the RPC handlers from progressing. Only workaround without this fix is to force abort the region server. > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5 > > Attachments: HBASE-25984-unit-test.patch > > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. 
> {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
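As a self-contained illustration of the failure mode named in the release note (premature reuse of a sync future dropping a wake-up), consider the toy class below. It is a minimal model, not the FSHLog code.
{code:java}
// Toy model only: a reusable "sync future" keyed by txid. If the producer calls reset()
// for a new txid before the consumer delivers done() for the old one, the old waiter
// never sees done == true and blocks forever, mirroring the handlers stuck in
// SyncFuture.get above.
class MiniSyncFuture {
  private long txid;
  private boolean done;

  synchronized MiniSyncFuture reset(long newTxid) {
    txid = newTxid;
    done = false;
    return this;
  }

  synchronized void done(long doneTxid) {
    if (doneTxid >= txid) { // a done() for an older txid is silently dropped after reuse
      done = true;
      notifyAll();
    }
  }

  synchronized void get() throws InterruptedException {
    while (!done) {
      wait(); // waits forever if the wake-up for our txid was dropped
    }
  }
}
{code}
A thread blocked in get() for txid 10 hangs if the same object is reset for txid 11 before done(10) arrives, because done(10) no longer satisfies the condition.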
[jira] [Updated] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1
[ https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26074: - Resolution: Fixed Status: Resolved (was: Patch Available) > testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on > branch-1 > -- > > Key: HBASE-26074 > URL: https://issues.apache.org/jira/browse/HBASE-26074 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: regression > Fix For: 1.7.1 > > > In prep-ing for 1.7.1, I noticed that this test has been consistently failing > on branch-1 forever. > {noformat} > Regression > health checks / yetus jdk8 hadoop2 checks / > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego > Failing for the past 1 build (Since Failed#145 ) > Took 0.79 sec. > Error Message > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at > org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189) > at > org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > Stacktrace > java.lang.AssertionError: > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at
[jira] [Updated] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1
[ https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26074: - Status: Patch Available (was: Open) > testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on > branch-1 > -- > > Key: HBASE-26074 > URL: https://issues.apache.org/jira/browse/HBASE-26074 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: regression > Fix For: 1.7.1 > > > In prep-ing for 1.7.1, I noticed that this test has been consistently failing > on branch-1 forever. > {noformat} > Regression > health checks / yetus jdk8 hadoop2 checks / > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego > Failing for the past 1 build (Since Failed#145 ) > Took 0.79 sec. > Error Message > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at > org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189) > at > org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > Stacktrace > java.lang.AssertionError: > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377533#comment-17377533 ] Bharath Vissapragada commented on HBASE-26075: -- Hoping to get an RC out before end of this week (unless there are any unforeseen delays in the process), any chance this can be committed before then? I can help review the patch. > Replication is stuck due to zero length wal file in oldWALs directory. > -- > > Key: HBASE-26075 > URL: https://issues.apache.org/jira/browse/HBASE-26075 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Critical > > Recently we encountered a case where size of log queue was increasing to > around 300 in few region servers in our production environment. > There were 295 wals in the oldWALs directory for that region server and the > *first file* was a 0 length file. > Replication was throwing the following error. > {noformat} > 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Failed to read stream of > replication entries > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: > java.io.EOFException: hdfs:///hbase/oldWALs/ not a > SequenceFile > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156) > Caused by: java.io.EOFException: > hdfs:///hbase/oldWALs/ not a SequenceFile > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934) > at > org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168) > at > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177) > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110) > ... 
1 more > {noformat} > We fixed similar error via HBASE-25536 but the zero length file was in > recovered sources. > There were more logs after the above stack trace. > {noformat} > 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] > regionserver.ReplicationSourceWALReaderThread - Couldn't get file length > information about log > hdfs:///hbase/WALs/ > 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] > regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ > was moved to hdfs:///hbase/oldWALs/ > {noformat} > There is a special logic in ReplicationSourceWALReader thread to handle 0 > length files but we only look in WALs directory and not in oldWALs directory. > {code} > private boolean handleEofException(Exception e, WALEntryBatch batch) { > PriorityBlockingQueue queue = logQueue.getQueue(walGroupId); > // Dump the log even if logQueue size is 1 if the source is from > recovered Source > // since we don't add current log to recovered source queue so it is safe > to remove. > if ((e instanceof EOFException || e.getCause()
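To illustrate the gap described above (the zero-length check only consulting the WALs directory), here is a minimal sketch that looks in both the active and archived locations; the helper and its parameters are hypothetical, not the committed fix.
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: when an EOFException is hit, treat the WAL as a zero-length file
// whether it still sits under WALs/ or has already been moved to oldWALs/.
final class ZeroLengthWalCheck {
  static boolean isZeroLengthWal(FileSystem fs, Path walsDir, Path oldWalsDir, String walName)
      throws IOException {
    for (Path dir : new Path[] { walsDir, oldWalsDir }) {
      Path candidate = new Path(dir, walName);
      if (fs.exists(candidate)) {
        return fs.getFileStatus(candidate).getLen() == 0;
      }
    }
    return false; // not found in either directory; the caller decides what to do then
  }
}
{code}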
[jira] [Commented] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1
[ https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377523#comment-17377523 ] Bharath Vissapragada commented on HBASE-26074: -- I figured out the issue, it appears the difference is in the SSL libraries shipped by open jdk (or oracle jdks, commonly used by the developers) and the azul JDKs used in the yetus test runs. If we carefully notice the bottom of the stack on the server side doing the SSL hand shakes, we can see the difference. Open JDK's SSL libs has special handling that recognizes it is an HTTP connection. {noformat} 2021-07-08 10:25:33,541 WARN [984520532@qtp-1977083615-0] log.Slf4jLog(89): EXCEPTION javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection? at sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:710) at sun.security.ssl.InputRecord.read(InputRecord.java:527) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379) at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 2021-07-08 10:25:33,542 WARN [1227853087@qtp-1977083615-2] log.Slf4jLog(89): EXCEPTION javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection? at sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:710) at sun.security.ssl.InputRecord.read(InputRecord.java:527) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379) at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 2021-07-08 10:25:33,543 INFO [Time-limited test] log.Slf4jLog(67): Stopped SslSocketConnectorSecure@localhost:0 {noformat} Whereas the azul JDK uses a different version of ssl lib that treats it as a generic SSL failure and resets the connection. 
{noformat} javax.net.ssl.SSLException: Unsupported or unrecognized SSL message at sun.security.ssl.SSLSocketInputRecord.handleUnknownRecord(SSLSocketInputRecord.java:440) at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:175) at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110) at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1197) at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1106) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:398) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:370) at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 2021-07-08 13:56:33,322 WARN [189676137@qtp-1691391859-0] log.Slf4jLog(89): EXCEPTION javax.net.ssl.SSLException: Unsupported or unrecognized SSL message at sun.security.ssl.SSLSocketInputRecord.handleUnknownRecord(SSLSocketInputRecord.java:440) at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:175) at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110) at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1197) at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1106) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:398) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:370) at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:708) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 2021-07-08 13:56:33,323 INFO [Time-limited test] log.Slf4jLog(67): Stopped SslSocketConnectorSecure@localhost:0 {noformat} I downloaded the azul JDK from https://www.azul.com/downloads/?version=java-8-lts=macos=x86-64-bit=jdk=true (tests seem to use `zulu8.48.0.51-ca-jdk8.0.26`) and then I can reproduce the same error on my mac book. Let me create a patch to generically assert the error. [~reidchan] FYI (long pending branch-1 unit test issue). > testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on > branch-1 >
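One way to "generically assert the error", as mentioned above, is to accept any of the messages the different JDK SSL stacks can produce when a plaintext request hits the HTTPS port. The helper below is only a sketch of that idea, not the committed patch.
{code:java}
import static org.junit.Assert.assertTrue;

// Hypothetical assertion helper: pass if the failure matches any known wording from
// the OpenJDK/Oracle or Azul SSL stacks for a plaintext request on an SSL socket.
final class SslErrorAssertions {
  static void assertPlaintextOverSslRejected(Exception e) {
    String msg = String.valueOf(e.getMessage());
    boolean recognized = msg.contains("Unexpected end of file from server")
        || msg.contains("Connection reset")
        || msg.contains("Unrecognized SSL message")
        || msg.contains("Unsupported or unrecognized SSL message");
    assertTrue("Unexpected failure: " + e, recognized);
  }
}
{code}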
[jira] [Commented] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1
[ https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377499#comment-17377499 ] Bharath Vissapragada commented on HBASE-26074: -- Runs ok locally (mac), looks like the issue is only on CI. > testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on > branch-1 > -- > > Key: HBASE-26074 > URL: https://issues.apache.org/jira/browse/HBASE-26074 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: regression > Fix For: 1.7.1 > > > In prep-ing for 1.7.1, I noticed that this test has been consistently failing > on branch-1 forever. > {noformat} > Regression > health checks / yetus jdk8 hadoop2 checks / > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego > Failing for the past 1 build (Since Failed#145 ) > Took 0.79 sec. > Error Message > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at > org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189) > at > org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > Stacktrace > java.lang.AssertionError: > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at
[jira] [Updated] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1
[ https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26074: - Labels: regression (was: ) > testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on > branch-1 > -- > > Key: HBASE-26074 > URL: https://issues.apache.org/jira/browse/HBASE-26074 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Labels: regression > Fix For: 1.7.1 > > > In prep-ing for 1.7.1, I noticed that this test has been consistently failing > on branch-1 forever. > {noformat} > Regression > health checks / yetus jdk8 hadoop2 checks / > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego > Failing for the past 1 build (Since Failed#145 ) > Took 0.79 sec. > Error Message > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at > org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189) > at > org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > Stacktrace > java.lang.AssertionError: > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at
[jira] [Commented] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1
[ https://issues.apache.org/jira/browse/HBASE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377491#comment-17377491 ] Bharath Vissapragada commented on HBASE-26074: -- Sample jenkins job (that will expire soon anyway): https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/145/ > testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on > branch-1 > -- > > Key: HBASE-26074 > URL: https://issues.apache.org/jira/browse/HBASE-26074 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 1.7.1 > > > In prep-ing for 1.7.1, I noticed that this test has been consistently failing > on branch-1 forever. > {noformat} > Regression > health checks / yetus jdk8 hadoop2 checks / > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego > Failing for the past 1 build (Since Failed#145 ) > Took 0.79 sec. > Error Message > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at > org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189) > at > org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227) > at > org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123) > at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325) > at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277) > at > org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > Stacktrace > java.lang.AssertionError: > Expected to find 'Unexpected end of file from server' but got unexpected > exception:java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at
[jira] [Created] (HBASE-26074) testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1
Bharath Vissapragada created HBASE-26074: Summary: testLogLevelByHttps/testLogLevelByHttpsWithSpnego consistently failing on branch-1 Key: HBASE-26074 URL: https://issues.apache.org/jira/browse/HBASE-26074 Project: HBase Issue Type: Bug Components: test Affects Versions: 1.7.0 Reporter: Bharath Vissapragada Assignee: Bharath Vissapragada Fix For: 1.7.1 In prep-ing for 1.7.1, I noticed that this test has been consistently failing on branch-1 forever. {noformat} Regression health checks / yetus jdk8 hadoop2 checks / org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego Failing for the past 1 build (Since Failed#145 ) Took 0.79 sec. Error Message Expected to find 'Unexpected end of file from server' but got unexpected exception:java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInputStream.java:210) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189) at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:347) at org.apache.hadoop.hbase.http.log.LogLevel$CLI.connect(LogLevel.java:268) at org.apache.hadoop.hbase.http.log.LogLevel$CLI.process(LogLevel.java:284) at org.apache.hadoop.hbase.http.log.LogLevel$CLI.doGetLevel(LogLevel.java:227) at org.apache.hadoop.hbase.http.log.LogLevel$CLI.sendLogLevelRequest(LogLevel.java:123) at org.apache.hadoop.hbase.http.log.LogLevel$CLI.run(LogLevel.java:107) at org.apache.hadoop.hbase.http.log.TestLogLevel.getLevel(TestLogLevel.java:349) at org.apache.hadoop.hbase.http.log.TestLogLevel.access$000(TestLogLevel.java:68) at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:325) at org.apache.hadoop.hbase.http.log.TestLogLevel$1.run(TestLogLevel.java:322) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) at org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:322) at org.apache.hadoop.hbase.http.log.TestLogLevel.testDynamicLogLevel(TestLogLevel.java:277) at org.apache.hadoop.hbase.http.log.TestLogLevel.testLogLevelByHttpsWithSpnego(TestLogLevel.java:451) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) Stacktrace java.lang.AssertionError: Expected to find 'Unexpected end of file from server' but got unexpected exception:java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInputStream.java:210) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:702) at
[jira] [Updated] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
[ https://issues.apache.org/jira/browse/HBASE-26070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26070: - Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~reidchan]. > Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables > -- > > Key: HBASE-26070 > URL: https://issues.apache.org/jira/browse/HBASE-26070 > Project: HBase > Issue Type: Sub-task >Affects Versions: 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 1.7.1 > > > As discussed in the parent issue, 1.7.0 introduced an incompatible table > serialization change. We need tooling that auto converts these into 1.7.1 > compatible table descriptors and table states. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376818#comment-17376818 ] Bharath Vissapragada commented on HBASE-26042: -- bq. I think there is some racy code in FanOutOneBlockAsyncDFSOutput especially around when a new writer is marked 'BROKEN' and a flush is called resulting in some waitingAckQueue members not being cleaned up correctly. Just a theory at this point, but probably easy to poke around with a heap dump or some related logging around these code paths. [~stack] Attached a unit test repro patch (which includes comments that explain the race). I followed the theory of broken cb cleanup in FanOutOneBlockAsyncDFSOutput that I alluded to above. With the patch, you can run TestFanOutOneBlockAsyncDFSOutput#testRecover under a debugger and set break point at latch#await() to inspect the state. WDYT. > WAL lockup on 'sync failed' > org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer > > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5 >Reporter: Michael Stack >Priority: Major > Attachments: HBASE-26042-test-repro.patch, js1, js2 > > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > Whats supposed to happen when other side goes away like this is that we will > roll the WAL – go set up a new one. You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. 
The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], > org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) > @bci=9, line=202 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, > org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, > long) @bci=107, line=170 (Compiled frame) > - > org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, > org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, > org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) > @bci=61, line=113 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=22, line=651 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=2, line=128 (Compiled frame) > - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) > @bci=101,
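To make the waitingAckQueue theory above easier to picture, here is a generic sketch (not the FanOutOneBlockAsyncDFSOutput code) of how an error path that fails only a snapshot of pending callbacks can strand a waiter added concurrently by flush(); draining the queue until empty is one way to avoid that.
{code:java}
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Generic illustration only: callbacks waiting for acks are queued, and the error path
// must fail every one of them, including callbacks enqueued after the writer was marked
// broken. Polling until the queue is empty avoids leaving such a late waiter behind.
final class AckQueueSketch {
  final Queue<CompletableFuture<Long>> waitingAckQueue = new ConcurrentLinkedQueue<>();

  void failAllPending(Throwable cause) {
    for (CompletableFuture<Long> cb; (cb = waitingAckQueue.poll()) != null; ) {
      cb.completeExceptionally(cause);
    }
  }
}
{code}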
[jira] [Comment Edited] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376818#comment-17376818 ] Bharath Vissapragada edited comment on HBASE-26042 at 7/7/21, 8:21 PM: --- bq. I think there is some racy code in FanOutOneBlockAsyncDFSOutput especially around when a new writer is marked 'BROKEN' and a flush is called resulting in some waitingAckQueue members not being cleaned up correctly. Just a theory at this point, but probably easy to poke around with a heap dump or some related logging around these code paths. [~stack] Attached a unit test repro patch (on branch-2.3) (which includes comments that explain the race). I followed the theory of broken cb cleanup in FanOutOneBlockAsyncDFSOutput that I alluded to above. With the patch, you can run TestFanOutOneBlockAsyncDFSOutput#testRecover under a debugger and set break point at latch#await() to inspect the state. WDYT. was (Author: bharathv): bq. I think there is some racy code in FanOutOneBlockAsyncDFSOutput especially around when a new writer is marked 'BROKEN' and a flush is called resulting in some waitingAckQueue members not being cleaned up correctly. Just a theory at this point, but probably easy to poke around with a heap dump or some related logging around these code paths. [~stack] Attached a unit test repro patch (which includes comments that explain the race). I followed the theory of broken cb cleanup in FanOutOneBlockAsyncDFSOutput that I alluded to above. With the patch, you can run TestFanOutOneBlockAsyncDFSOutput#testRecover under a debugger and set break point at latch#await() to inspect the state. WDYT. > WAL lockup on 'sync failed' > org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer > > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5 >Reporter: Michael Stack >Priority: Major > Attachments: HBASE-26042-test-repro.patch, js1, js2 > > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > Whats supposed to happen when other side goes away like this is that we will > roll the WAL – go set up a new one. You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. 
The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], > org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) > @bci=9, line=202 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, >
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26042: - Attachment: HBASE-26042-test-repro.patch > WAL lockup on 'sync failed' > org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer > > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5 >Reporter: Michael Stack >Priority: Major > Attachments: HBASE-26042-test-repro.patch, js1, js2 > > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > Whats supposed to happen when other side goes away like this is that we will > roll the WAL – go set up a new one. You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. 
The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], > org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) > @bci=9, line=202 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, > org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, > long) @bci=107, line=170 (Compiled frame) > - > org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, > org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, > org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) > @bci=61, line=113 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=22, line=651 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=2, line=128 (Compiled frame) > - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) > @bci=101, line=797 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) > @bci=18, line=263 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 > (Compiled frame) {code} > > Other threads are BLOCKED trying to append the WAL w/ flush markers etc. > unable to add the ringbuffer: > > {code:java} > Thread 9465: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 > (Compiled frame) > - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 > (Compiled frame) >
[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT
[ https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376713#comment-17376713 ] Bharath Vissapragada commented on HBASE-25761: -- No worries, looking forward to the meeting notes. > POC: hbase:meta,,1 as ROOT > -- > > Key: HBASE-25761 > URL: https://issues.apache.org/jira/browse/HBASE-25761 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Assignee: Francis Christopher Liu >Priority: Major > > One of the proposals up in the split-meta design doc suggests a > sleight-of-hand where the current hard-coded hbase:meta,,1 Region is > leveraged to serve as first Region of a split hbase:meta but also does > double-duty as 'ROOT'. This suggestion was put aside as a complicating > recursion in chat but then Francis noticed on a re-read of the BigTable > paper, that this is how they describe they do 'ROOT': "The root tablet is > just the first tablet in the METADATA table, but is treated specially -- it > is never split..." > This issue is for playing around with this notion to see what the problems > are so can do a better description of this approach here, in the design: > https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT
[ https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376688#comment-17376688 ] Bharath Vissapragada commented on HBASE-25761: -- [~stack] No sir, my appointment is 3pm-5pm Pacific, hence was suggesting 6pm. I can either do today 6pm (which doesn't work for Duo) or any time tomorrow (Thursday evening pacific / Friday morning China). > POC: hbase:meta,,1 as ROOT > -- > > Key: HBASE-25761 > URL: https://issues.apache.org/jira/browse/HBASE-25761 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Assignee: Francis Christopher Liu >Priority: Major > > One of the proposals up in the split-meta design doc suggests a > sleight-of-hand where the current hard-coded hbase:meta,,1 Region is > leveraged to serve as first Region of a split hbase:meta but also does > double-duty as 'ROOT'. This suggestion was put aside as a complicating > recursion in chat but then Francis noticed on a re-read of the BigTable > paper, that this is how they describe they do 'ROOT': "The root tablet is > just the first tablet in the METADATA table, but is treated specially -- it > is never split..." > This issue is for playing around with this notion to see what the problems > are so can do a better description of this approach here, in the design: > https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT
[ https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376663#comment-17376663 ] Bharath Vissapragada commented on HBASE-25761: -- bq. Shout if it works for you and I'll update the dev list. Thanks. Unlikely that I'll be home before 6pm with the traffic. I'll join once I'm back but please continue with the original schedule. I'll catch up using the meeting notes. Looking forward to the progress in this work and happy to contribute to it. > POC: hbase:meta,,1 as ROOT > -- > > Key: HBASE-25761 > URL: https://issues.apache.org/jira/browse/HBASE-25761 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Assignee: Francis Christopher Liu >Priority: Major > > One of the proposals up in the split-meta design doc suggests a > sleight-of-hand where the current hard-coded hbase:meta,,1 Region is > leveraged to serve as first Region of a split hbase:meta but also does > double-duty as 'ROOT'. This suggestion was put aside as a complicating > recursion in chat but then Francis noticed on a re-read of the BigTable > paper, that this is how they describe they do 'ROOT': "The root tablet is > just the first tablet in the METADATA table, but is treated specially -- it > is never split..." > This issue is for playing around with this notion to see what the problems > are so can do a better description of this approach here, in the design: > https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT
[ https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376138#comment-17376138 ] Bharath Vissapragada commented on HBASE-25761: -- If 6pm Pacific (9am China) works for everyone, prefer that. If not please continue with the original schedule, I'll join when I'm back. Thanks. > POC: hbase:meta,,1 as ROOT > -- > > Key: HBASE-25761 > URL: https://issues.apache.org/jira/browse/HBASE-25761 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Assignee: Francis Christopher Liu >Priority: Major > > One of the proposals up in the split-meta design doc suggests a > sleight-of-hand where the current hard-coded hbase:meta,,1 Region is > leveraged to serve as first Region of a split hbase:meta but also does > double-duty as 'ROOT'. This suggestion was put aside as a complicating > recursion in chat but then Francis noticed on a re-read of the BigTable > paper, that this is how they describe they do 'ROOT': "The root tablet is > just the first tablet in the METADATA table, but is treated specially -- it > is never split..." > This issue is for playing around with this notion to see what the problems > are so can do a better description of this approach here, in the design: > https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT
[ https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376121#comment-17376121 ] Bharath Vissapragada commented on HBASE-25761: -- [~stack] I think the at-mention was for me (looks like an incorrect tag, sorry if it's not). Either way, I have an appointment until 5pm Pacific and then I need to drive back, so I may be able to join at 6pm if folks are still around. I can catch up later via some meeting notes/summary; I don't want the meeting to be moved because of me. Thanks (looking forward to the meeting outcome _smile_). > POC: hbase:meta,,1 as ROOT > -- > > Key: HBASE-25761 > URL: https://issues.apache.org/jira/browse/HBASE-25761 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Assignee: Francis Christopher Liu >Priority: Major > > One of the proposals up in the split-meta design doc suggests a > sleight-of-hand where the current hard-coded hbase:meta,,1 Region is > leveraged to serve as first Region of a split hbase:meta but also does > double-duty as 'ROOT'. This suggestion was put aside as a complicating > recursion in chat but then Francis noticed on a re-read of the BigTable > paper, that this is how they describe they do 'ROOT': "The root tablet is > just the first tablet in the METADATA table, but is treated specially -- it > is never split..." > This issue is for playing around with this notion to see what the problems > are so can do a better description of this approach here, in the design: > https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376119#comment-17376119 ] Bharath Vissapragada commented on HBASE-26021: -- [PR|https://github.com/apache/hbase/pull/3463] for 1.7.0 table fixups, any feedback appreciated. > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug > Components: master >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Assignee: Bharath Vissapragada >Priority: Major > Labels: compatibility, incompatibility, serialization, upgrade > Fix For: 1.7.1 > > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at >
[jira] [Updated] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
[ https://issues.apache.org/jira/browse/HBASE-26070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26070: - Status: Patch Available (was: Open) > Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables > -- > > Key: HBASE-26070 > URL: https://issues.apache.org/jira/browse/HBASE-26070 > Project: HBase > Issue Type: Sub-task >Affects Versions: 1.7.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 1.7.1 > > > As discussed in the parent issue, 1.7.0 introduced an incompatible table > serialization change. We need tooling that auto converts these into 1.7.1 > compatible table descriptors and table states. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375827#comment-17375827 ] Bharath Vissapragada commented on HBASE-26021: -- Created subtasks for the documentation and serialization fixups. I think it is unlikely that anyone upgraded to 1.7.0 (with existing tables) because they'd run into this problem. However agree that there could be users who installed 1.7.0 from scratch or upgraded an empty cluster to 1.7.0 and then created tables with the faulty serialization. Let me see if there is an easy way to fix them up at runtime (with some caveats). > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug > Components: master >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Assignee: Bharath Vissapragada >Priority: Major > Labels: compatibility, incompatibility, serialization, upgrade > Fix For: 1.7.1 > > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. 
Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at
[jira] [Created] (HBASE-26071) Document HBASE-26070 and upgrade considerations for 1.7.0/1.7.1
Bharath Vissapragada created HBASE-26071: Summary: Document HBASE-26070 and upgrade considerations for 1.7.0/1.7.1 Key: HBASE-26071 URL: https://issues.apache.org/jira/browse/HBASE-26071 Project: HBase Issue Type: Sub-task Components: documentation Affects Versions: 1.7.0 Reporter: Bharath Vissapragada Assignee: Bharath Vissapragada Fix For: 1.7.1 Document the incompatible serialization problem and upgrade paths for users that are on 1.7.0 and prior releases. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26070) Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables
Bharath Vissapragada created HBASE-26070: Summary: Tooling to auto-convert 1.7.0 serialized tables to 1.7.1 compatible tables Key: HBASE-26070 URL: https://issues.apache.org/jira/browse/HBASE-26070 Project: HBase Issue Type: Sub-task Affects Versions: 1.7.0 Reporter: Bharath Vissapragada Assignee: Bharath Vissapragada Fix For: 1.7.1 As discussed in the parent issue, 1.7.0 introduced an incompatible table serialization change. We need tooling that auto converts these into 1.7.1 compatible table descriptors and table states. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372926#comment-17372926 ] Bharath Vissapragada commented on HBASE-26042: -- Thanks for the jstacks, I think consume is not scheduled further because the sync is broken and that is not clearing up the ring buffer. To me the most suspicious stack is the following hung flush thread. {noformat} Thread 9464: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame) - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, line=1707 (Compiled frame) - java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) @bci=119, line=3323 (Compiled frame) - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, line=1742 (Compiled frame) - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled frame) - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) @bci=16, line=189 (Compiled frame) - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) @bci=9, line=202 (Compiled frame) {noformat} Given the state of this thread, I think the new writer instance (from the roll) is also broken for some reason and some of the callbacks (one of which is the future from the above thread) are not cleaned up correctly. I think there is some racy code in FanOutOneBlockAsyncDFSOutput especially around when a new writer is marked 'BROKEN' and a flush is called resulting in some waitingAckQueue members not being cleaned up correctly. Just a theory at this point, but probably easy to poke around with a heap dump or some related logging around these code paths. > WAL lockup on 'sync failed' > org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer > > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5 >Reporter: Michael Stack >Priority: Major > Attachments: js1, js2 > > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > Whats supposed to happen when other side goes away like this is that we will > roll the WAL – go set up a new one. 
You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - >
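To make the suspected interleaving concrete, here is a purely illustrative sketch (not the actual FanOutOneBlockAsyncDFSOutput code) of how a flush callback enqueued while the stream is being marked BROKEN can be missed by the failure cleanup, leaving a caller blocked on future.get() as in the stack above:
{code:java}
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative model of the suspected race: the state check in flush() and the
// queue drain in markBroken() are not atomic, so a future added in the window
// between them is never completed and its waiter hangs forever.
class BrokenWriterRaceSketch {
  enum State { STREAMING, BROKEN }

  private volatile State state = State.STREAMING;
  private final Queue<CompletableFuture<Long>> waitingAckQueue = new ConcurrentLinkedQueue<>();

  CompletableFuture<Long> flush() {
    CompletableFuture<Long> future = new CompletableFuture<>();
    if (state == State.STREAMING) {
      // If markBroken() runs right here, this future is enqueued after the
      // cleanup below has already drained the queue: nobody completes it.
      waitingAckQueue.add(future);
    } else {
      future.completeExceptionally(new IOException("stream already broken"));
    }
    return future;
  }

  void markBroken() {
    state = State.BROKEN;
    // Completes only the futures that were enqueued before this point.
    CompletableFuture<Long> f;
    while ((f = waitingAckQueue.poll()) != null) {
      f.completeExceptionally(new IOException("stream broken"));
    }
  }
}
{code}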
[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372080#comment-17372080 ] Bharath Vissapragada commented on HBASE-26042: -- Any chance you can attach a full jstack? Thanks. [~vjasani] [~dmanning] [~apurtell] FYI. > WAL lockup on 'sync failed' > org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer > > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5 >Reporter: Michael Stack >Priority: Major > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > Whats supposed to happen when other side goes away like this is that we will > roll the WAL – go set up a new one. You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. 
The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], > org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) > @bci=9, line=202 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, > org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, > long) @bci=107, line=170 (Compiled frame) > - > org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, > org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, > org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) > @bci=61, line=113 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=22, line=651 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=2, line=128 (Compiled frame) > - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) > @bci=101, line=797 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) > @bci=18, line=263 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 > (Compiled frame) {code} > > Other threads are BLOCKED trying to append the WAL w/ flush markers etc. > unable to add the ringbuffer: > > {code:java} > Thread 9465: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 > (Compiled frame) > - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136
[jira] [Updated] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26021: - Component/s: master > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug > Components: master >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 1.7.1 > > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > 
[RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at >
[jira] [Updated] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-26021: - Labels: compatibility incompatibility serialization upgrade (was: ) > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug > Components: master >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Assignee: Bharath Vissapragada >Priority: Major > Labels: compatibility, incompatibility, serialization, upgrade > Fix For: 1.7.1 > > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at >
[jira] [Resolved] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada resolved HBASE-26021. -- Fix Version/s: 1.7.1 Resolution: Fixed Merged the change, thanks for the reviews. Spoke with [~vjasani] offline, he mentioned he will confirm whether this patch works for him in his local test setup. I volunteer to do a 1.7.0.1 if the patch works for Viraj. (unless there are any objections). > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 1.7.1 > > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at 
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at >
[jira] [Commented] (HBASE-26029) It is not reliable to use nodeDeleted event to track region server's death
[ https://issues.apache.org/jira/browse/HBASE-26029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371044#comment-17371044 ] Bharath Vissapragada commented on HBASE-26029: -- Hooking this up with SCP is neat; it should clean up all the dead server management code from ReplicationSourceManager. +1. > It is not reliable to use nodeDeleted event to track region server's death > -- > > Key: HBASE-26029 > URL: https://issues.apache.org/jira/browse/HBASE-26029 > Project: HBase > Issue Type: Bug > Components: Replication, Zookeeper >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Critical > > When implementing HBASE-26011, [~sunxin] pointed out an interesting scenario, > where a region server goes up and down between two sync requests, so we cannot > know about the death of the region server. > https://github.com/apache/hbase/pull/3405#discussion_r656720923 > This is a valid point, and when thinking of a solution, I noticed that the > current zk implementation has the same problem. Notice that a watcher on zk > can only be triggered once, so after zk triggers the watcher, and before you > set a new watcher, it is possible that a region server goes up and down, and > you will miss the nodeDeleted event for this region server. > I think the general approach here, which could work for both master-based > and zk-based replication trackers, is that we should not rely on the tracker > to tell you which region server is dead. Instead, we just provide the list of > live regionservers, and the upper layer should compare this list with the > expected list (for replication, the list should be gotten by listing > replicators), to detect the dead region servers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
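The "compare against the expected list" idea described in the issue amounts to a simple set difference; a hedged sketch of that approach (class and method names are illustrative, not actual HBase code):
{code:java}
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed approach: do not rely on one-shot nodeDeleted events;
// instead diff the expected replicator list against the currently live servers.
final class DeadServerDiffSketch {
  static Set<String> findDeadServers(Set<String> expectedReplicators, Set<String> liveRegionServers) {
    Set<String> dead = new HashSet<>(expectedReplicators);
    dead.removeAll(liveRegionServers); // expected but no longer live => dead
    return dead;
  }
}
{code}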
[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371034#comment-17371034 ] Bharath Vissapragada commented on HBASE-26021: -- https://github.com/apache/hbase/pull/3435 I did a quick revert in the interest of time ([~vjasani] hope its ok). It was a bit involved due to the other piled up changes on top of it. I put it up for review to see how jenkins reacts to it. I think it could use some additional test coverage for the new interface methods. > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Priority: Major > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at 
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at >
[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370424#comment-17370424 ] Bharath Vissapragada commented on HBASE-26021: -- Sorry, late to the party, I was OOO this week and couldn't access jira, catching up on the latest discussion. Agree that 1.7.0 is broken and we should spin up 1.7.0.1 ([~reidchan] I can also help create the release if you are busy, let me know, there is another critical fix HBASE-25984 that I recently committed to branch-1 that is worthy of inclusion). Coming to the fix, isn't the patch incomplete ? If we just serialize HTD instead of TD (=HTD + table state), there is loss of information, right? We are just not seeing it in tests because we cache the table state in TableStateManager, so something like disable table, stop hbase, start hbase should result in an enabled table? Let's properly revert this patch instead? (We may need to add some special code to handle serialized TDs for those who are already on 1.7.0?) > HBase 1.7 to 2.4 upgrade issue due to incompatible deserialization > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Priority: Major > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase meta*" and "*\n hbase namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. 
Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta >
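A hypothetical repro sketch of the table-state concern raised in the comment above, assuming a branch-1 style mini-cluster test (the class and flow are illustrative, not a committed test): if only the HTD portion of the descriptor is persisted, a disabled table should come back enabled once the TableStateManager cache is rebuilt after a restart.
{code:java}
import static org.junit.Assert.assertTrue;

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.TableName;
import org.junit.Test;

public class TestTableStateSurvivesRestart {
  private static final HBaseTestingUtility TEST_UTIL = new HBaseTestingUtility();

  @Test
  public void testDisabledStateSurvivesRestart() throws Exception {
    TEST_UTIL.startMiniCluster(1);
    TableName tn = TableName.valueOf("t1");
    TEST_UTIL.createTable(tn, "f");
    TEST_UTIL.getHBaseAdmin().disableTable(tn);
    // Restart so state is rebuilt from what was actually persisted, rather than
    // served from the in-memory TableStateManager cache.
    TEST_UTIL.shutdownMiniHBaseCluster();
    TEST_UTIL.restartHBaseCluster(1);
    // If the serialized descriptor no longer carries the table state, this
    // assertion fails because the table comes back enabled.
    assertTrue(TEST_UTIL.getHBaseAdmin().isTableDisabled(tn));
    TEST_UTIL.shutdownMiniCluster();
  }
}
{code}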
[jira] [Commented] (HBASE-26021) HBase 1.7 to 2.4 upgrade issue
[ https://issues.apache.org/jira/browse/HBASE-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368469#comment-17368469 ] Bharath Vissapragada commented on HBASE-26021: -- [~vjasani] Thanks for digging. I suspected that commit could cause some edge cases like this. The commit in its exact form is not needed for the work IIRC (it diverged the code quite a bit, so we had to backport it), so we can undo certain parts of the serialization that is incompatible with branch-1. Did you happen to dig into what exactly is off in the serialization piece? > HBase 1.7 to 2.4 upgrade issue > -- > > Key: HBASE-26021 > URL: https://issues.apache.org/jira/browse/HBASE-26021 > Project: HBase > Issue Type: Bug >Affects Versions: 1.7.0, 2.4.4 >Reporter: Viraj Jasani >Priority: Major > Attachments: Screenshot 2021-06-22 at 12.54.21 PM.png, Screenshot > 2021-06-22 at 12.54.30 PM.png > > > As of today, if we bring up HBase cluster using branch-1 and upgrade to > branch-2.4, we are facing issue while parsing namespace from HDFS fileinfo. > Instead of "*hbase:meta*" and "*hbase:namespace*", parsing using ProtobufUtil > seems to be producing "*\n hbase:\n meta*" and "*\n hbase:\n namespace*" > {code:java} > 2021-06-22 00:05:56,611 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > regionserver.RSRpcServices: Open hbase:meta,,1.1588230740 > 2021-06-22 00:05:56,648 INFO > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > regionserver.RSRpcServices: Open > hbase:namespace,,1624297762817.396cb6cc00cd4334cb1ea3a792d7529a. > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=5,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0.
Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R namespace > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableDescriptor(ProtobufUtil.java:2937) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.parseFrom(TableDescriptorBuilder.java:1625) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder$ModifyableTableDescriptor.access$200(TableDescriptorBuilder.java:597) > at > org.apache.hadoop.hbase.client.TableDescriptorBuilder.parseFrom(TableDescriptorBuilder.java:320) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.readTableDescriptor(FSTableDescriptors.java:511) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:496) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:482) > at > org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:210) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:2112) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:35218) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) > 2021-06-22 00:05:56,759 ERROR > [RpcServer.priority.RWQ.Fifo.read.handler=3,queue=1,port=16025] > ipc.RpcServer: Unexpected throwable object > java.lang.IllegalArgumentException: Illegal character < > > at 0. Namespaces may only contain 'alphanumeric characters' from any > > language and digits: > ^Ehbase^R^Dmeta > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:246) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:220) > at org.apache.hadoop.hbase.TableName.(TableName.java:348) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:385) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:508) > at > org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toTableName(ProtobufUtil.java:2292) >
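As for what exactly is off: the bytes that TableName.isLegalNamespaceName rejects in the log above are recognizably plain protobuf wire encoding of namespace "hbase" plus qualifier "meta", which suggests a message-framing mismatch between writer and reader rather than data corruption. A hand-rolled illustration (plain Java, no HBase classes; the exact message nesting is an assumption, only the byte layout is standard protobuf):
{code:java}
import java.nio.charset.StandardCharsets;

public class WireMismatchDemo {
  public static void main(String[] args) {
    // Protobuf encoding of a TableName-like message for "hbase:meta":
    // field 1 (namespace): tag 0x0A, length 5, "hbase"
    // field 2 (qualifier): tag 0x12, length 4, "meta"
    byte[] nestedTableName = {
        0x0A, 0x05, 'h', 'b', 'a', 's', 'e',
        0x12, 0x04, 'm', 'e', 't', 'a'
    };
    // A reader that expects raw namespace bytes at this position, instead of a
    // nested message, ends up validating the string below:
    String bogusNamespace =
        new String(nestedTableName, 1, nestedTableName.length - 1, StandardCharsets.UTF_8);
    System.out.println(bogusNamespace); // renders as ^Ehbase^R^Dmeta, as seen in the log
  }
}
{code}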
[jira] [Updated] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25998: - Fix Version/s: 2.4.5 2.3.6 2.5.0 3.0.0-alpha-1 Resolution: Fixed Status: Resolved (was: Patch Available) > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.5 > > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25984: - Resolution: Fixed Status: Resolved (was: Patch Available) > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5 > > Attachments: HBASE-25984-unit-test.patch > > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. > {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. 
I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25984: - Fix Version/s: 2.4.5 1.7.1 2.3.6 2.5.0 3.0.0-alpha-1 > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5 > > Attachments: HBASE-25984-unit-test.patch > > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. > {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. 
I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365060#comment-17365060 ] Bharath Vissapragada commented on HBASE-25984: -- Back port PRs are WIP (PR - jira auto link isn't working, it will catch up soon). https://github.com/apache/hbase/pull/3392 https://github.com/apache/hbase/pull/3393 https://github.com/apache/hbase/pull/3394 > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > Attachments: HBASE-25984-unit-test.patch > > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. 
> {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365057#comment-17365057 ] Bharath Vissapragada commented on HBASE-25998: -- FSHLog doesn't show much improvement in WALPE with the patch, so I believe that is reflected in the YCSB runs too. Unfortunately I'm not able to deploy a branch-2 cluster right now (without much effort) to get the async WAL numbers. I will update here once I have a cluster and some data. > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364520#comment-17364520 ] Bharath Vissapragada edited comment on HBASE-25998 at 6/16/21, 9:21 PM: Thanks [~apurtell] for trying out the patch (and review). One interesting behavior here is that this big throughput difference is only obvious for Async WAL implementation, not clear to me why, perhaps there is a lot more contention in that implementation for some reason. I repeated the same set of tests in branch-1/master based FSHLog and the patch only performs slightly better (few single digit % points). This behavior was also confirmed in the YCSB runs on branch-1 (on a 3 node containerized EC2 cluster). Without patch: branch-1/FSHLog (10M ingest only) {noformat} [OVERALL], RunTime(ms), 199938 [OVERALL], Throughput(ops/sec), 50015.50480649001 [TOTAL_GCS_PS_Scavenge], Count, 293 [TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1222 [TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.611189468735308 [TOTAL_GCS_PS_MarkSweep], Count, 1 [TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 34 [TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.017005271634206603 [TOTAL_GCs], Count, 294 [TOTAL_GC_TIME], Time(ms), 1256 [TOTAL_GC_TIME_%], Time(%), 0.6281947403695145 [CLEANUP], Operations, 512 [CLEANUP], AverageLatency(us), 41.0234375 [CLEANUP], MinLatency(us), 0 [CLEANUP], MaxLatency(us), 18527 [CLEANUP], 95thPercentileLatency(us), 13 [CLEANUP], 99thPercentileLatency(us), 37 [INSERT], Operations, 1000 [INSERT], AverageLatency(us), 5085.9494093 [INSERT], MinLatency(us), 1499 [INSERT], MaxLatency(us), 220927 [INSERT], 95thPercentileLatency(us), 6511 [INSERT], 99thPercentileLatency(us), 16655 [INSERT], Return=OK, 1000 {noformat} With patch: branch-1/FSHLog (10M ingest only) {noformat} [OVERALL], RunTime(ms), 195064 [OVERALL], Throughput(ops/sec), 51265.2257720543 [TOTAL_GCS_PS_Scavenge], Count, 284 [TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1184 [TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.6069802731411229 [TOTAL_GCS_PS_MarkSweep], Count, 1 [TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 33 [TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.01691752450477792 [TOTAL_GCs], Count, 285 [TOTAL_GC_TIME], Time(ms), 1217 [TOTAL_GC_TIME_%], Time(%), 0.6238977976459008 [CLEANUP], Operations, 512 [CLEANUP], AverageLatency(us), 45.783203125 [CLEANUP], MinLatency(us), 1 [CLEANUP], MaxLatency(us), 20591 [CLEANUP], 95thPercentileLatency(us), 14 [CLEANUP], 99thPercentileLatency(us), 37 [INSERT], Operations, 1000 [INSERT], AverageLatency(us), 4958.6662675 [INSERT], MinLatency(us), 1380 [INSERT], MaxLatency(us), 295935 [INSERT], 95thPercentileLatency(us), 6335 [INSERT], 99thPercentileLatency(us), 19071 [INSERT], Return=OK, 1000 {noformat} Unfortunately, the tooling I have does not support branch-2/master (yet) so that I can repeat this YCSB run for Async WAL implementation but if WALPE runs are any indication, there should be a good enough throughput improvement. was (Author: bharathv): Thanks [~apurtell] for trying out the patch (and review). One interesting behavior here is that this big throughput difference is only obvious for Async WAL implementation, not clear to me why, perhaps there is a lot more contention in that implementation for some reason. I repeated the same set of tests in branch-1/master based FSHLog and the patch only performs slightly better (few single digit % points). This behavior was also confirmed in the YCSB runs on branch-1 (on a 3 node containerized EC2 cluster). 
Without patch: branch-1/FSHLog (10M ingest only) {noformat} [OVERALL], RunTime(ms), 199938 [OVERALL], Throughput(ops/sec), 50015.50480649001 [TOTAL_GCS_PS_Scavenge], Count, 293 [TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1222 [TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.611189468735308 [TOTAL_GCS_PS_MarkSweep], Count, 1 [TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 34 [TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.017005271634206603 [TOTAL_GCs], Count, 294 [TOTAL_GC_TIME], Time(ms), 1256 [TOTAL_GC_TIME_%], Time(%), 0.6281947403695145 [CLEANUP], Operations, 512 [CLEANUP], AverageLatency(us), 41.0234375 [CLEANUP], MinLatency(us), 0 [CLEANUP], MaxLatency(us), 18527 [CLEANUP], 95thPercentileLatency(us), 13 [CLEANUP], 99thPercentileLatency(us), 37 [INSERT], Operations, 1000 [INSERT], AverageLatency(us), 5085.9494093 [INSERT], MinLatency(us), 1499 [INSERT], MaxLatency(us), 220927 [INSERT], 95thPercentileLatency(us), 6511 [INSERT], 99thPercentileLatency(us), 16655 [INSERT], Return=OK, 1000 {noformat} With patch: branch-1/FSHLog (10M ingest only) {noformat} [OVERALL], RunTime(ms), 195064 [OVERALL], Throughput(ops/sec), 51265.2257720543 [TOTAL_GCS_PS_Scavenge], Count, 284 [TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1184 [TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.6069802731411229 [TOTAL_GCS_PS_MarkSweep], Count, 1 [TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 33 [TOTAL_GC_TIME_%_PS_MarkSweep], Time(%),
[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364520#comment-17364520 ] Bharath Vissapragada commented on HBASE-25998: -- Thanks [~apurtell] for trying out the patch (and review). One interesting behavior here is that this big throughput difference is only obvious for Async WAL implementation, not clear to me why, perhaps there is a lot more contention in that implementation for some reason. I repeated the same set of tests in branch-1/master based FSHLog and the patch only performs slightly better (few single digit % points). This behavior was also confirmed in the YCSB runs on branch-1 (on a 3 node containerized EC2 cluster). Without patch: branch-1/FSHLog (10M ingest only) {noformat} [OVERALL], RunTime(ms), 199938 [OVERALL], Throughput(ops/sec), 50015.50480649001 [TOTAL_GCS_PS_Scavenge], Count, 293 [TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1222 [TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.611189468735308 [TOTAL_GCS_PS_MarkSweep], Count, 1 [TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 34 [TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.017005271634206603 [TOTAL_GCs], Count, 294 [TOTAL_GC_TIME], Time(ms), 1256 [TOTAL_GC_TIME_%], Time(%), 0.6281947403695145 [CLEANUP], Operations, 512 [CLEANUP], AverageLatency(us), 41.0234375 [CLEANUP], MinLatency(us), 0 [CLEANUP], MaxLatency(us), 18527 [CLEANUP], 95thPercentileLatency(us), 13 [CLEANUP], 99thPercentileLatency(us), 37 [INSERT], Operations, 1000 [INSERT], AverageLatency(us), 5085.9494093 [INSERT], MinLatency(us), 1499 [INSERT], MaxLatency(us), 220927 [INSERT], 95thPercentileLatency(us), 6511 [INSERT], 99thPercentileLatency(us), 16655 [INSERT], Return=OK, 1000 {noformat} With patch: branch-1/FSHLog (10M ingest only) {noformat} [OVERALL], RunTime(ms), 195064 [OVERALL], Throughput(ops/sec), 51265.2257720543 [TOTAL_GCS_PS_Scavenge], Count, 284 [TOTAL_GC_TIME_PS_Scavenge], Time(ms), 1184 [TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.6069802731411229 [TOTAL_GCS_PS_MarkSweep], Count, 1 [TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 33 [TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.01691752450477792 [TOTAL_GCs], Count, 285 [TOTAL_GC_TIME], Time(ms), 1217 [TOTAL_GC_TIME_%], Time(%), 0.6238977976459008 [CLEANUP], Operations, 512 [CLEANUP], AverageLatency(us), 45.783203125 [CLEANUP], MinLatency(us), 1 [CLEANUP], MaxLatency(us), 20591 [CLEANUP], 95thPercentileLatency(us), 14 [CLEANUP], 99thPercentileLatency(us), 37 [INSERT], Operations, 1000 [INSERT], AverageLatency(us), 4958.6662675 [INSERT], MinLatency(us), 1380 [INSERT], MaxLatency(us), 295935 [INSERT], 95thPercentileLatency(us), 6335 [INSERT], 99thPercentileLatency(us), 19071 [INSERT], Return=OK, 1000 {noformat} Unfortunately, the tooling I have does not support branch-2/master (yet) so that I can repeat this YCSB run for Async WAL implementation but if WALPE runs are any indication, we should be a good enough throughput improvement. > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). 
Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362670#comment-17362670 ] Bharath Vissapragada commented on HBASE-25998: -- [~stack] Thanks for taking a look, test runs seem fine, trying to do an e-e throughput test on a cluster, will report the results here. > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25998: - Status: Patch Available (was: Open) > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361973#comment-17361973 ] Bharath Vissapragada commented on HBASE-25998: -- Redid the experiments with JDK-11 (to account for any recent monitor performance enhancements) and I see similar numbers. Also, the numbers above are for {{-t 256}}, which implies heavy contention. It seems like the patch performs well under heavy load and the gap narrows with fewer threads (which I guess is expected), but even with very low concurrency the patch seems to outperform the current state. > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361867#comment-17361867 ] Bharath Vissapragada commented on HBASE-25998: -- [~zhangduo] [~apurtell] [~stack] this might be of interest to you (draft patch up for review); the results seem too good to be true. If you don't mind, please try the patch locally in your environment (I just want to eliminate any noise from my end). PTAL. > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361864#comment-17361864 ] Bharath Vissapragada commented on HBASE-25998: -- {noformat} java -version java version "1.8.0_221" Java(TM) SE Runtime Environment (build 1.8.0_221-b11) Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode) {noformat} For default WAL provider (async WAL) Without Patch {noformat} -- Histograms -- org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.latencyHistogram.nanos count = 10271257 min = 2672827 max = 67700701 mean = 4084532.41 stddev = 6244597.80 median = 3403047.00 75% <= 3525394.00 95% <= 3849268.00 98% <= 4319378.00 99% <= 61134500.00 99.9% <= 67195663.00 org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncCountHistogram.countPerSync count = 100888 min = 52 max = 103 mean = 101.91 stddev = 2.09 median = 102.00 75% <= 102.00 95% <= 102.00 98% <= 102.00 99% <= 103.00 99.9% <= 103.00 org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncHistogram.nanos-between-syncs count = 100889 min = 119051 max = 62778058 mean = 1601305.10 stddev = 3626948.72 median = 1361530.00 75% <= 1407052.00 95% <= 1523418.00 98% <= 1765310.00 99% <= 2839178.00 99.9% <= 62778058.00 -- Meters -- org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.appendMeter.bytes count = 5721241096 mean rate = 37890589.06 events/second 1-minute rate = 36390169.75 events/second 5-minute rate = 33524039.88 events/second 15-minute rate = 31915066.49 events/second org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncMeter.syncs count = 100889 mean rate = 668.16 events/second 1-minute rate = 641.77 events/second 5-minute rate = 590.37 events/second 15-minute rate = 561.67 events/second {noformat} With patch: {noformat} -- Histograms -- org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.latencyHistogram.nanos count = 12927042 min = 943723 max = 60827209 mean = 1865217.32 stddev = 5384907.53 median = 1323691.00 75% <= 1443195.00 95% <= 1765866.00 98% <= 1921920.00 99% <= 3144643.00 99.9% <= 60827209.00 org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncCountHistogram.countPerSync count = 126797 min = 52 max = 104 mean = 101.87 stddev = 2.54 median = 102.00 75% <= 102.00 95% <= 102.00 98% <= 103.00 99% <= 103.00 99.9% <= 103.00 org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncHistogram.nanos-between-syncs count = 126798 min = 122666 max = 60703608 mean = 711847.31 stddev = 3174375.63 median = 519092.00 75% <= 570240.00 95% <= 695175.00 98% <= 754972.00 99% <= 791139.00 99.9% <= 59975393.00 -- Meters -- org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.appendMeter.bytes count = 7200681555 mean rate = 79170095.16 events/second 1-minute rate = 75109969.27 events/second 5-minute rate = 66505621.40 events/second 15-minute rate = 63719949.74 events/second org.apache.hadoop.hbase.wal.WALPerformanceEvaluation.syncMeter.syncs count = 126800 mean rate = 1394.11 events/second 1-minute rate = 1322.31 events/second 5-minute rate = 1169.99 events/second 15-minute rate = 1120.69 events/second {noformat} > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png,
[jira] [Updated] (HBASE-25998) Revisit synchronization in SyncFuture
[ https://issues.apache.org/jira/browse/HBASE-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25998: - Attachment: monitor-overhead-2.png monitor-overhead-1.png > Revisit synchronization in SyncFuture > - > > Key: HBASE-25998 > URL: https://issues.apache.org/jira/browse/HBASE-25998 > Project: HBase > Issue Type: Improvement > Components: Performance, regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Major > Attachments: monitor-overhead-1.png, monitor-overhead-2.png > > > While working on HBASE-25984, I noticed some weird frames in the flame graphs > around monitor entry exit consuming a lot of CPU cycles (see attached > images). Noticed that the synchronization there is too coarse grained and > sometimes unnecessary. I did a simple patch that switched to a reentrant lock > based synchronization with condition variable rather than a busy wait and > that showed 70-80% increased throughput in WAL PE. Seems too good to be > true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25998) Revisit synchronization in SyncFuture
Bharath Vissapragada created HBASE-25998: Summary: Revisit synchronization in SyncFuture Key: HBASE-25998 URL: https://issues.apache.org/jira/browse/HBASE-25998 Project: HBase Issue Type: Improvement Components: Performance, regionserver, wal Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0 Reporter: Bharath Vissapragada Assignee: Bharath Vissapragada While working on HBASE-25984, I noticed some weird frames in the flame graphs around monitor entry exit consuming a lot of CPU cycles (see attached images). Noticed that the synchronization there is too coarse grained and sometimes unnecessary. I did a simple patch that switched to a reentrant lock based synchronization with condition variable rather than a busy wait and that showed 70-80% increased throughput in WAL PE. Seems too good to be true.. (more details in the comments). -- This message was sent by Atlassian Jira (v8.3.4#803005)
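A minimal sketch of the change the description hints at (illustrative class and field names, not the committed patch): waiters park on a Condition tied to an explicit ReentrantLock and are signalled once when the sync completes, instead of repeatedly contending on a coarse object monitor.
{code:java}
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class SyncFutureLockSketch {
  private static final long NOT_DONE = -1L;

  private final ReentrantLock doneLock = new ReentrantLock();
  private final Condition doneCondition = doneLock.newCondition();
  private long doneTxid = NOT_DONE;

  // Handler side: block on the condition instead of busy-waiting on the monitor.
  long get(long timeoutNs) throws InterruptedException, TimeoutException {
    doneLock.lock();
    try {
      while (doneTxid == NOT_DONE) {
        if (timeoutNs <= 0) {
          throw new TimeoutException("sync future timed out");
        }
        timeoutNs = doneCondition.awaitNanos(timeoutNs);
      }
      return doneTxid;
    } finally {
      doneLock.unlock();
    }
  }

  // SyncRunner side: publish the result and wake all waiters exactly once.
  void done(long txid) {
    doneLock.lock();
    try {
      doneTxid = txid;
      doneCondition.signalAll();
    } finally {
      doneLock.unlock();
    }
  }
}
{code}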
[jira] [Commented] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.
[ https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361144#comment-17361144 ] Bharath Vissapragada commented on HBASE-25924: -- [~shahrs87] I think it was committed and reverted from 2.3, still would be nice to have a working patch in that branch (if you have spare cycles) :-). > Seeing a spike in uncleanlyClosedWALs metric. > - > > Key: HBASE-25924 > URL: https://issues.apache.org/jira/browse/HBASE-25924 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4, 1.7.1 > > > Getting the following log line in all of our production clusters when > WALEntryStream is dequeuing WAL file. > {noformat} > 2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - > Reached the end of WAL file hdfs://. It was not closed > cleanly, so we did not parse 8 bytes of data. This is normally ok. > {noformat} > The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + > "LAWP" (4 bytes) = 8 bytes) > While dequeue'ing the WAL file from WALEntryStream, we reset the reader here. > [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221] > {code:java} > private void tryAdvanceEntry() throws IOException { > if (checkReader()) { > readNextEntryAndSetPosition(); > if (currentEntry == null) { // no more entries in this log file - see > if log was rolled > if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled > // Before dequeueing, we should always get one more attempt at > reading. > // This is in case more entries came in after we opened the reader, > // and a new log was enqueued while we were reading. See HBASE-6758 > resetReader(); ---> HERE > readNextEntryAndSetPosition(); > if (currentEntry == null) { > if (checkAllBytesParsed()) { // now we're certain we're done with > this log file > dequeueCurrentLog(); > if (openNextLog()) { > readNextEntryAndSetPosition(); > } > } > } > } // no other logs, we've simply hit the end of the current open log. > Do nothing > } > } > // do nothing if we don't have a WAL Reader (e.g. if there's no logs in > queue) > } > {code} > In resetReader, we call the following methods, WALEntryStream#resetReader > > ProtobufLogReader#reset ---> ProtobufLogReader#initInternal. > In ProtobufLogReader#initInternal, we try to create the whole reader object > from scratch to see if any new data has been written. > We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength. > We calculate whether trailer is present or not depending on fileLength. -- This message was sent by Atlassian Jira (v8.3.4#803005)
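A short sketch of the point the quoted description makes about the stale ReaderBase#fileLength (assumed names and constants, not the committed fix): whether the 8-byte trailer marker can even be present should be decided from the WAL's current length, re-read from the filesystem at reset time, rather than from the length cached when the reader was first opened.
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class TrailerCheckSketch {
  // 4-byte serialized trailer size plus the 4-byte "LAWP" magic.
  private static final int TRAILER_MARKER_SIZE = 8;

  static boolean mayHaveTrailer(FileSystem fs, Path wal, long headerEndOffset)
      throws IOException {
    long currentLength = fs.getFileStatus(wal).getLen(); // refresh instead of caching
    return currentLength - headerEndOffset >= TRAILER_MARKER_SIZE;
  }
}
{code}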
[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360383#comment-17360383 ] Bharath Vissapragada commented on HBASE-25984: -- After reading the AsyncFSWAL implementation, I think the overwrites of the futures are possible but it does not cause a deadlock because of the way the safe point is attained. I uploaded a draft patch that removes the usage of ThreadLocals for both the WAL implementations. It only matters for the FSHLog implementation but in general it seems risky to use ThreadLocals that are prone to overwrites. [~zhangduo] Any thoughts? > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > Attachments: HBASE-25984-unit-test.patch > > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. 
> {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
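A minimal sketch of the direction described in the comment above (illustrative names only; the real change is in the draft patch): hand out a fresh sync future per request instead of resetting a ThreadLocal instance that the ring buffer may still hold a reference to, so there is nothing left to overwrite.
{code:java}
// Old shape (prone to overwrite while still referenced elsewhere):
//   private final ThreadLocal<SyncFuture> cachedSyncFutures =
//       ThreadLocal.withInitial(SyncFuture::new);
//   SyncFuture f = cachedSyncFutures.get().reset(txid);
final class PerRequestSyncFuture {
  private final long txid;
  private boolean done;

  private PerRequestSyncFuture(long txid) {
    this.txid = txid;
  }

  // New shape: one future per request; it is garbage collected once both the
  // handler and the WAL pipeline drop their references, so reset() disappears.
  static PerRequestSyncFuture newForRequest(long txid) {
    return new PerRequestSyncFuture(txid);
  }

  long getTxid() {
    return txid;
  }

  synchronized void markDone() {
    done = true;
    notifyAll();
  }

  synchronized boolean await(long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!done) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        return false; // timed out, sync still outstanding
      }
      wait(remaining);
    }
    return true;
  }
}
{code}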
[jira] [Comment Edited] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359575#comment-17359575 ] Bharath Vissapragada edited comment on HBASE-25984 at 6/8/21, 9:39 PM: --- Obtained a heap dump and poked around quite a bit and observed that in-memory state of the {{RingBufferEventHandler}} is corrupted. After analyzing it and eyeballing the code (long story short), came up with the following theory... This is the current implementation of SyncFuture#reset() {noformat} 1 synchronized SyncFuture reset(final long sequence, Span span) { 2if (t != null && t != Thread.currentThread()) throw new IllegalStateException(); 3t = Thread.currentThread(); 4if (!isDone()) throw new IllegalStateException("" + sequence + " " + Thread.currentThread()); 5this.doneSequence = NOT_DONE; 6this.ringBufferSequence = sequence; 7this.span = span; 8this.throwable = null; 9return this; 10 } {noformat} We can see, there are guards against overwriting un-finished sync futures with checks on ‘current thread’ (L2) and !isDone() (L4). These are not tight checks, consider the following sequence of interleaved actions that can result in a deadlock.. # RPC handler#1 → WAL sync - SyncFuture#1 (from ThreadLocalCache) → Add to RingBuffer - state: NOT_DONE # RingBufferConsumer#onEvent() → SyncFuture#1 dispatch → old state: NOT_DONE, new state: DONE (via releaseSyncFuture() in SyncRunner.run()) # RPCHandler#1 (reused for another region operation) → ThreadLocal SyncFuture#1 NOT_DONE (via reset()) (*this goes through because isDone() now returns true and L2 always returns true because we reuse the ThreadLocal instance*). # RingBufferConsumer#onEvent() - attainSafePoint() -> works on the overwritten sync future state (from step 3) # DEADLOCK === (as the sync future remains in the state forever) The problem here is that once a SyncFuture is marked DONE, it is eligible for reuse. Even though ring buffer still has references to it and is checking on it to attain a safe point, in the background another handler can just overwrite it resulting in ring buffer operating on a wrong future and deadlocking the system. Very subtle bug.. I'm able to reproduce this in a carefully crafted unit test (attached). I have a fix solves this problem, will create a PR for review soon. was (Author: bharathv): Obtained a heap dump and poked around quite a bit and observed that in-memory state is corrupted. After analyzing it and eyeballing the code (long story short), came up with the following theory... This is the current implementation of SyncFuture#reset() {noformat} 1 synchronized SyncFuture reset(final long sequence, Span span) { 2if (t != null && t != Thread.currentThread()) throw new IllegalStateException(); 3t = Thread.currentThread(); 4if (!isDone()) throw new IllegalStateException("" + sequence + " " + Thread.currentThread()); 5this.doneSequence = NOT_DONE; 6this.ringBufferSequence = sequence; 7this.span = span; 8this.throwable = null; 9return this; 10 } {noformat} We can see, there are guards against overwriting un-finished sync futures with checks on ‘current thread’ (L2) and !isDone() (L4). These are not tight checks, consider the following sequence of interleaved actions that can result in a deadlock.. 
# RPC handler#1 → WAL sync - SyncFuture#1 (from ThreadLocalCache) → Add to RingBuffer - state: NOT_DONE # RingBufferConsumer#onEvent() → SyncFuture#1 dispatch → old state: NOT_DONE, new state: DONE (via releaseSyncFuture() in SyncRunner.run()) # RPCHandler#1 (reused for another region operation) → ThreadLocal SyncFuture#1 NOT_DONE (via reset()) (*this goes through because isDone() now returns true and L2 always returns true because we reuse the ThreadLocal instance*). # RingBufferConsumer#onEvent() - attainSafePoint() -> works on the overwritten sync future state (from step 3) # DEADLOCK === (as the sync future remains in the state forever) The problem here is that once a SyncFuture is marked DONE, it is eligible for reuse. Even though ring buffer still has references to it and is checking on it to attain a safe point, in the background another handler can just overwrite it resulting in ring buffer operating on a wrong future and deadlocking the system. Very subtle bug.. I'm able to reproduce this in a carefully crafted unit test (attached). I have a fix solves this problem, will create a PR for review soon. > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >
[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25984: - Attachment: HBASE-25984-unit-test.patch > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > Attachments: HBASE-25984-unit-test.patch > > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. > {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359579#comment-17359579 ] Bharath Vissapragada commented on HBASE-25984: -- This has been a recurring problem for us under load deadlocking a bunch of region servers and severely dropping the ingestion throughput until our probes trigger a force abort of the RS pods and the throughput recovers. I cannot share the heap dump for obvious reasons but can share any other details that reviewers might be interested in. > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. 
> {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359575#comment-17359575 ] Bharath Vissapragada commented on HBASE-25984: -- Obtained a heap dump and poked around quite a bit and observed that the in-memory state is corrupted. After analyzing it and eyeballing the code (long story short), I came up with the following theory. This is the current implementation of SyncFuture#reset():
{noformat}
 1 synchronized SyncFuture reset(final long sequence, Span span) {
 2   if (t != null && t != Thread.currentThread()) throw new IllegalStateException();
 3   t = Thread.currentThread();
 4   if (!isDone()) throw new IllegalStateException("" + sequence + " " + Thread.currentThread());
 5   this.doneSequence = NOT_DONE;
 6   this.ringBufferSequence = sequence;
 7   this.span = span;
 8   this.throwable = null;
 9   return this;
10 }
{noformat}
We can see there are guards against overwriting unfinished sync futures: a check on the 'current thread' (L2) and on !isDone() (L4). These are not tight checks; consider the following sequence of interleaved actions, which can result in a deadlock:
# RPC handler#1 → WAL sync - SyncFuture#1 (from ThreadLocalCache) → Add to RingBuffer - state: NOT_DONE
# RingBufferConsumer#onEvent() → SyncFuture#1 dispatch → old state: NOT_DONE, new state: DONE (via releaseSyncFuture() in SyncRunner.run())
# RPC handler#1 (reused for another region operation) → ThreadLocal SyncFuture#1 NOT_DONE (via reset()) (*this goes through because isDone() now returns true, and the L2 check always passes because we reuse the ThreadLocal instance*)
# RingBufferConsumer#onEvent() - attainSafePoint() → works on the overwritten sync future state (from step 3)
# DEADLOCK (the sync future remains in that state forever)
The problem here is that once a SyncFuture is marked DONE, it is eligible for reuse. Even though the ring buffer still holds references to it and is checking on it to attain a safe point, another handler can simply overwrite it in the background, leaving the ring buffer operating on the wrong future and deadlocking the system. A very subtle bug. I'm able to reproduce this in a carefully crafted unit test (attached). I have a fix that solves this problem and will create a PR for review soon.
> FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis.
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. > {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > >
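Editor's note: the race above boils down to a DONE future being reclaimed by its handler while the ring buffer consumer still references it. The sketch below shows one way the reuse guard could be tightened — keep a future out of the reusable state until the consumer explicitly releases it, so a premature reset() fails fast instead of silently corrupting state. This is a hypothetical illustration of the idea, not the patch that was actually committed for HBASE-25984.
{code:java}
// Hypothetical guard: a future may only be reset once every referencing party
// has released it, closing the window described in the comment above.
public final class GuardedSyncFuture {
  private static final long NOT_DONE = -1L;
  private long doneSequence = NOT_DONE; // guarded by this
  private boolean inFlight;             // true from reset() until release()

  /** Handler side: prepare the future for a new sync. Fails fast if still referenced. */
  public synchronized GuardedSyncFuture reset(long ringBufferSequence) {
    if (inFlight) {
      // The original code only checked !isDone(); a DONE-but-still-referenced future
      // slipped through and was overwritten while attainSafePoint() was watching it.
      throw new IllegalStateException("SyncFuture still referenced by the ring buffer");
    }
    this.doneSequence = NOT_DONE;
    this.inFlight = true;
    return this;
  }

  /** SyncRunner side: mark the sync complete. The future is DONE but not yet reusable. */
  public synchronized void done(long sequence) {
    this.doneSequence = sequence;
  }

  /** Ring buffer consumer side: drop the reference, making the future reusable again. */
  public synchronized void release() {
    this.inFlight = false;
  }

  public synchronized boolean isDone() {
    return doneSequence != NOT_DONE;
  }
}
{code}
The key design point is that "done" and "reusable" become separate states, so the handler in step 3 of the interleaving above would get an IllegalStateException instead of quietly overwriting the future the consumer is still tracking.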
[jira] [Updated] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25984: - Labels: deadlock hang (was: ) > FSHLog WAL lockup with sync future reuse [RS deadlock] > -- > > Key: HBASE-25984 > URL: https://issues.apache.org/jira/browse/HBASE-25984 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Critical > Labels: deadlock, hang > > We use FSHLog as the WAL implementation (branch-1 based) and under heavy load > we noticed the WAL system gets locked up due to a subtle bug involving racy > code with sync future reuse. This bug applies to all FSHLog implementations > across branches. > Symptoms: > On heavily loaded clusters with large write load we noticed that the region > servers are hanging abruptly with filled up handler queues and stuck MVCC > indicating appends/syncs not making any progress. > {noformat} > WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, > regionName=1ce4003ab60120057734ffe367667dca} > WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - > STUCK for : 296000 millis. > MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, > regionName=7c441d7243f9f504194dae6bf2622631} > {noformat} > All the handlers are stuck waiting for the sync futures and timing out. > {noformat} > java.lang.Object.wait(Native Method) > > org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) > . > {noformat} > Log rolling is stuck because it was unable to attain a safe point > {noformat} >java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) > {noformat} > and the Ring buffer consumer thinks that there are some outstanding syncs > that need to finish.. > {noformat} > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > {noformat} > On the other hand, SyncRunner threads are idle and just waiting for work > implying that there are no pending SyncFutures that need to be run > {noformat} >sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) > java.lang.Thread.run(Thread.java:748) > {noformat} > Overall the WAL system is dead locked and could make no progress until it was > aborted. I got to the bottom of this issue and have a patch that can fix it > (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]
Bharath Vissapragada created HBASE-25984: Summary: FSHLog WAL lockup with sync future reuse [RS deadlock] Key: HBASE-25984 URL: https://issues.apache.org/jira/browse/HBASE-25984 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5 Reporter: Bharath Vissapragada Assignee: Bharath Vissapragada We use FSHLog as the WAL implementation (branch-1 based) and under heavy load we noticed the WAL system gets locked up due to a subtle bug involving racy code with sync future reuse. This bug applies to all FSHLog implementations across branches. Symptoms: On heavily loaded clusters with large write load we noticed that the region servers are hanging abruptly with filled up handler queues and stuck MVCC indicating appends/syncs not making any progress. {noformat} WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, regionName=1ce4003ab60120057734ffe367667dca} WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, regionName=7c441d7243f9f504194dae6bf2622631} {noformat} All the handlers are stuck waiting for the sync futures and timing out. {noformat} java.lang.Object.wait(Native Method) org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183) org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509) . {noformat} Log rolling is stuck because it was unable to attain a safe point {noformat} java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799) org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900) {noformat} and the Ring buffer consumer thinks that there are some outstanding syncs that need to finish.. {noformat} org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031) org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999) org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) {noformat} On the other hand, SyncRunner threads are idle and just waiting for work implying that there are no pending SyncFutures that need to be run {noformat} sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297) java.lang.Thread.run(Thread.java:748) {noformat} Overall the WAL system is dead locked and could make no progress until it was aborted. I got to the bottom of this issue and have a patch that can fix it (more details in the comments due to word limit in the description). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick
[ https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358951#comment-17358951 ] Bharath Vissapragada commented on HBASE-24440: -- Thanks for the detailed results. > These implementations were microbenchmarked to identify the superior option, > which seems to be BoundedIncrementYieldAdvancingClock. Kind of surprised. I guessed it would be IncrementAdvancingClock theoretically (since there is no overhead to maintain boundedness), still trying to wrap my head around the subtle details in the patch. > We are tracking whether or not a row has already been committed to in the > current clock tick. Or, we are tracking per region a set of rows per clock > tick. Why not track that explicitly? This is an interesting idea, trying to think about it a bit more. IIUC the idea here is to maximize the commit throughput as much as possible in a single tick by picking from the available pool of disjoint row keys that can share the same tick. Let me take a look at the patch.. > Prevent temporal misordering on timescales smaller than one clock tick > -- > > Key: HBASE-24440 > URL: https://issues.apache.org/jira/browse/HBASE-24440 > Project: HBase > Issue Type: Brainstorming >Reporter: Andrew Kyle Purtell >Assignee: Andrew Kyle Purtell >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0 > > > When mutations are sent to the servers without a timestamp explicitly > assigned by the client the server will substitute the current wall clock > time. There are edge cases where it is at least theoretically possible for > more than one mutation to be committed to a given row within the same clock > tick. When this happens we have to track and preserve the ordering of these > mutations in some other way besides the timestamp component of the key. Let > me bypass most discussion here by noting that whether we do this or not, we > do not pass such ordering information in the cross cluster replication > protocol. We also have interesting edge cases regarding key type precedence > when mutations arrive "simultaneously": we sort deletes ahead of puts. This, > especially in the presence of replication, can lead to visible anomalies for > clients able to interact with both source and sink. > There is a simple solution that removes the possibility that these edge cases > can occur: > We can detect, when we are about to commit a mutation to a row, if we have > already committed a mutation to this same row in the current clock tick. > Occurrences of this condition will be rare. We are already tracking current > time. We have to know this in order to assign the timestamp. Where this > becomes interesting is how we might track the last commit time per row. > Making the detection of this case efficient for the normal code path is the > bulk of the challenge. One option is to keep track of the last locked time > for row locks. (Todo: How would we track and garbage collect this efficiently > and correctly. Not the ideal option.) We might also do this tracking somehow > via the memstore. (At least in this case the lifetime and distribution of in > memory row state, including the proposed timestamps, would align.) Assuming > we can efficiently know if we are about to commit twice to the same row > within a single clock tick, we would simply sleep/yield the current thread > until the clock ticks over, and then proceed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
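Editor's note: for readers following the clock discussion above, here is a rough sketch of the two clock shapes being compared, as I read the comment — an increment-advancing clock never hands out the same tick twice, and the bounded variant yields once it has advanced too far past wall time. The class below is illustrative only; it is not the AdvancingClock hierarchy from the actual patch.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Sketch of an increment-advancing clock with a bound on how far it may run ahead.
public final class AdvancingClock {
  private final AtomicLong lastReturned = new AtomicLong();
  private final long maxAheadMs;  // bound on drift past wall-clock time

  public AdvancingClock(long maxAheadMs) {
    this.maxAheadMs = maxAheadMs;
  }

  public long currentTime() throws InterruptedException {
    while (true) {
      long wall = System.currentTimeMillis();
      long prev = lastReturned.get();
      long candidate = Math.max(wall, prev + 1);   // never repeat a tick, never go backwards
      if (candidate - wall > maxAheadMs) {
        Thread.sleep(1);                           // yield until the wall clock catches up
        continue;
      }
      if (lastReturned.compareAndSet(prev, candidate)) {
        return candidate;
      }
    }
  }
}
{code}
Without the bound (maxAheadMs effectively infinite) this degenerates to the plain increment-advancing behavior; the yield is what keeps assigned timestamps from drifting arbitrarily far ahead of real time under a burst of same-row commits.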
[jira] [Resolved] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.
[ https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada resolved HBASE-25924. -- Resolution: Fixed > Seeing a spike in uncleanlyClosedWALs metric. > - > > Key: HBASE-25924 > URL: https://issues.apache.org/jira/browse/HBASE-25924 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4, 1.7.1 > > > Getting the following log line in all of our production clusters when > WALEntryStream is dequeuing WAL file. > {noformat} > 2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - > Reached the end of WAL file hdfs://. It was not closed > cleanly, so we did not parse 8 bytes of data. This is normally ok. > {noformat} > The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + > "LAWP" (4 bytes) = 8 bytes) > While dequeue'ing the WAL file from WALEntryStream, we reset the reader here. > [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221] > {code:java} > private void tryAdvanceEntry() throws IOException { > if (checkReader()) { > readNextEntryAndSetPosition(); > if (currentEntry == null) { // no more entries in this log file - see > if log was rolled > if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled > // Before dequeueing, we should always get one more attempt at > reading. > // This is in case more entries came in after we opened the reader, > // and a new log was enqueued while we were reading. See HBASE-6758 > resetReader(); ---> HERE > readNextEntryAndSetPosition(); > if (currentEntry == null) { > if (checkAllBytesParsed()) { // now we're certain we're done with > this log file > dequeueCurrentLog(); > if (openNextLog()) { > readNextEntryAndSetPosition(); > } > } > } > } // no other logs, we've simply hit the end of the current open log. > Do nothing > } > } > // do nothing if we don't have a WAL Reader (e.g. if there's no logs in > queue) > } > {code} > In resetReader, we call the following methods, WALEntryStream#resetReader > > ProtobufLogReader#reset ---> ProtobufLogReader#initInternal. > In ProtobufLogReader#initInternal, we try to create the whole reader object > from scratch to see if any new data has been written. > We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength. > We calculate whether trailer is present or not depending on fileLength. -- This message was sent by Atlassian Jira (v8.3.4#803005)
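Editor's note: the description above points at a stale cached file length as the reason trailer detection fails after a reset. The sketch below shows that fix direction — re-stat the WAL on every reset so the trailer check (4-byte serialized size + 4-byte "LAWP" magic, the 8 bytes mentioned above) runs against the file's current length. Class and method names are illustrative, not the real ProtobufLogReader/ReaderBase API.
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative reader-side state: refresh the length instead of trusting the
// value captured while the WAL was still being written.
public final class WalReaderState {
  private final FileSystem fs;
  private final Path walPath;
  private long fileLength;   // stale if captured before the WAL was closed

  public WalReaderState(FileSystem fs, Path walPath) throws IOException {
    this.fs = fs;
    this.walPath = walPath;
    this.fileLength = fs.getFileStatus(walPath).getLen();
  }

  /** Re-read the length on every reset so a freshly closed WAL exposes its trailer. */
  public void reset() throws IOException {
    this.fileLength = fs.getFileStatus(walPath).getLen();
  }

  public boolean mayHaveTrailer(long trailerSize) {
    return fileLength >= trailerSize;
  }
}
{code}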
[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.
[ https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25932: - Component/s: wal metrics > TestWALEntryStream#testCleanClosedWALs test is failing. > --- > > Key: HBASE-25932 > URL: https://issues.apache.org/jira/browse/HBASE-25932 > Project: HBase > Issue Type: Bug > Components: metrics, Replication, wal >Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4 >Reporter: Rushabh Shah >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4 > > Attachments: HBASE-25932-test-approach.patch > > > We are seeing the following test failure. > TestWALEntryStream#testCleanClosedWALs > This test was added in HBASE-25924. I don't think the test failure has > anything to do with the patch in HBASE-25924. > Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In > all the branches, we were not parsing the wal trailer when we close the wal > reader inside ReplicationSourceWALReader thread. The root cause was when we > add active WAL to ReplicationSourceWALReader, we cache the file size when the > wal was being actively written and once the wal was closed and replicated and > removed from WALEntryStream, we did reset the ProtobufLogReader object but we > didn't update the length of the wal and that was causing EOF errors since it > can't find the WALTrailer with the stale wal file length. > The fix applied nicely to branch-1 since we use FSHlog implementation which > closes the WAL file sychronously. > But in branch-2 and master, we use _AsyncFSWAL_ implementation and the > closing of wal file is done asynchronously (as the name suggests). This is > causing the test to fail. Below is the test. > {code:java} > @Test > public void testCleanClosedWALs() throws Exception { > try (WALEntryStream entryStream = new WALEntryStream( > logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) { > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > appendToLogAndSync(); > assertNotNull(entryStream.next()); > log.rollWriter(); ===> This does an asynchronous close of wal. > appendToLogAndSync(); > assertNotNull(entryStream.next()); > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > } > } > {code} > In the above code, when we roll writer, we don't close the old wal file > immediately so the ReplicationReader thread is not able to get the updated > wal file size and that is throwing EOF errors. > If I add a sleep of few milliseconds (1 ms in my local env) between > rollWriter and appendToLogAndSync statement then the test passes but this is > *not* a proper fix since we are working around the race between > ReplicationSourceWALReaderThread and closing of WAL file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
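Editor's note: the description above rejects a fixed sleep between rollWriter() and the next append because it only papers over the race with the asynchronous WAL close. A common alternative is to poll the observable condition with a deadline. The helper below is a generic sketch (HBase's own test Waiter utility could serve the same role); the oldWalClosed() predicate in the usage note is hypothetical — it stands for whatever signal the test can actually observe.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Poll a condition with a deadline instead of a single hard-coded sleep.
public final class TestWaitUtil {
  private TestWaitUtil() {}

  public static void waitForCondition(BooleanSupplier condition, long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() > deadline) {
        throw new AssertionError("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(10);   // short poll interval
    }
  }
}
{code}
Usage would look like TestWaitUtil.waitForCondition(() -> oldWalClosed(), 30_000) between log.rollWriter() and the next appendToLogAndSync(), so the test waits for the close to actually complete rather than racing it.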
[jira] [Updated] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.
[ https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25924: - Fix Version/s: (was: 2.3.6) > Seeing a spike in uncleanlyClosedWALs metric. > - > > Key: HBASE-25924 > URL: https://issues.apache.org/jira/browse/HBASE-25924 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4, 1.7.1 > > > Getting the following log line in all of our production clusters when > WALEntryStream is dequeuing WAL file. > {noformat} > 2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - > Reached the end of WAL file hdfs://. It was not closed > cleanly, so we did not parse 8 bytes of data. This is normally ok. > {noformat} > The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + > "LAWP" (4 bytes) = 8 bytes) > While dequeue'ing the WAL file from WALEntryStream, we reset the reader here. > [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221] > {code:java} > private void tryAdvanceEntry() throws IOException { > if (checkReader()) { > readNextEntryAndSetPosition(); > if (currentEntry == null) { // no more entries in this log file - see > if log was rolled > if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled > // Before dequeueing, we should always get one more attempt at > reading. > // This is in case more entries came in after we opened the reader, > // and a new log was enqueued while we were reading. See HBASE-6758 > resetReader(); ---> HERE > readNextEntryAndSetPosition(); > if (currentEntry == null) { > if (checkAllBytesParsed()) { // now we're certain we're done with > this log file > dequeueCurrentLog(); > if (openNextLog()) { > readNextEntryAndSetPosition(); > } > } > } > } // no other logs, we've simply hit the end of the current open log. > Do nothing > } > } > // do nothing if we don't have a WAL Reader (e.g. if there's no logs in > queue) > } > {code} > In resetReader, we call the following methods, WALEntryStream#resetReader > > ProtobufLogReader#reset ---> ProtobufLogReader#initInternal. > In ProtobufLogReader#initInternal, we try to create the whole reader object > from scratch to see if any new data has been written. > We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength. > We calculate whether trailer is present or not depending on fileLength. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.
[ https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355221#comment-17355221 ] Bharath Vissapragada commented on HBASE-25924: -- Thanks, opened HBASE-25957 as a subtask. > Seeing a spike in uncleanlyClosedWALs metric. > - > > Key: HBASE-25924 > URL: https://issues.apache.org/jira/browse/HBASE-25924 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4, 1.7.1 > > > Getting the following log line in all of our production clusters when > WALEntryStream is dequeuing WAL file. > {noformat} > 2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - > Reached the end of WAL file hdfs://. It was not closed > cleanly, so we did not parse 8 bytes of data. This is normally ok. > {noformat} > The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + > "LAWP" (4 bytes) = 8 bytes) > While dequeue'ing the WAL file from WALEntryStream, we reset the reader here. > [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221] > {code:java} > private void tryAdvanceEntry() throws IOException { > if (checkReader()) { > readNextEntryAndSetPosition(); > if (currentEntry == null) { // no more entries in this log file - see > if log was rolled > if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled > // Before dequeueing, we should always get one more attempt at > reading. > // This is in case more entries came in after we opened the reader, > // and a new log was enqueued while we were reading. See HBASE-6758 > resetReader(); ---> HERE > readNextEntryAndSetPosition(); > if (currentEntry == null) { > if (checkAllBytesParsed()) { // now we're certain we're done with > this log file > dequeueCurrentLog(); > if (openNextLog()) { > readNextEntryAndSetPosition(); > } > } > } > } // no other logs, we've simply hit the end of the current open log. > Do nothing > } > } > // do nothing if we don't have a WAL Reader (e.g. if there's no logs in > queue) > } > {code} > In resetReader, we call the following methods, WALEntryStream#resetReader > > ProtobufLogReader#reset ---> ProtobufLogReader#initInternal. > In ProtobufLogReader#initInternal, we try to create the whole reader object > from scratch to see if any new data has been written. > We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength. > We calculate whether trailer is present or not depending on fileLength. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25957) Back port HBASE-25924 and its related patches to branch-2.3
[ https://issues.apache.org/jira/browse/HBASE-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355219#comment-17355219 ] Bharath Vissapragada commented on HBASE-25957: -- [~rushabh.shah] FYI, if you have any free cycles. > Back port HBASE-25924 and its related patches to branch-2.3 > --- > > Key: HBASE-25957 > URL: https://issues.apache.org/jira/browse/HBASE-25957 > Project: HBase > Issue Type: Sub-task > Components: metrics, Replication, wal >Affects Versions: 2.3.6 >Reporter: Bharath Vissapragada >Priority: Major > > The branch has diverged a bit and the patches are not applying clean. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25957) Back port HBASE-25924 and its related patches to branch-2.3
[ https://issues.apache.org/jira/browse/HBASE-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25957: - Component/s: wal Replication metrics > Back port HBASE-25924 and its related patches to branch-2.3 > --- > > Key: HBASE-25957 > URL: https://issues.apache.org/jira/browse/HBASE-25957 > Project: HBase > Issue Type: Sub-task > Components: metrics, Replication, wal >Affects Versions: 2.3.6 >Reporter: Bharath Vissapragada >Priority: Major > > The branch has diverged a bit and the patches are not applying clean. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25957) Back port HBASE-25924 and its related patches to branch-2.3
Bharath Vissapragada created HBASE-25957: Summary: Back port HBASE-25924 and its related patches to branch-2.3 Key: HBASE-25957 URL: https://issues.apache.org/jira/browse/HBASE-25957 Project: HBase Issue Type: Sub-task Affects Versions: 2.3.6 Reporter: Bharath Vissapragada The branch has diverged a bit and the patches are not applying clean. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.
[ https://issues.apache.org/jira/browse/HBASE-25924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355178#comment-17355178 ] Bharath Vissapragada commented on HBASE-25924: -- HBASE-25932 is now committed to master/branch-2/branch2.4. [~apurtell] Whats the general guidance on back porting to branch-2.3? That branch has diverged quite a bit and this patch doesn't apply cleanly. > Seeing a spike in uncleanlyClosedWALs metric. > - > > Key: HBASE-25924 > URL: https://issues.apache.org/jira/browse/HBASE-25924 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.4 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4, 1.7.1 > > > Getting the following log line in all of our production clusters when > WALEntryStream is dequeuing WAL file. > {noformat} > 2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - > Reached the end of WAL file hdfs://. It was not closed > cleanly, so we did not parse 8 bytes of data. This is normally ok. > {noformat} > The 8 bytes are usually the trailer serialized size (SIZE_OF_INT (4bytes) + > "LAWP" (4 bytes) = 8 bytes) > While dequeue'ing the WAL file from WALEntryStream, we reset the reader here. > [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221] > {code:java} > private void tryAdvanceEntry() throws IOException { > if (checkReader()) { > readNextEntryAndSetPosition(); > if (currentEntry == null) { // no more entries in this log file - see > if log was rolled > if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled > // Before dequeueing, we should always get one more attempt at > reading. > // This is in case more entries came in after we opened the reader, > // and a new log was enqueued while we were reading. See HBASE-6758 > resetReader(); ---> HERE > readNextEntryAndSetPosition(); > if (currentEntry == null) { > if (checkAllBytesParsed()) { // now we're certain we're done with > this log file > dequeueCurrentLog(); > if (openNextLog()) { > readNextEntryAndSetPosition(); > } > } > } > } // no other logs, we've simply hit the end of the current open log. > Do nothing > } > } > // do nothing if we don't have a WAL Reader (e.g. if there's no logs in > queue) > } > {code} > In resetReader, we call the following methods, WALEntryStream#resetReader > > ProtobufLogReader#reset ---> ProtobufLogReader#initInternal. > In ProtobufLogReader#initInternal, we try to create the whole reader object > from scratch to see if any new data has been written. > We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength. > We calculate whether trailer is present or not depending on fileLength. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.
[ https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25932: - Resolution: Fixed Status: Resolved (was: Patch Available) > TestWALEntryStream#testCleanClosedWALs test is failing. > --- > > Key: HBASE-25932 > URL: https://issues.apache.org/jira/browse/HBASE-25932 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4 >Reporter: Rushabh Shah >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4 > > Attachments: HBASE-25932-test-approach.patch > > > We are seeing the following test failure. > TestWALEntryStream#testCleanClosedWALs > This test was added in HBASE-25924. I don't think the test failure has > anything to do with the patch in HBASE-25924. > Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In > all the branches, we were not parsing the wal trailer when we close the wal > reader inside ReplicationSourceWALReader thread. The root cause was when we > add active WAL to ReplicationSourceWALReader, we cache the file size when the > wal was being actively written and once the wal was closed and replicated and > removed from WALEntryStream, we did reset the ProtobufLogReader object but we > didn't update the length of the wal and that was causing EOF errors since it > can't find the WALTrailer with the stale wal file length. > The fix applied nicely to branch-1 since we use FSHlog implementation which > closes the WAL file sychronously. > But in branch-2 and master, we use _AsyncFSWAL_ implementation and the > closing of wal file is done asynchronously (as the name suggests). This is > causing the test to fail. Below is the test. > {code:java} > @Test > public void testCleanClosedWALs() throws Exception { > try (WALEntryStream entryStream = new WALEntryStream( > logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) { > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > appendToLogAndSync(); > assertNotNull(entryStream.next()); > log.rollWriter(); ===> This does an asynchronous close of wal. > appendToLogAndSync(); > assertNotNull(entryStream.next()); > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > } > } > {code} > In the above code, when we roll writer, we don't close the old wal file > immediately so the ReplicationReader thread is not able to get the updated > wal file size and that is throwing EOF errors. > If I add a sleep of few milliseconds (1 ms in my local env) between > rollWriter and appendToLogAndSync statement then the test passes but this is > *not* a proper fix since we are working around the race between > ReplicationSourceWALReaderThread and closing of WAL file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.
[ https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25932: - Fix Version/s: 2.4.4 2.5.0 3.0.0-alpha-1 > TestWALEntryStream#testCleanClosedWALs test is failing. > --- > > Key: HBASE-25932 > URL: https://issues.apache.org/jira/browse/HBASE-25932 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4 >Reporter: Rushabh Shah >Assignee: Bharath Vissapragada >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4 > > Attachments: HBASE-25932-test-approach.patch > > > We are seeing the following test failure. > TestWALEntryStream#testCleanClosedWALs > This test was added in HBASE-25924. I don't think the test failure has > anything to do with the patch in HBASE-25924. > Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In > all the branches, we were not parsing the wal trailer when we close the wal > reader inside ReplicationSourceWALReader thread. The root cause was when we > add active WAL to ReplicationSourceWALReader, we cache the file size when the > wal was being actively written and once the wal was closed and replicated and > removed from WALEntryStream, we did reset the ProtobufLogReader object but we > didn't update the length of the wal and that was causing EOF errors since it > can't find the WALTrailer with the stale wal file length. > The fix applied nicely to branch-1 since we use FSHlog implementation which > closes the WAL file sychronously. > But in branch-2 and master, we use _AsyncFSWAL_ implementation and the > closing of wal file is done asynchronously (as the name suggests). This is > causing the test to fail. Below is the test. > {code:java} > @Test > public void testCleanClosedWALs() throws Exception { > try (WALEntryStream entryStream = new WALEntryStream( > logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) { > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > appendToLogAndSync(); > assertNotNull(entryStream.next()); > log.rollWriter(); ===> This does an asynchronous close of wal. > appendToLogAndSync(); > assertNotNull(entryStream.next()); > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > } > } > {code} > In the above code, when we roll writer, we don't close the old wal file > immediately so the ReplicationReader thread is not able to get the updated > wal file size and that is throwing EOF errors. > If I add a sleep of few milliseconds (1 ms in my local env) between > rollWriter and appendToLogAndSync statement then the test passes but this is > *not* a proper fix since we are working around the race between > ReplicationSourceWALReaderThread and closing of WAL file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25942) Get rid of null regioninfo in wrapped connection exceptions
[ https://issues.apache.org/jira/browse/HBASE-25942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25942: - Fix Version/s: 2.4.4 2.5.0 3.0.0-alpha-1 > Get rid of null regioninfo in wrapped connection exceptions > --- > > Key: HBASE-25942 > URL: https://issues.apache.org/jira/browse/HBASE-25942 > Project: HBase > Issue Type: Improvement > Components: logging >Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.4.4 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Trivial > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4 > > > I was debugging something and noticed this, > {noformat} > org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: > java.io.IOException: Call to address=localhost:59142null failed on local > exception: java.io.IOException: Connection reset by peer > {noformat} > `null` there is regioninfo and because this call has nothing set on it, it > was confusing where that null is coming from, added recently via HBASE-25735. -- This message was sent by Atlassian Jira (v8.3.4#803005)
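Editor's note: to show the shape of the cleanup described above, the helper below only appends region details to the wrapped exception's message when the call actually carries them, so an address no longer renders as "localhost:59142null". The helper and its parameters are hypothetical illustrations, not the actual client-side call/exception code.
{code:java}
import java.net.InetSocketAddress;

// Build a call-failure message without leaking a literal "null" for region-less calls.
public final class CallErrorMessages {
  private CallErrorMessages() {}

  public static String describe(InetSocketAddress remote, String regionInfo, Throwable cause) {
    StringBuilder sb = new StringBuilder("Call to address=")
        .append(remote.getHostString()).append(':').append(remote.getPort());
    if (regionInfo != null && !regionInfo.isEmpty()) {
      sb.append(", region=").append(regionInfo);   // only when the call is region-scoped
    }
    sb.append(" failed on local exception: ").append(cause);
    return sb.toString();
  }
}
{code}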
[jira] [Updated] (HBASE-25942) Get rid of null regioninfo in wrapped connection exceptions
[ https://issues.apache.org/jira/browse/HBASE-25942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharath Vissapragada updated HBASE-25942: - Resolution: Fixed Status: Resolved (was: Patch Available) > Get rid of null regioninfo in wrapped connection exceptions > --- > > Key: HBASE-25942 > URL: https://issues.apache.org/jira/browse/HBASE-25942 > Project: HBase > Issue Type: Improvement > Components: logging >Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.4.4 >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Trivial > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4 > > > I was debugging something and noticed this, > {noformat} > org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: > java.io.IOException: Call to address=localhost:59142null failed on local > exception: java.io.IOException: Connection reset by peer > {noformat} > `null` there is regioninfo and because this call has nothing set on it, it > was confusing where that null is coming from, added recently via HBASE-25735. -- This message was sent by Atlassian Jira (v8.3.4#803005)