[jira] [Commented] (HBASE-25469) Add detailed RIT info in JSON format for consumption as metrics

2021-08-04 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393556#comment-17393556
 ] 

Michael Stack commented on HBASE-25469:
---

I think no, [~apurtell]. Just bug fixes for the branch we want to EOL. (Will catch 
this nice improvement on upgrade to branch-2.4.) Thanks.

> Add detailed RIT info in JSON format for consumption as metrics
> ---
>
> Key: HBASE-25469
> URL: https://issues.apache.org/jira/browse/HBASE-25469
> Project: HBase
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 3.0.0-alpha-1, 2.4.6, 2.3.7
>Reporter: Caroline Zhou
>Assignee: Caroline Zhou
>Priority: Minor
>  Labels: observability
> Fix For: 2.5.0, 3.0.0-alpha-2, 1.7.2, 2.4.6
>
> Attachments: Screen Shot 2021-07-27 at 10.34.45.png, Screen Shot 
> 2021-07-27 at 10.34.53.png
>
>
> In HBase 2.1+, there is a RIT jsp page that was added as part of HBASE-21410.
> There are some additional RIT details that would be helpful to have in one 
> place:
>  * RIT Start Time
>  * RIT Duration (ms)
>  * Server
>  * Procedure Type
> This info can be added to the table under the {{/rit.jsp}} page, and we can 
> also add a button on that page to view info as JSON, for easy parsing into 
> metrics, etc. This JSON dump can be served as a servlet.
> We may also consider different ways of grouping the JSON results, such as by 
> state, table, or server name, and/or adding counts of RIT by state or server 
> name.
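> A hypothetical shape for such a JSON dump (illustrative only; the field names 
> here are assumptions, not an implemented format):
> {code}
> {
>   "regionsInTransition": [
>     {
>       "region": "1588230740",
>       "state": "OPENING",
>       "server": "example-host,16020,1627000000000",
>       "procedureType": "TransitRegionStateProcedure",
>       "ritStartTime": 1627000000000,
>       "ritDurationMs": 12345
>     }
>   ],
>   "countByState": { "OPENING": 1 }
> }
> {code}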
> !Screen Shot 2021-07-27 at 10.34.45.png!
> !Screen Shot 2021-07-27 at 10.34.53.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26153) [create-release] Use cmd-line defined env vars

2021-08-04 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-26153.
---
Fix Version/s: 3.0.0-alpha-2
 Assignee: Michael Stack
   Resolution: Fixed

Merged trivial create-release script changes.

> [create-release] Use cmd-line defined env vars
> --
>
> Key: HBASE-26153
> URL: https://issues.apache.org/jira/browse/HBASE-26153
> Project: HBase
>  Issue Type: Improvement
>  Components: RC
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Trivial
> Fix For: 3.0.0-alpha-2
>
>
Minor item. The create-release script allows defining some of the variables it 
uses on the command line, but not all. Fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26165) 2.3.5 listed on website downloads page but row is intended for 2.3.6

2021-08-03 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392577#comment-17392577
 ] 

Michael Stack commented on HBASE-26165:
---

Thanks boys.

> 2.3.5 listed on website downloads page but row is intended for 2.3.6
> --
>
> Key: HBASE-26165
> URL: https://issues.apache.org/jira/browse/HBASE-26165
> Project: HBase
>  Issue Type: Task
>  Components: website
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Trivial
> Fix For: 3.0.0-alpha-2
>
>
> Typo on downloads.html. Row is for 2.3.6 but still says 2.3.5.
> Missed in HBASE-26162. PR coming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26162) Release 2.3.6

2021-08-02 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-26162.
---
Fix Version/s: 2.3.7
 Assignee: Michael Stack
   Resolution: Fixed

Sent announcement email, ran all steps in above list. Downloads will update 
tonight. Resolving.

> Release 2.3.6
> -
>
> Key: HBASE-26162
> URL: https://issues.apache.org/jira/browse/HBASE-26162
> Project: HBase
>  Issue Type: Task
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 2.3.7
>
> Attachments: image-2021-08-02-09-54-56-469.png
>
>
> 2.3.6RC3 was voted as the 2.3.6 release.
> Run the release steps listed here for 2.3.6
> !image-2021-08-02-09-54-56-469.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26162) Release 2.3.6

2021-08-02 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391704#comment-17391704
 ] 

Michael Stack commented on HBASE-26162:
---

Moved stable pointer from 2.3.5 to 2.3.6.

> Release 2.3.6
> -
>
> Key: HBASE-26162
> URL: https://issues.apache.org/jira/browse/HBASE-26162
> Project: HBase
>  Issue Type: Task
>Reporter: Michael Stack
>Priority: Major
> Attachments: image-2021-08-02-09-54-56-469.png
>
>
> 2.3.6RC3 was voted as the 2.3.6 release.
> Run the release steps listed here for 2.3.6
> !image-2021-08-02-09-54-56-469.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26162) Release 2.3.6

2021-08-02 Thread Michael Stack (Jira)
Michael Stack created HBASE-26162:
-

 Summary: Release 2.3.6
 Key: HBASE-26162
 URL: https://issues.apache.org/jira/browse/HBASE-26162
 Project: HBase
  Issue Type: Task
Reporter: Michael Stack
 Attachments: image-2021-08-02-09-54-56-469.png

2.3.6RC3 was voted as the 2.3.6 release.

Run the release steps listed here for 2.3.6

!image-2021-08-02-09-54-56-469.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-08-02 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26027:
--
Fix Version/s: (was: 2.3.6)
   2.3.7

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.4.6, 2.3.7
>
>
> The batch API of HTable takes a param named results to store results or 
> exceptions; its type is Object[].
> If the user passes an array of another type, e.g. 
> org.apache.hadoop.hbase.client.Result[], and we need to put an exception 
> into it for some reason, then an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, 
> AsyncRequestFutureImpl.decActionCounter is skipped, and in 
> AsyncRequestFutureImpl.waitUntilDone we get stuck checking 
> actionsInProgress again and again, forever.
> It is better to add a cutoff calculated from operationTimeout, instead of 
> depending only on the value of actionsInProgress.
> BTW, this issue only applies to 2.x; since 3.x the implementation has been 
> refactored.
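> To make the failure mode concrete, here is a minimal standalone sketch of the 
> covariant-array hazard (plain Java, not HBase code; class and values are 
> illustrative):
> {code:java}
> // Java arrays are covariant: a narrower array may be passed where Object[]
> // is expected. The assignment below compiles, but storing an element of an
> // incompatible type fails at runtime with ArrayStoreException.
> public class ArrayStoreDemo {
>   public static void main(String[] args) {
>     Object[] results = new String[1]; // caller-supplied narrower array
>     results[0] = Integer.valueOf(42); // throws java.lang.ArrayStoreException
>   }
> }
> {code}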
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>   Thread.sleep(2000);
> } catch (InterruptedException e) {
>   e.printStackTrace();
> }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
> byte[] cf = Bytes.toBytes("f");
> byte[] c = Bytes.toBytes("c1");
> List<Get> gets = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
>   byte[] rk = Bytes.toBytes("rk-" + i);
>   Get get = new Get(rk);
>   get.addColumn(cf, c);
>   gets.add(get);
> }
> Result[] results = new Result[gets.size()];
> table.batch(gets, results);
> {code}
> The log will look like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10  actions to 
> finish on table: test
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26153) [create-release] Use cmd-line defined env vars

2021-07-29 Thread Michael Stack (Jira)
Michael Stack created HBASE-26153:
-

 Summary: [create-release] Use cmd-line defined env vars
 Key: HBASE-26153
 URL: https://issues.apache.org/jira/browse/HBASE-26153
 Project: HBase
  Issue Type: Improvement
  Components: RC
Reporter: Michael Stack


Minor item. The create-release script allows defining some of the variables it 
uses on the command line, but not all. Fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26001) When access control is turned on, the cell level TTL of Increment and Append operations is invalid.

2021-07-27 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388450#comment-17388450
 ] 

Michael Stack commented on HBASE-26001:
---

I don't think the IllegalArgumentException causes the test to fail. The version 
of hadoop in hbase 2.3 is old, but it would be odd to need to update it to make a 
unit test pass? Seems like there is something about the 2.3 hbase context that 
makes the test fail. I didn't spend much time on it. [~xytss123]  Thanks.

> When access control is turned on, the cell level TTL of Increment and Append 
> operations is invalid.
> --
>
> Key: HBASE-26001
> URL: https://issues.apache.org/jira/browse/HBASE-26001
> Project: HBase
>  Issue Type: Bug
>  Components: Coprocessors
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.6.7, 2.5.0, 2.4.5
>
>
> AccessController postIncrementBeforeWAL() and postAppendBeforeWAL() methods 
> rewrite the new cell's tags with the old cell's. This makes the other kinds of 
> tags in the new cell (such as the TTL tag) invisible afterwards. Since in 
> Increment and Append operations the new cell has already carried forward all 
> tags of the old cell plus the TTL tag from the mutation operation, here in 
> AccessController we do not need to rewrite the tags again. Also, the TTL 
> tag of newCell will be invisible in the newly created cell. Actually, in 
> Increment and Append operations, the newCell has already copied all tags of 
> the oldCell, so the oldCell is useless here.
> {code:java}
> private Cell createNewCellWithTags(Mutation mutation, Cell oldCell, Cell newCell) {
>   // Collect any ACLs from the old cell
>   List<Tag> tags = Lists.newArrayList();
>   List<Tag> aclTags = Lists.newArrayList();
>   ListMultimap<String, Permission> perms = ArrayListMultimap.create();
>   if (oldCell != null) {
>     Iterator<Tag> tagIterator = PrivateCellUtil.tagsIterator(oldCell);
>     while (tagIterator.hasNext()) {
>       Tag tag = tagIterator.next();
>       if (tag.getType() != PermissionStorage.ACL_TAG_TYPE) {
>         // Not an ACL tag, just carry it through
>         if (LOG.isTraceEnabled()) {
>           LOG.trace("Carrying forward tag from " + oldCell + ": type " + tag.getType()
>             + " length " + tag.getValueLength());
>         }
>         tags.add(tag);
>       } else {
>         aclTags.add(tag);
>       }
>     }
>   }
>   // Do we have an ACL on the operation?
>   byte[] aclBytes = mutation.getACL();
>   if (aclBytes != null) {
>     // Yes, use it
>     tags.add(new ArrayBackedTag(PermissionStorage.ACL_TAG_TYPE, aclBytes));
>   } else {
>     // No, use what we carried forward
>     if (perms != null) {
>       // TODO: If we collected ACLs from more than one tag we may have a
>       // List<Permission> of size > 1, this can be collapsed into a single
>       // Permission
>       if (LOG.isTraceEnabled()) {
>         LOG.trace("Carrying forward ACLs from " + oldCell + ": " + perms);
>       }
>       tags.addAll(aclTags);
>     }
>   }
>   // If we have no tags to add, just return
>   if (tags.isEmpty()) {
>     return newCell;
>   }
>   // Here the new cell's tags will be invisible.
>   return PrivateCellUtil.createCell(newCell, tags);
> }
> {code}
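> A sketch of the direction the description suggests (illustrative, not the 
> committed patch): since newCell already carries forward all non-ACL tags, 
> AccessController would only need to swap in an ACL tag from the mutation:
> {code:java}
> // Hypothetical simplification: keep newCell's existing tags (TTL included)
> // and only replace/add the ACL tag when the mutation supplies one.
> private Cell createNewCellWithTags(Mutation mutation, Cell newCell) {
>   byte[] aclBytes = mutation.getACL();
>   if (aclBytes == null) {
>     return newCell; // tags carried forward by Increment/Append stay visible
>   }
>   List<Tag> tags = Lists.newArrayList();
>   Iterator<Tag> tagIterator = PrivateCellUtil.tagsIterator(newCell);
>   while (tagIterator.hasNext()) {
>     Tag tag = tagIterator.next();
>     if (tag.getType() != PermissionStorage.ACL_TAG_TYPE) {
>       tags.add(tag); // preserve non-ACL tags such as TTL
>     }
>   }
>   tags.add(new ArrayBackedTag(PermissionStorage.ACL_TAG_TYPE, aclBytes));
>   return PrivateCellUtil.createCell(newCell, tags);
> }
> {code}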



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26146) Allow custom opts for hbck in hbase bin

2021-07-27 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-26146.
---
Fix Version/s: 2.4.6
 Release Note: Adds HBASE_HBCK_OPTS environment variable to bin/hbase for 
passing extra options to hbck/hbck2. Defaults to HBASE_SERVER_JAAS_OPTS if 
specified, or HBASE_REGIONSERVER_OPTS.
   Resolution: Fixed

Pushed #3537 to branch-2.4. Re-resolving.

 

Added a release note [~anoop.hbase]

> Allow custom opts for hbck in hbase bin
> ---
>
> Key: HBASE-26146
> URL: https://issues.apache.org/jira/browse/HBASE-26146
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.6
>
>
> https://issues.apache.org/jira/browse/HBASE-15145 made it so that when you 
> execute {{hbase hbck}}, the regionserver or JAAS opts are added automatically 
> to the command line. This is problematic in some cases depending on what 
> regionserver opts have been set. For instance, one might configure a jmx port 
> for the regionserver but then hbck will fail due to a port conflict if run on 
> the same host as a regionserver. Another example would be that a regionserver 
> might define an {{-Xms}} value which is significantly more than hbck requires.
>  
> We should make it possible for users to define their own HBASE_HBCK_OPTS 
> which take precedence over the server opts added by default.
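> For example (illustrative usage, not taken from the patch itself), one could 
> run {{HBASE_HBCK_OPTS="-Xmx512m" hbase hbck}} so that hbck gets its own modest 
> heap instead of inheriting the regionserver's {{-Xms}}/{{-Xmx}} and jmx 
> settings.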



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HBASE-26146) Allow custom opts for hbck in hbase bin

2021-07-27 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack reopened HBASE-26146:
---

Reopening to apply backport to branch-2.3 (#3537)

> Allow custom opts for hbck in hbase bin
> ---
>
> Key: HBASE-26146
> URL: https://issues.apache.org/jira/browse/HBASE-26146
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> https://issues.apache.org/jira/browse/HBASE-15145 made it so that when you 
> execute {{hbase hbck}}, the regionserver or JAAS opts are added automatically 
> to the command line. This is problematic in some cases depending on what 
> regionserver opts have been set. For instance, one might configure a jmx port 
> for the regionserver but then hbck will fail due to a port conflict if run on 
> the same host as a regionserver. Another example would be that a regionserver 
> might define an {{-Xms}} value which is significantly more than hbck requires.
>  
> We should make it possible for users to define their own HBASE_HBCK_OPTS 
> which take precedence over the server opts added by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26148) Backport HBASE-26146 to branch-2.4

2021-07-27 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-26148.
---
Resolution: Invalid

> Backport HBASE-26146 to branch-2.4
> --
>
> Key: HBASE-26148
> URL: https://issues.apache.org/jira/browse/HBASE-26148
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26148) Backport HBASE-26146 to branch-2.4

2021-07-27 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388438#comment-17388438
 ] 

Michael Stack commented on HBASE-26148:
---

FYI, the PR subject should be "HBASE-26148 Backport HBASE-26146 to branch-2.4"; 
then the PR would be linked here. The PR summary referenced the parent issue, so 
the PR ended up there. Let me close this and make the commit against the parent 
then [~bbeaudreault]

> Backport HBASE-26146 to branch-2.4
> --
>
> Key: HBASE-26148
> URL: https://issues.apache.org/jira/browse/HBASE-26148
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25537) Misleading Range metrics

2021-07-27 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25537:
--
Fix Version/s: (was: 2.3.7)

> Misleading Range metrics 
> -
>
> Key: HBASE-25537
> URL: https://issues.apache.org/jira/browse/HBASE-25537
> Project: HBase
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.3.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Attachments: Screen Shot 2021-01-27 at 1.09.32 PM.png
>
>
> Found some cases where the max value is included in a smaller range, which is 
> confusing. Please see the attached file. The max is 7032; however, it cannot be 
> found in the timeRange report. The issue is that it is included in the 
> 1000~3000 range. In this case, the time range should be 1000 to infinite. 
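> A small sketch of the intended bucketing (hypothetical helper; the bounds are 
> illustrative, not the metric system's actual configuration):
> {code:java}
> // Values beyond the last configured bound should report in an open-ended
> // range rather than being folded into the last finite bucket.
> static String rangeLabel(long value, long[] upperBounds) {
>   long lower = 0;
>   for (long upper : upperBounds) {
>     if (value < upper) {
>       return lower + "-" + upper;
>     }
>     lower = upper;
>   }
>   // e.g. 7032 with a top bound of 1000 reports as "1000-inf", not "1000-3000"
>   return lower + "-inf";
> }
> {code}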



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10

2021-07-27 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26120:
--
Fix Version/s: (was: 2.3.7)
   2.3.6

> New replication gets stuck or data loss when multiwal groups more than 10
> -
>
> Key: HBASE-26120
> URL: https://issues.apache.org/jira/browse/HBASE-26120
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.7.1, 2.4.5
>Reporter: Jasee Tao
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 2.5.0, 2.3.6, 2.4.5, 1.7.2
>
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track each walgroup's last 
> WALlog, and all of them will be enqueued for replication when a new replication 
> peer is added.
> If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the names of the 
> WALlog groups will be _regionserver.null0.timestamp_ to 
> _regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ 
> to replace old logs in the same group, so when _regionserver.null1.ts_ comes, 
> _regionserver.null11.ts_ may be replaced, and *_latestPaths_ grows with 
> wrong logs*.
> Replication then gets partly stuck as _regionserver.null1.ts_ does not exist on 
> hdfs, and data may not be replicated to the slave as _regionserver.null11.ts_ is 
> not in the replication queue at startup.
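> An illustrative fix direction (an assumption, not necessarily the committed 
> patch): compare the extracted prefixes for equality instead of using 
> _String.contains_, so "regionserver.null1" cannot match inside 
> "regionserver.null11":
> {code:java}
> // Extract the walgroup prefix of the tracked path and require an exact
> // match with the new log's prefix before replacing it.
> String pathPrefix = DefaultWALProvider.getWALPrefixFromWALName(path.getName());
> if (pathPrefix.equals(logPrefix)) {
>   iterator.remove();
>   break;
> }
> {code}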
> Because of 
> [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], if there 
> are too many logs in zk _/hbase/replication/rs/regionserver/peer_, remove_peer 
> may not delete this znode, and other regionservers cannot pick up this 
> queue for replication failover. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26146) Allow custom opts for hbck in hbase bin

2021-07-27 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-26146.
---
Fix Version/s: 3.0.0-alpha-2
   2.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Thanks for the patch [~bbeaudreault] Merged to branch-2+.  It didn't go to 
branch-2.4. Conflicts. Make a subtask for a backport if you want it in 2.4/2.3 
boss.

> Allow custom opts for hbck in hbase bin
> ---
>
> Key: HBASE-26146
> URL: https://issues.apache.org/jira/browse/HBASE-26146
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> https://issues.apache.org/jira/browse/HBASE-15145 made it so that when you 
> execute {{hbase hbck}}, the regionserver or JAAS opts are added automatically 
> to the command line. This is problematic in some cases depending on what 
> regionserver opts have been set. For instance, one might configure a jmx port 
> for the regionserver but then hbck will fail due to a port conflict if run on 
> the same host as a regionserver. Another example would be that a regionserver 
> might define an {{-Xms}} value which is significantly more than hbck requires.
>  
> We should make it possible for users to define their own HBASE_HBCK_OPTS 
> which take precedence over the server opts added by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26121) Formatter to convert from epoch time to human readable date format.

2021-07-26 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387783#comment-17387783
 ] 

Michael Stack commented on HBASE-26121:
---

Sounds good. Check out HBASE-6592 ?

> Formatter to convert from epoch time to human readable date format.
> ---
>
> Key: HBASE-26121
> URL: https://issues.apache.org/jira/browse/HBASE-26121
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Reporter: Rushabh Shah
>Priority: Major
>
> In shell, we have a custom formatter to convert from bytes to Long/Int for 
> long/int data type values.
> Many times we store an epoch timestamp (event creation, update time) as a 
> long in our table columns. Even after converting such a column to Long, the 
> date is not in a human readable format. We still have to convert this long 
> into a date using some bash shell tricks, and it is not convenient to do for 
> many columns. We can introduce a new format method called +toLongDate+ which 
> signifies that we want to convert the bytes to long first and then to a date.
> Please let me know if any such functionality already exists and I am not 
> aware of it.
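> A rough sketch of what such a formatter could do (hypothetical helper; 
> +toLongDate+ does not exist yet):
> {code:java}
> import java.text.SimpleDateFormat;
> import java.util.Date;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public final class ToLongDate {
>   // Decode 8 bytes as a long epoch-millis value, then render it readably.
>   public static String toLongDate(byte[] value) {
>     long epochMillis = Bytes.toLong(value);
>     return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(epochMillis));
>   }
> }
> {code}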



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10

2021-07-26 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387618#comment-17387618
 ] 

Michael Stack commented on HBASE-26120:
---

Thanks for the notice [~zhangduo] ... Agree this is ugly. It's been a problem for 
a while. I was going to go forward w/ 2.3.6 anyways. If we have to do a new RC 
and a fix, good. Otherwise, I can put a notice on the release that there is 
this known issue? Thanks.

> New replication gets stuck or data loss when multiwal groups more than 10
> -
>
> Key: HBASE-26120
> URL: https://issues.apache.org/jira/browse/HBASE-26120
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.7.1, 2.4.5
>Reporter: Jasee Tao
>Priority: Critical
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track each walgroup's last 
> WALlog, and all of them will be enqueued for replication when a new replication 
> peer is added.
> If we set hbase.wal.regiongrouping.numgroups > 10, say 11, the names of the 
> WALlog groups will be _regionserver.null0.timestamp_ to 
> _regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ 
> to replace old logs in the same group, so when _regionserver.null1.ts_ comes, 
> _regionserver.null11.ts_ may be replaced, and *_latestPaths_ grows with 
> wrong logs*.
> Replication then gets partly stuck as _regionserver.null1.ts_ does not exist on 
> hdfs, and data may not be replicated to the slave as _regionserver.null11.ts_ is 
> not in the replication queue at startup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26001) When access control is turned on, the cell level TTL of Increment and Append operations is invalid.

2021-07-26 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-26001.
---
Resolution: Fixed

Removed 2.3.6 as fix version after revert.

> When access control is turned on, the cell level TTL of Increment and Append 
> operations is invalid.
> --
>
> Key: HBASE-26001
> URL: https://issues.apache.org/jira/browse/HBASE-26001
> Project: HBase
>  Issue Type: Bug
>  Components: Coprocessors
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Minor
> Fix For: 2.6.7, 2.5.0, 2.4.5, 3.0.0-alpha-1
>
>
> AccessController postIncrementBeforeWAL() and postAppendBeforeWAL() methods 
> rewrite the new cell's tags with the old cell's. This makes the other kinds of 
> tags in the new cell (such as the TTL tag) invisible afterwards. Since in 
> Increment and Append operations the new cell has already carried forward all 
> tags of the old cell plus the TTL tag from the mutation operation, here in 
> AccessController we do not need to rewrite the tags again. Also, the TTL 
> tag of newCell will be invisible in the newly created cell. Actually, in 
> Increment and Append operations, the newCell has already copied all tags of 
> the oldCell, so the oldCell is useless here.
> {code:java}
> private Cell createNewCellWithTags(Mutation mutation, Cell oldCell, Cell newCell) {
>   // Collect any ACLs from the old cell
>   List<Tag> tags = Lists.newArrayList();
>   List<Tag> aclTags = Lists.newArrayList();
>   ListMultimap<String, Permission> perms = ArrayListMultimap.create();
>   if (oldCell != null) {
>     Iterator<Tag> tagIterator = PrivateCellUtil.tagsIterator(oldCell);
>     while (tagIterator.hasNext()) {
>       Tag tag = tagIterator.next();
>       if (tag.getType() != PermissionStorage.ACL_TAG_TYPE) {
>         // Not an ACL tag, just carry it through
>         if (LOG.isTraceEnabled()) {
>           LOG.trace("Carrying forward tag from " + oldCell + ": type " + tag.getType()
>             + " length " + tag.getValueLength());
>         }
>         tags.add(tag);
>       } else {
>         aclTags.add(tag);
>       }
>     }
>   }
>   // Do we have an ACL on the operation?
>   byte[] aclBytes = mutation.getACL();
>   if (aclBytes != null) {
>     // Yes, use it
>     tags.add(new ArrayBackedTag(PermissionStorage.ACL_TAG_TYPE, aclBytes));
>   } else {
>     // No, use what we carried forward
>     if (perms != null) {
>       // TODO: If we collected ACLs from more than one tag we may have a
>       // List<Permission> of size > 1, this can be collapsed into a single
>       // Permission
>       if (LOG.isTraceEnabled()) {
>         LOG.trace("Carrying forward ACLs from " + oldCell + ": " + perms);
>       }
>       tags.addAll(aclTags);
>     }
>   }
>   // If we have no tags to add, just return
>   if (tags.isEmpty()) {
>     return newCell;
>   }
>   // Here the new cell's tags will be invisible.
>   return PrivateCellUtil.createCell(newCell, tags);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26001) When access control is turned on, the cell level TTL of Increment and Append operations is invalid.

2021-07-26 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26001:
--
Fix Version/s: (was: 2.3.6)

> When access control is turned on, the cell level TTL of Increment and Append 
> operations is invalid.
> --
>
> Key: HBASE-26001
> URL: https://issues.apache.org/jira/browse/HBASE-26001
> Project: HBase
>  Issue Type: Bug
>  Components: Coprocessors
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.6.7, 2.5.0, 2.4.5
>
>
> AccessController postIncrementBeforeWAL() and postAppendBeforeWAL() methods 
> rewrite the new cell's tags with the old cell's. This makes the other kinds of 
> tags in the new cell (such as the TTL tag) invisible afterwards. Since in 
> Increment and Append operations the new cell has already carried forward all 
> tags of the old cell plus the TTL tag from the mutation operation, here in 
> AccessController we do not need to rewrite the tags again. Also, the TTL 
> tag of newCell will be invisible in the newly created cell. Actually, in 
> Increment and Append operations, the newCell has already copied all tags of 
> the oldCell, so the oldCell is useless here.
> {code:java}
> private Cell createNewCellWithTags(Mutation mutation, Cell oldCell, Cell newCell) {
>   // Collect any ACLs from the old cell
>   List<Tag> tags = Lists.newArrayList();
>   List<Tag> aclTags = Lists.newArrayList();
>   ListMultimap<String, Permission> perms = ArrayListMultimap.create();
>   if (oldCell != null) {
>     Iterator<Tag> tagIterator = PrivateCellUtil.tagsIterator(oldCell);
>     while (tagIterator.hasNext()) {
>       Tag tag = tagIterator.next();
>       if (tag.getType() != PermissionStorage.ACL_TAG_TYPE) {
>         // Not an ACL tag, just carry it through
>         if (LOG.isTraceEnabled()) {
>           LOG.trace("Carrying forward tag from " + oldCell + ": type " + tag.getType()
>             + " length " + tag.getValueLength());
>         }
>         tags.add(tag);
>       } else {
>         aclTags.add(tag);
>       }
>     }
>   }
>   // Do we have an ACL on the operation?
>   byte[] aclBytes = mutation.getACL();
>   if (aclBytes != null) {
>     // Yes, use it
>     tags.add(new ArrayBackedTag(PermissionStorage.ACL_TAG_TYPE, aclBytes));
>   } else {
>     // No, use what we carried forward
>     if (perms != null) {
>       // TODO: If we collected ACLs from more than one tag we may have a
>       // List<Permission> of size > 1, this can be collapsed into a single
>       // Permission
>       if (LOG.isTraceEnabled()) {
>         LOG.trace("Carrying forward ACLs from " + oldCell + ": " + perms);
>       }
>       tags.addAll(aclTags);
>     }
>   }
>   // If we have no tags to add, just return
>   if (tags.isEmpty()) {
>     return newCell;
>   }
>   // Here the new cell's tags will be invisible.
>   return PrivateCellUtil.createCell(newCell, tags);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HBASE-26001) When access control is turned on, the cell level TTL of Increment and Append operations is invalid.

2021-07-26 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack reopened HBASE-26001:
---

Reopening to revert from branch-2.3.   The  new test added here is failing 100% 
on branch-2.3. See bottom of 
[https://ci-hadoop.apache.org/view/HBase/job/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/output/dashboard.html]
 Thanks.

> When access control is turned on, the cell level TTL of Increment and Append 
> operations is invalid.
> --
>
> Key: HBASE-26001
> URL: https://issues.apache.org/jira/browse/HBASE-26001
> Project: HBase
>  Issue Type: Bug
>  Components: Coprocessors
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.6.7, 2.5.0, 2.3.6, 2.4.5
>
>
> AccessController postIncrementBeforeWAL() and postAppendBeforeWAL() methods 
> rewrite the new cell's tags with the old cell's. This makes the other kinds of 
> tags in the new cell (such as the TTL tag) invisible afterwards. Since in 
> Increment and Append operations the new cell has already carried forward all 
> tags of the old cell plus the TTL tag from the mutation operation, here in 
> AccessController we do not need to rewrite the tags again. Also, the TTL 
> tag of newCell will be invisible in the newly created cell. Actually, in 
> Increment and Append operations, the newCell has already copied all tags of 
> the oldCell, so the oldCell is useless here.
> {code:java}
> private Cell createNewCellWithTags(Mutation mutation, Cell oldCell, Cell newCell) {
>   // Collect any ACLs from the old cell
>   List<Tag> tags = Lists.newArrayList();
>   List<Tag> aclTags = Lists.newArrayList();
>   ListMultimap<String, Permission> perms = ArrayListMultimap.create();
>   if (oldCell != null) {
>     Iterator<Tag> tagIterator = PrivateCellUtil.tagsIterator(oldCell);
>     while (tagIterator.hasNext()) {
>       Tag tag = tagIterator.next();
>       if (tag.getType() != PermissionStorage.ACL_TAG_TYPE) {
>         // Not an ACL tag, just carry it through
>         if (LOG.isTraceEnabled()) {
>           LOG.trace("Carrying forward tag from " + oldCell + ": type " + tag.getType()
>             + " length " + tag.getValueLength());
>         }
>         tags.add(tag);
>       } else {
>         aclTags.add(tag);
>       }
>     }
>   }
>   // Do we have an ACL on the operation?
>   byte[] aclBytes = mutation.getACL();
>   if (aclBytes != null) {
>     // Yes, use it
>     tags.add(new ArrayBackedTag(PermissionStorage.ACL_TAG_TYPE, aclBytes));
>   } else {
>     // No, use what we carried forward
>     if (perms != null) {
>       // TODO: If we collected ACLs from more than one tag we may have a
>       // List<Permission> of size > 1, this can be collapsed into a single
>       // Permission
>       if (LOG.isTraceEnabled()) {
>         LOG.trace("Carrying forward ACLs from " + oldCell + ": " + perms);
>       }
>       tags.addAll(aclTags);
>     }
>   }
>   // If we have no tags to add, just return
>   if (tags.isEmpty()) {
>     return newCell;
>   }
>   // Here the new cell's tags will be invisible.
>   return PrivateCellUtil.createCell(newCell, tags);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-07-26 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387508#comment-17387508
 ] 

Michael Stack commented on HBASE-25984:
---

Ignore my previous comment. Bisect (or more likely the pilot) identified the 
wrong issue... it is not this that is the cause of the failed test.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-07-26 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387427#comment-17387427
 ] 

Michael Stack commented on HBASE-26027:
---

Oops. Already reverted.

 

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.3.6, 2.4.6
>
>
> The batch API of HTable takes a param named results to store results or 
> exceptions; its type is Object[].
> If the user passes an array of another type, e.g. 
> org.apache.hadoop.hbase.client.Result[], and we need to put an exception 
> into it for some reason, then an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, 
> AsyncRequestFutureImpl.decActionCounter is skipped, and in 
> AsyncRequestFutureImpl.waitUntilDone we get stuck checking 
> actionsInProgress again and again, forever.
> It is better to add a cutoff calculated from operationTimeout, instead of 
> depending only on the value of actionsInProgress.
> BTW, this issue only applies to 2.x; since 3.x the implementation has been 
> refactored.
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>   Thread.sleep(2000);
> } catch (InterruptedException e) {
>   e.printStackTrace();
> }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
> byte[] cf = Bytes.toBytes("f");
> byte[] c = Bytes.toBytes("c1");
> List<Get> gets = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
>   byte[] rk = Bytes.toBytes("rk-" + i);
>   Get get = new Get(rk);
>   get.addColumn(cf, c);
>   gets.add(get);
> }
> Result[] results = new Result[gets.size()];
> table.batch(gets, results);
> {code}
> The log will look like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10  actions to 
> finish on table: test
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-07-24 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386798#comment-17386798
 ] 

Michael Stack commented on HBASE-26027:
---

git bisect says this is what caused TestClientOperationTimeout to fail near 
100% of the time on branch-2.3. See the tail of 
[https://ci-hadoop.apache.org/view/HBase/job/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/output/dashboard.html]

 

Let me see if can fix... else will revert. Thanks.

 

9c7d9fa229ea97c288fc1f6843bb05a7a4df4b87 is the first bad commit
commit 9c7d9fa229ea97c288fc1f6843bb05a7a4df4b87
Author: bsglz <18031...@qq.com>
Date: Thu Jul 1 19:16:51 2021 +0800

HBASE-26027 The calling of HTable.batch blocked at AsyncRequestFuture… (#3419)

* HBASE-26027 The calling of HTable.batch blocked at 
AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

(cherry picked from commit 1d6eb77ef8e813ce1050afe6d71954462ab0c28a)

.../org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)
bisect run success
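
For reference, a bisect like this can be scripted along these lines (illustrative 
invocation, not necessarily the exact one used): {{git bisect start <bad> <good>}} 
followed by {{git bisect run mvn -pl hbase-server -Dtest=TestClientOperationTimeout test}}.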

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.3.6, 2.4.6
>
>
> The batch API of HTable takes a param named results to store results or 
> exceptions; its type is Object[].
> If the user passes an array of another type, e.g. 
> org.apache.hadoop.hbase.client.Result[], and we need to put an exception 
> into it for some reason, then an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, 
> AsyncRequestFutureImpl.decActionCounter is skipped, and in 
> AsyncRequestFutureImpl.waitUntilDone we get stuck checking 
> actionsInProgress again and again, forever.
> It is better to add a cutoff calculated from operationTimeout, instead of 
> depending only on the value of actionsInProgress.
> BTW, this issue only applies to 2.x; since 3.x the implementation has been 
> refactored.
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>   Thread.sleep(2000);
> } catch (InterruptedException e) {
>   e.printStackTrace();
> }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
> byte[] cf = Bytes.toBytes("f");
> byte[] c = Bytes.toBytes("c1");
> List<Get> gets = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
>   byte[] rk = Bytes.toBytes("rk-" + i);
>   Get get = new Get(rk);
>   get.addColumn(cf, c);
>   gets.add(get);
> }
> Result[] results = new Result[gets.size()];
> table.batch(gets, results);
> {code}
> The log will look like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 

[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-07-24 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26027:
--
Fix Version/s: (was: 2.3.7)
   2.3.6

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.3.6, 2.4.6
>
>
> The batch API of HTable takes a param named results to store results or 
> exceptions; its type is Object[].
> If the user passes an array of another type, e.g. 
> org.apache.hadoop.hbase.client.Result[], and we need to put an exception 
> into it for some reason, then an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, 
> AsyncRequestFutureImpl.decActionCounter is skipped, and in 
> AsyncRequestFutureImpl.waitUntilDone we get stuck checking 
> actionsInProgress again and again, forever.
> It is better to add a cutoff calculated from operationTimeout, instead of 
> depending only on the value of actionsInProgress.
> BTW, this issue only applies to 2.x; since 3.x the implementation has been 
> refactored.
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>   Thread.sleep(2000);
> } catch (InterruptedException e) {
>   e.printStackTrace();
> }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
> byte[] cf = Bytes.toBytes("f");
> byte[] c = Bytes.toBytes("c1");
> List<Get> gets = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
>   byte[] rk = Bytes.toBytes("rk-" + i);
>   Get get = new Get(rk);
>   get.addColumn(cf, c);
>   gets.add(get);
> }
> Result[] results = new Result[gets.size()];
> table.batch(gets, results);
> {code}
> The log will look like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10  actions to 
> finish on table: test
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-07-24 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386780#comment-17386780
 ] 

Michael Stack edited comment on HBASE-25984 at 7/24/21, 10:14 PM:
--

git bisect flags this pr as why we have 100% fail running 
TestPostIncrementAndAppendBeforeWAL on branch 2.3 (Run on mac or see bottom of 
[https://ci-hadoop.apache.org/view/HBase/job/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/output/dashboard.html)]
 Let me see if can fix.


was (Author: stack):
git bisect flags this pr as why we have 100% fail running 
TestPostIncrementAndAppendBeforeWAL (See bottom of 
[https://ci-hadoop.apache.org/view/HBase/job/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/output/dashboard.html)]
 Let me see if can fix.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).
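
To make the reuse hazard concrete, here is a minimal sketch in plain Java. The 
class and method names below are invented for illustration; this is not the 
actual FSHLog/SyncFuture code, just the shape of the race it hit: a completion 
for an old round is consumed before the future object is reused for a new 
round, which then waits forever.
{code:java}
// Toy stand-in for a sync future that is reset and reused across rounds.
// Invented names; illustrative only.
class ToySyncFuture {
  private long waitingOnTxid; // round this future currently waits on
  private long doneTxid;      // highest txid a completer marked done

  synchronized void reset(long txid) { // reuse the same object for a new round
    this.waitingOnTxid = txid;
  }

  synchronized void done(long txid) { // a completer signals a round
    this.doneTxid = Math.max(this.doneTxid, txid);
    notifyAll();
  }

  synchronized boolean await(long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (doneTxid < waitingOnTxid) {
      long remain = deadline - System.currentTimeMillis();
      if (remain <= 0) {
        return false; // timed out here; the real code path waits forever
      }
      wait(remain);
    }
    return true;
  }
}

public class SyncFutureReuseDemo {
  public static void main(String[] args) throws Exception {
    ToySyncFuture f = new ToySyncFuture();
    f.reset(1);
    f.done(1);  // the runner completes round 1
    f.reset(2); // the handler reuses the same object for round 2, but the
                // completion path already consumed this future, so no
                // done(2) will ever arrive -- round 2 is stranded
    System.out.println("round 2 completed? " + f.await(500)); // prints false
  }
}
{code}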



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-07-24 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386780#comment-17386780
 ] 

Michael Stack commented on HBASE-25984:
---

git bisect flags this PR as the reason we have a 100% failure rate running 
TestPostIncrementAndAppendBeforeWAL (see the bottom of 
[https://ci-hadoop.apache.org/view/HBase/job/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/output/dashboard.html]).
 Let me see if I can fix it.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is deadlocked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25165) Change 'State time' in UI so sorts

2021-07-24 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25165:
--
Fix Version/s: 2.3.6

> Change 'State time' in UI so sorts
> --
>
> Key: HBASE-25165
> URL: https://issues.apache.org/jira/browse/HBASE-25165
> Project: HBase
>  Issue Type: Bug
>  Components: UI
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.4.0, 2.3.6
>
> Attachments: Screen Shot 2020-10-07 at 4.15.32 PM.png, Screen Shot 
> 2020-10-07 at 4.15.42 PM.png
>
>
> Here is a minor issue.
> I had an issue w/ crashing servers. The servers were auto-restarted on crash.
> To find the crashing servers, I was sorting on the 'Start time' column in the 
> Master UI. This basically worked, but the sort is unreliable because the date 
> we display starts with the day of the week.
> This issue is about moving to displaying the start time in ISO 8601 format, 
> which is sortable (and occupies less real estate). Let me add some images.
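
To see why the ISO 8601 rendering sorts where the old format does not, a small 
sketch (the timestamps are invented example values):
{code:java}
import java.time.Instant;
import java.time.format.DateTimeFormatter;

public class SortableStartTime {
  public static void main(String[] args) {
    // "Wed Oct 07 16:15:32 PDT 2020" sorts by weekday name; the ISO 8601
    // rendering sorts lexicographically in chronological order.
    DateTimeFormatter iso = DateTimeFormatter.ISO_INSTANT;
    String a = iso.format(Instant.ofEpochMilli(1602112472000L)); // earlier start
    String b = iso.format(Instant.ofEpochMilli(1602112532000L)); // later start
    System.out.println(a + " < " + b + " : " + (a.compareTo(b) < 0)); // true
  }
}
{code}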



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-25165) Change 'State time' in UI so sorts

2021-07-24 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-25165.
---
Resolution: Fixed

Pushed on branch-2.3. Re-resolving.

> Change 'State time' in UI so sorts
> --
>
> Key: HBASE-25165
> URL: https://issues.apache.org/jira/browse/HBASE-25165
> Project: HBase
>  Issue Type: Bug
>  Components: UI
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Minor
> Fix For: 2.3.6, 2.4.0, 3.0.0-alpha-1
>
> Attachments: Screen Shot 2020-10-07 at 4.15.32 PM.png, Screen Shot 
> 2020-10-07 at 4.15.42 PM.png
>
>
> Here is a minor issue.
> I had an issue w/ crashing servers. The servers were auto-restarted on crash.
> To find the crashing servers, I was sorting on the 'Start time' column in the 
> Master UI. This basically worked, but the sort is unreliable because the date 
> we display starts with the day of the week.
> This issue is about moving to displaying the start time in ISO 8601 format, 
> which is sortable (and occupies less real estate). Let me add some images.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HBASE-25165) Change 'State time' in UI so sorts

2021-07-24 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack reopened HBASE-25165:
---

Reopen to push on branch-2.3

> Change 'State time' in UI so sorts
> --
>
> Key: HBASE-25165
> URL: https://issues.apache.org/jira/browse/HBASE-25165
> Project: HBase
>  Issue Type: Bug
>  Components: UI
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.4.0
>
> Attachments: Screen Shot 2020-10-07 at 4.15.32 PM.png, Screen Shot 
> 2020-10-07 at 4.15.42 PM.png
>
>
> Here is a minor issue.
> I had an issue w/ crashing servers. The servers were auto-restarted on crash.
> To find the crashing servers, I was sorting on the 'Start time' column in the 
> Master UI. This basically worked, but the sort is unreliable because the date 
> we display starts with the day of the week.
> This issue is about moving to displaying the start time in ISO 8601 format, 
> which is sortable (and occupies less real estate). Let me add some images.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386442#comment-17386442
 ] 

Michael Stack commented on HBASE-26027:
---

Just to say that branch-2.3 builds were passing w/ this PR in place. Running on 
macOS is where it failed for me. Did not dig into why.

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.4.6, 2.3.7
>
>
> The batch API of HTable takes a param named results to store results or 
> exceptions; its type is Object[].
> If the user passes an array of another type, e.g. 
> org.apache.hadoop.hbase.client.Result[], and we need to put an exception 
> into it for some reason, an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, so 
> AsyncRequestFutureImpl.decActionCounter is skipped, and then 
> AsyncRequestFutureImpl.waitUntilDone gets stuck checking actionsInProgress 
> again and again, forever.
> It would be better to add a cutoff calculated from operationTimeout instead 
> of depending only on the value of actionsInProgress.
> BTW, this issue is only for 2.x; in 3.x the implementation has been refactored.
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>  Thread.sleep(2000);
>  } catch (InterruptedException e) {
>  e.printStackTrace();
>  }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
>  byte[] cf = Bytes.toBytes("f");
>  byte[] c = Bytes.toBytes("c1");
>  List<Get> gets = new ArrayList<>();
>  for (int i = 0; i < 10; i++) {
>  byte[] rk = Bytes.toBytes("rk-" + i);
>  Get get = new Get(rk);
>  get.addColumn(cf, c);
>  gets.add(get);
>  }
>  Result[] results = new Result[gets.size()];
>  table.batch(gets, results);
> {code}
> The log will looks like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10  actions to 
> finish on table: test
> {code}
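
Since the digest keeps coming back to this one, a self-contained sketch of the 
underlying Java behavior may help. String[] and Exception stand in for the 
HBase Result[] and DoNotRetryIOException; the array covariance is the point:
{code:java}
public class ArrayStoreDemo {
  public static void main(String[] args) {
    // Array covariance lets a typed array pass where Object[] is declared,
    // exactly like passing a Result[] to the Object[] results parameter.
    Object[] results = new String[1]; // stand-in for new Result[n]
    try {
      results[0] = new Exception("server-side failure"); // stand-in for the exception
    } catch (ArrayStoreException e) {
      // The store fails at runtime; in the bug this unwinds updateResult
      // before decActionCounter runs, so waitUntilDone loops forever.
      System.out.println("caught: " + e);
    }
  }
}
{code}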



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25698) Persistent IllegalReferenceCountException at scanner open when using TinyLfuBlockCache

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386403#comment-17386403
 ] 

Michael Stack commented on HBASE-25698:
---

Added 2.3.6 to fix versions... This PR was applied to branch-2.3 too.

> Persistent IllegalReferenceCountException at scanner open when using 
> TinyLfuBlockCache
> --
>
> Key: HBASE-25698
> URL: https://issues.apache.org/jira/browse/HBASE-25698
> Project: HBase
>  Issue Type: Bug
>  Components: BucketCache, HFile, Scanners
>Affects Versions: 2.4.2
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.5
>
>
> Persistent scanner open failure with offheap read path enabled.
> Not sure how it happened. Test scenario was HBase 1 cluster replicating to 
> HBase 2 cluster. ITBLL as data generator at source, calm policy only. Scanner 
> open errors on sink HBase 2 cluster later during ITBLL verify phase. Sink 
> schema settings bloom=ROW encoding=FAST_DIFF compression=NONE.
> {noformat}
> Caused by: 
> org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException: 
> refCnt: 0, decrement: 1
> at 
> org.apache.hbase.thirdparty.io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:74)
> at 
> org.apache.hbase.thirdparty.io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:138)
> at 
> org.apache.hbase.thirdparty.io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)
> at org.apache.hadoop.hbase.nio.ByteBuff.release(ByteBuff.java:79)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.release(HFileBlock.java:429)
> at 
> org.apache.hadoop.hbase.io.hfile.CompoundBloomFilter.contains(CompoundBloomFilter.java:109)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileReader.checkGeneralBloomFilter(StoreFileReader.java:433)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileReader.passesGeneralRowBloomFilter(StoreFileReader.java:322)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileReader.passesBloomFilter(StoreFileReader.java:251)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.shouldUseScanner(StoreFileScanner.java:491)
> at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.selectScannersFrom(StoreScanner.java:471)
> at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.(StoreScanner.java:249)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2177)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2168)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:7172)
> {noformat}
> Bloom filter type on all files here is ROW, block encoding is FAST_DIFF:
> {noformat}
> hbase:017:0> describe "IntegrationTestBigLinkedList"
> Table IntegrationTestBigLinkedList is ENABLED 
>   
> IntegrationTestBigLinkedList  
>   
> COLUMN FAMILIES DESCRIPTION   
>   
> {NAME => 'big', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DIF
> F', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
> => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'} 
> {NAME => 'meta', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DI
> FF', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
> => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'}
> {NAME => 'tiny', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DI
> FF', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
> => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'}
> {noformat}
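
For anyone unfamiliar with the netty reference counting, here is what the 
refCnt: 0, decrement: 1 message means, as a standalone sketch against plain 
io.netty (the HBase code path uses the shaded org.apache.hbase.thirdparty 
packages): a release() against a buffer whose count already reached zero.
{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.IllegalReferenceCountException;

public class RefCntDemo {
  public static void main(String[] args) {
    ByteBuf buf = Unpooled.buffer(16);
    buf.release(); // refCnt 1 -> 0, the buffer is deallocated
    try {
      buf.release(); // a second decrement on a freed buffer
    } catch (IllegalReferenceCountException e) {
      System.out.println(e); // "refCnt: 0, decrement: 1"
    }
  }
}
{code}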



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25698) Persistent IllegalReferenceCountException at scanner open when using TinyLfuBlockCache

2021-07-23 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25698:
--
Fix Version/s: 2.3.6

> Persistent IllegalReferenceCountException at scanner open when using 
> TinyLfuBlockCache
> --
>
> Key: HBASE-25698
> URL: https://issues.apache.org/jira/browse/HBASE-25698
> Project: HBase
>  Issue Type: Bug
>  Components: BucketCache, HFile, Scanners
>Affects Versions: 2.4.2
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.5
>
>
> Persistent scanner open failure with offheap read path enabled.
> Not sure how it happened. Test scenario was HBase 1 cluster replicating to 
> HBase 2 cluster. ITBLL as data generator at source, calm policy only. Scanner 
> open errors on sink HBase 2 cluster later during ITBLL verify phase. Sink 
> schema settings bloom=ROW encoding=FAST_DIFF compression=NONE.
> {noformat}
> Caused by: 
> org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException: 
> refCnt: 0, decrement: 1
> at 
> org.apache.hbase.thirdparty.io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:74)
> at 
> org.apache.hbase.thirdparty.io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:138)
> at 
> org.apache.hbase.thirdparty.io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)
> at org.apache.hadoop.hbase.nio.ByteBuff.release(ByteBuff.java:79)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.release(HFileBlock.java:429)
> at 
> org.apache.hadoop.hbase.io.hfile.CompoundBloomFilter.contains(CompoundBloomFilter.java:109)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileReader.checkGeneralBloomFilter(StoreFileReader.java:433)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileReader.passesGeneralRowBloomFilter(StoreFileReader.java:322)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileReader.passesBloomFilter(StoreFileReader.java:251)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.shouldUseScanner(StoreFileScanner.java:491)
> at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.selectScannersFrom(StoreScanner.java:471)
> at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.(StoreScanner.java:249)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2177)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2168)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:7172)
> {noformat}
> Bloom filter type on all files here is ROW, block encoding is FAST_DIFF:
> {noformat}
> hbase:017:0> describe "IntegrationTestBigLinkedList"
> Table IntegrationTestBigLinkedList is ENABLED 
>   
> IntegrationTestBigLinkedList  
>   
> COLUMN FAMILIES DESCRIPTION   
>   
> {NAME => 'big', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DIF
> F', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
> => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'} 
> {NAME => 'meta', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DI
> FF', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
> => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'}
> {NAME => 'tiny', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DI
> FF', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
> => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25738) Backport HBASE-24305 to branch-2.2

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386378#comment-17386378
 ] 

Michael Stack commented on HBASE-25738:
---

Note to say this PR got committed referencing the parent HBASE-24305 Jira 
rather than this one; i.e. one PR was attributed to two jiras in the (EOL'd) 
2.2.7 release notes.

> Backport HBASE-24305 to branch-2.2
> --
>
> Key: HBASE-25738
> URL: https://issues.apache.org/jira/browse/HBASE-25738
> Project: HBase
>  Issue Type: Task
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Minor
>  Labels: backport
> Fix For: 2.2.7
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25734) Backport HBASE-24305 to branch-2.4

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386376#comment-17386376
 ] 

Michael Stack commented on HBASE-25734:
---

Note to say this PR got committed referencing the parent HBASE-24305 Jira 
rather than this one; i.e. one PR was attributed to two jiras in the 2.4.3 
release notes.

> Backport HBASE-24305 to branch-2.4
> --
>
> Key: HBASE-25734
> URL: https://issues.apache.org/jira/browse/HBASE-25734
> Project: HBase
>  Issue Type: Task
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Minor
>  Labels: backport
> Fix For: 2.4.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24305) Handle deprecations in ServerName

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386375#comment-17386375
 ] 

Michael Stack edited comment on HBASE-24305 at 7/23/21, 5:00 PM:
-

It looks like the backports got committed with the commit message referencing 
this Jira and not the dedicated sub-task Jiras, so I removed the relevant 'Fix 
Versions' from the sub-issues in favor of referencing releases here in this 
issue's 'Fix Versions' list (except where releases have already gone out on 
branch-2.4 and branch-2.2). I checked the backports and they seem correct, 
customized to the branch and not straight backports of the master PR.


was (Author: stack):
It looks like the backports got committed with the commit message referencing 
this Jira and not the dedicated sub-task Jiras, so I removed the relevant 'Fix 
Versions' from the sub-issues in favor of referencing releases here in this 
issue's 'Fix Versions' list (except where releases have already gone out on 
branch-2.4 and branch-2.2).

> Handle deprecations in ServerName
> -
>
> Key: HBASE-24305
> URL: https://issues.apache.org/jira/browse/HBASE-24305
> Project: HBase
>  Issue Type: Task
>Affects Versions: 3.0.0-alpha-1
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.2.7, 2.4.3, 2.3.6
>
>
> Some functions in {{ServerName}} were deprecated in 2.0.0 and should be 
> removed for version 3.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24305) Handle deprecations in ServerName

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386375#comment-17386375
 ] 

Michael Stack commented on HBASE-24305:
---

It looks like the backports got committed with the commit message referencing 
this Jira and not the dedicated sub-task Jiras, so I removed the relevant 'Fix 
Versions' from the sub-issues in favor of referencing releases here in this 
issue's 'Fix Versions' list (except where releases have already gone out on 
branch-2.4 and branch-2.2).

> Handle deprecations in ServerName
> -
>
> Key: HBASE-24305
> URL: https://issues.apache.org/jira/browse/HBASE-24305
> Project: HBase
>  Issue Type: Task
>Affects Versions: 3.0.0-alpha-1
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.2.7, 2.4.3, 2.3.6
>
>
> Some functions in {{ServerName}} were deprecated in 2.0.0 and should be 
> removed for version 3.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24514) Backport HBASE-24305 to branch-2

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386374#comment-17386374
 ] 

Michael Stack commented on HBASE-24514:
---

Removing 2.5.0 as fix version. There is no HBASE-24514 referenced in branch-2. 
The parent HBASE-24305 IS referenced in branch-2. I compared this PR to that 
which was committed to branch-2 with HBASE-24305 in the commit message and they 
align. The PR committed to branch-2 is NOT the PR that was applied to master.

> Backport HBASE-24305 to branch-2
> 
>
> Key: HBASE-24514
> URL: https://issues.apache.org/jira/browse/HBASE-24514
> Project: HBase
>  Issue Type: Task
>Affects Versions: 2.5.0
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Major
>  Labels: backport
>
> Backport the changes from HBASE-24305, which are not related to removed 
> deprecated methods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24514) Backport HBASE-24305 to branch-2

2021-07-23 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-24514:
--
Fix Version/s: (was: 2.5.0)

> Backport HBASE-24305 to branch-2
> 
>
> Key: HBASE-24514
> URL: https://issues.apache.org/jira/browse/HBASE-24514
> Project: HBase
>  Issue Type: Task
>Affects Versions: 2.5.0
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Major
>  Labels: backport
>
> Backport the changes from HBASE-24305, which are not related to removed 
> deprecated methods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25737) Backport HBASE-24305 to branch-2.3

2021-07-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385965#comment-17385965
 ] 

Michael Stack commented on HBASE-25737:
---

I checked the patch committed on branch-2.3. It's the PR that is here... so it 
just looks like the commit used the parent issue for its subject. Let me check 
the backports on other branches.

> Backport HBASE-24305 to branch-2.3
> --
>
> Key: HBASE-25737
> URL: https://issues.apache.org/jira/browse/HBASE-25737
> Project: HBase
>  Issue Type: Task
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Minor
>  Labels: backport
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25737) Backport HBASE-24305 to branch-2.3

2021-07-22 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385961#comment-17385961
 ] 

Michael Stack commented on HBASE-25737:
---

I'm removing 2.3.6 as 'Fix Version'. There is no HBASE-25737 on branch-2.3. The 
PR is in branch-2.3, but git blame says the PR came in under the parent issue:
{code:java}
commit a30d87369207785f078589e2a59a9250813b9aa3
Author: Jan Hentschel
Date:   Thu Apr 8 15:06:02 2021 +0200

    HBASE-24305 Prepare deprecations in ServerName (#1666) (#3128)

    Signed-off-by: Duo Zhang
    Signed-off-by: stack
{code}
The PR seems to be all there, so I will just leave it credited to HBASE-24305.

> Backport HBASE-24305 to branch-2.3
> --
>
> Key: HBASE-25737
> URL: https://issues.apache.org/jira/browse/HBASE-25737
> Project: HBase
>  Issue Type: Task
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Minor
>  Labels: backport
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25737) Backport HBASE-24305 to branch-2.3

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25737:
--
Fix Version/s: (was: 2.3.6)

> Backport HBASE-24305 to branch-2.3
> --
>
> Key: HBASE-25737
> URL: https://issues.apache.org/jira/browse/HBASE-25737
> Project: HBase
>  Issue Type: Task
>Reporter: Jan Hentschel
>Assignee: Jan Hentschel
>Priority: Minor
>  Labels: backport
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25792) Filter out o.a.hadoop.thirdparty building shaded jars

2021-07-22 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385947#comment-17385947
 ] 

Michael Stack commented on HBASE-25792:
---

The revert was not complete... This bit was left over. I just reverted the 
below to complete the revert from branch-2.3.

 
{code:java}
commit 1832418ca7b78739c8443a4a1f3de587e847a41a (HEAD -> 2.3, origin/branch-2.3)
Author: stack
Date:   Thu Jul 22 22:25:36 2021 -0700

    Revert "HBASE-25792 Filter out o.a.hadoop.thirdparty building shaded jars"

    This reverts commit 59a67c6dfbae00bbf9c7994224b66406260e8f50.

diff --git a/hbase-shaded/pom.xml b/hbase-shaded/pom.xml
index 27e6d3cfd3..615bdb2a99 100644
--- a/hbase-shaded/pom.xml
+++ b/hbase-shaded/pom.xml
@@ -541,13 +541,6 @@
               <exclude>keytab.txt</exclude>
             </excludes>
           </filter>
-          <filter>
-            <artifact>*:*</artifact>
-            <excludes>
-              <exclude>org/apache/hadoop/thirdparty/**/*</exclude>
-            </excludes>
-          </filter>
-
         </filters>
{code}

> Filter out o.a.hadoop.thirdparty building shaded jars
> -
>
> Key: HBASE-25792
> URL: https://issues.apache.org/jira/browse/HBASE-25792
> Project: HBase
>  Issue Type: Bug
>  Components: shading
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.4.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Hadoop 3.3.1 (currently unreleased) shades guava. The shaded guava then trips 
> the check in our shading that tries to exclude hadoop bits from the fat jars 
> we build.
> For the issue to trigger, you need to build against the tip of hadoop 
> branch-3.3. You then get this complaint:
> {code}
> [INFO] --- exec-maven-plugin:1.6.0:exec (check-jar-contents) @ 
> hbase-shaded-check-invariants ---
> [ERROR] Found artifact with unexpected contents: 
> '/Users/stack/.m2/repository/org/apache/hbase/hbase-shaded-mapreduce/2.3.6-SNAPSHOT/hbase-shaded-mapreduce-2.3.6-SNAPSHOT.jar'
> Please check the following and either correct the build or update
> the allowed list with reasoning.
> org/apache/hadoop/thirdparty/
> org/apache/hadoop/thirdparty/com/
> org/apache/hadoop/thirdparty/com/google/
> org/apache/hadoop/thirdparty/com/google/common/
> org/apache/hadoop/thirdparty/com/google/common/annotations/
> org/apache/hadoop/thirdparty/com/google/common/annotations/Beta.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/GwtCompatible.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/GwtIncompatible.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/VisibleForTesting.class
> org/apache/hadoop/thirdparty/com/google/common/base/
> org/apache/hadoop/thirdparty/com/google/common/base/Absent.class
> 
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator$1.class
> 
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator$State.class
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator.class
> org/apache/hadoop/thirdparty/com/google/common/base/Ascii.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$1.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$2.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$3.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$4.class
> 
> {code}
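
As a quick way to check a build for this leak, something like the following 
works (an illustrative sketch, not project tooling; pass it the path of the 
shaded jar you built, e.g. the hbase-shaded-mapreduce jar above):
{code:java}
import java.util.Collections;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class CheckShadedJar {
  public static void main(String[] args) throws Exception {
    // args[0]: path to the shaded jar to inspect
    try (JarFile jar = new JarFile(args[0])) {
      Collections.list(jar.entries()).stream()
          .map(JarEntry::getName)
          .filter(n -> n.startsWith("org/apache/hadoop/thirdparty/"))
          .forEach(System.out::println); // any output means hadoop bits leaked in
    }
  }
}
{code}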



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25792) Filter out o.a.hadoop.thirdparty building shaded jars

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25792:
--
Fix Version/s: (was: 2.3.6)

> Filter out o.a.hadoop.thirdparty building shaded jars
> -
>
> Key: HBASE-25792
> URL: https://issues.apache.org/jira/browse/HBASE-25792
> Project: HBase
>  Issue Type: Bug
>  Components: shading
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.4.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Hadoop 3.3.1 (currently unreleased) shades guava. The shaded guava then trips 
> the check in our shading that tries to exclude hadoop bits from the fat jars 
> we build.
> For the issue to trigger, you need to build against the tip of hadoop 
> branch-3.3. You then get this complaint:
> {code}
> [INFO] --- exec-maven-plugin:1.6.0:exec (check-jar-contents) @ 
> hbase-shaded-check-invariants ---
> [ERROR] Found artifact with unexpected contents: 
> '/Users/stack/.m2/repository/org/apache/hbase/hbase-shaded-mapreduce/2.3.6-SNAPSHOT/hbase-shaded-mapreduce-2.3.6-SNAPSHOT.jar'
> Please check the following and either correct the build or update
> the allowed list with reasoning.
> org/apache/hadoop/thirdparty/
> org/apache/hadoop/thirdparty/com/
> org/apache/hadoop/thirdparty/com/google/
> org/apache/hadoop/thirdparty/com/google/common/
> org/apache/hadoop/thirdparty/com/google/common/annotations/
> org/apache/hadoop/thirdparty/com/google/common/annotations/Beta.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/GwtCompatible.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/GwtIncompatible.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/VisibleForTesting.class
> org/apache/hadoop/thirdparty/com/google/common/base/
> org/apache/hadoop/thirdparty/com/google/common/base/Absent.class
> 
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator$1.class
> 
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator$State.class
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator.class
> org/apache/hadoop/thirdparty/com/google/common/base/Ascii.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$1.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$2.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$3.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$4.class
> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25792) Filter out o.a.hadoop.thirdparty building shaded jars

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25792:
--
Fix Version/s: 2.3.6

> Filter out o.a.hadoop.thirdparty building shaded jars
> -
>
> Key: HBASE-25792
> URL: https://issues.apache.org/jira/browse/HBASE-25792
> Project: HBase
>  Issue Type: Bug
>  Components: shading
>Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.4.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3, 2.3.6
>
>
> Hadoop 3.3.1 (currently unreleased) shades guava. The shaded guava then trips 
> the check in our shading that tries to exclude hadoop bits from the fat jars 
> we build.
> For the issue to trigger, you need to build against the tip of hadoop 
> branch-3.3. You then get this complaint:
> {code}
> [INFO] --- exec-maven-plugin:1.6.0:exec (check-jar-contents) @ 
> hbase-shaded-check-invariants ---
> [ERROR] Found artifact with unexpected contents: 
> '/Users/stack/.m2/repository/org/apache/hbase/hbase-shaded-mapreduce/2.3.6-SNAPSHOT/hbase-shaded-mapreduce-2.3.6-SNAPSHOT.jar'
> Please check the following and either correct the build or update
> the allowed list with reasoning.
> org/apache/hadoop/thirdparty/
> org/apache/hadoop/thirdparty/com/
> org/apache/hadoop/thirdparty/com/google/
> org/apache/hadoop/thirdparty/com/google/common/
> org/apache/hadoop/thirdparty/com/google/common/annotations/
> org/apache/hadoop/thirdparty/com/google/common/annotations/Beta.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/GwtCompatible.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/GwtIncompatible.class
> 
> org/apache/hadoop/thirdparty/com/google/common/annotations/VisibleForTesting.class
> org/apache/hadoop/thirdparty/com/google/common/base/
> org/apache/hadoop/thirdparty/com/google/common/base/Absent.class
> 
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator$1.class
> 
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator$State.class
> org/apache/hadoop/thirdparty/com/google/common/base/AbstractIterator.class
> org/apache/hadoop/thirdparty/com/google/common/base/Ascii.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$1.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$2.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$3.class
> org/apache/hadoop/thirdparty/com/google/common/base/CaseFormat$4.class
> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-25938) The SnapshotOfRegionAssignmentFromMeta.initialize call in FavoredNodeLoadBalancer is just a dummy one

2021-07-22 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385941#comment-17385941
 ] 

Michael Stack edited comment on HBASE-25938 at 7/23/21, 5:09 AM:
-

This change was reverted from branch-2.3. Removing 2.3.6 as a fix version.
{code:java}
commit 075c055c871ec0b7271114fc24e24d7eb92180cc
Author: stack
Date:   Tue Jun 1 15:35:12 2021 -0700

    Revert "HBASE-25938 The SnapshotOfRegionAssignmentFromMeta.initialize call
    in FavoredNodeLoadBalancer is just a dummy one (#3329)"

    This reverts commit 0620f08bdf4a28ff8e80db31d1e655b8abca7fbb.

    Mistakenly pushed on branch-2.3

{code}


was (Author: stack):
This change was reverted from branch-2.3. Removing 2.3.6 as a fix version.
{code:java}
  
075c055c87 Revert "HBASE-25938 The 
SnapshotOfRegionAssignmentFromMeta.initialize call in FavoredNodeLoadBalancer 
is just a dummy one (#3329)"{code}

> The SnapshotOfRegionAssignmentFromMeta.initialize call in 
> FavoredNodeLoadBalancer is just a dummy one
> -
>
> Key: HBASE-25938
> URL: https://issues.apache.org/jira/browse/HBASE-25938
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer, FavoredNodes
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
>
> After we introduced FavoredNodesManager, we do not need to load this 
> information every time we balance, as all the updates to favored nodes 
> go to FavoredNodesManager and it can update its in-memory state 
> accordingly.
> The current SnapshotOfRegionAssignmentFromMeta.initialize is just a dummy one 
> in FavoredNodeLoadBalancer, as we never use it in the method.
> It will introduce a full scan on hbase:meta so let's just remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25938) The SnapshotOfRegionAssignmentFromMeta.initialize call in FavoredNodeLoadBalancer is just a dummy one

2021-07-22 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385941#comment-17385941
 ] 

Michael Stack commented on HBASE-25938:
---

This change was reverted from branch-2.3. Removing 2.3.6 as a fix version.
{code:java}
  
075c055c87 Revert "HBASE-25938 The 
SnapshotOfRegionAssignmentFromMeta.initialize call in FavoredNodeLoadBalancer 
is just a dummy one (#3329)"{code}

> The SnapshotOfRegionAssignmentFromMeta.initialize call in 
> FavoredNodeLoadBalancer is just a dummy one
> -
>
> Key: HBASE-25938
> URL: https://issues.apache.org/jira/browse/HBASE-25938
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer, FavoredNodes
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
>
> After we introduced FavoredNodesManager, we do not need to load this 
> information every time we balance, as all the updates to favored nodes 
> go to FavoredNodesManager and it can update its in-memory state 
> accordingly.
> The current SnapshotOfRegionAssignmentFromMeta.initialize is just a dummy one 
> in FavoredNodeLoadBalancer, as we never use it in the method.
> It will introduce a full scan on hbase:meta so let's just remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25938) The SnapshotOfRegionAssignmentFromMeta.initialize call in FavoredNodeLoadBalancer is just a dummy one

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25938:
--
Fix Version/s: (was: 2.3.6)

> The SnapshotOfRegionAssignmentFromMeta.initialize call in 
> FavoredNodeLoadBalancer is just a dummy one
> -
>
> Key: HBASE-25938
> URL: https://issues.apache.org/jira/browse/HBASE-25938
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer, FavoredNodes
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.4
>
>
> After we introduced FavoredNodesManager, we do not need to load this 
> information every time we balance, as all the updates to favored nodes 
> go to FavoredNodesManager and it can update its in-memory state 
> accordingly.
> The current SnapshotOfRegionAssignmentFromMeta.initialize is just a dummy one 
> in FavoredNodeLoadBalancer, as we never use it in the method.
> It will introduce a full scan on hbase:meta so let's just remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26010) Backport HBASE-25703 and HBASE-26002 to branch-2.3

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26010:
--
Fix Version/s: (was: 2.3.6)

> Backport HBASE-25703 and HBASE-26002 to branch-2.3
> --
>
> Key: HBASE-26010
> URL: https://issues.apache.org/jira/browse/HBASE-26010
> Project: HBase
>  Issue Type: Improvement
>  Components: backport
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
>
> Backport HBASE-25703 "Support conditional update in MultiRowMutationEndpoint" 
> and HBASE-26002 "MultiRowMutationEndpoint should return the result of the 
> conditional update" to branch-2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26009) Backport HBASE-25766 "Introduce RegionSplitRestriction that restricts the pattern of the split point" to branch-2.3

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26009:
--
Fix Version/s: (was: 2.3.6)

> Backport HBASE-25766 "Introduce RegionSplitRestriction that restricts the 
> pattern of the split point" to branch-2.3
> ---
>
> Key: HBASE-26009
> URL: https://issues.apache.org/jira/browse/HBASE-26009
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
>
> Backport the parent issue to branch-2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25537) Misleading Range metrcis

2021-07-22 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385928#comment-17385928
 ] 

Michael Stack commented on HBASE-25537:
---

Moved to 2.3.7. Shout if an issue [~huaxiangsun]

> Misleading Range metrcis 
> -
>
> Key: HBASE-25537
> URL: https://issues.apache.org/jira/browse/HBASE-25537
> Project: HBase
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.3.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: 2.3.7
>
> Attachments: Screen Shot 2021-01-27 at 1.09.32 PM.png
>
>
> Found some cases where the max value is included in a smaller range, which is 
> confusing. Please see the attached file. The max is 7032; however, it cannot 
> be found in the timeRange report. The issue is that it is included in the 
> 1000~3000 range. In this case, the time range should be 1000 - infinite. 
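
The fix amounts to labelling the overflow bucket open-ended. A minimal sketch 
with invented bucket boundaries (not the actual metrics histogram code):
{code:java}
public class RangeLabelDemo {
  public static void main(String[] args) {
    long[] bounds = {0, 1000, 3000}; // invented boundaries for illustration
    long max = 7032;                 // the max from the attached report

    // values beyond the last boundary clamp into the last bucket
    int i = 1;
    while (i < bounds.length - 1 && max >= bounds[i]) {
      i++;
    }
    boolean lastBucket = (i == bounds.length - 1);

    // a clamped value can exceed the last bucket's nominal upper bound,
    // so the last bucket should be reported as open-ended
    String label = lastBucket
        ? bounds[i - 1] + "-inf"
        : bounds[i - 1] + "-" + bounds[i];
    System.out.println(label); // 1000-inf, where 7032 was actually counted
  }
}
{code}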



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25537) Misleading Range metrcis

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25537:
--
Fix Version/s: (was: 2.3.6)
   2.3.7

> Misleading Range metrcis 
> -
>
> Key: HBASE-25537
> URL: https://issues.apache.org/jira/browse/HBASE-25537
> Project: HBase
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.3.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: 2.3.7
>
> Attachments: Screen Shot 2021-01-27 at 1.09.32 PM.png
>
>
> Found some cases where the max value is included in a smaller range, which is 
> confusing. Please see the attached file. The max is 7032; however, it cannot 
> be found in the timeRange report. The issue is that it is included in the 
> 1000~3000 range. In this case, the time range should be 1000 - infinite. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-07-22 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385926#comment-17385926
 ] 

Michael Stack commented on HBASE-26027:
---

Reverted from branch-2.3 and branch-2.

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.4.6, 2.3.7
>
>
> The batch API of HTable takes a param named results to store results or 
> exceptions; its type is Object[].
> If the user passes an array of another type, e.g. 
> org.apache.hadoop.hbase.client.Result[], and we need to put an exception 
> into it for some reason, an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, so 
> AsyncRequestFutureImpl.decActionCounter is skipped, and then 
> AsyncRequestFutureImpl.waitUntilDone gets stuck checking actionsInProgress 
> again and again, forever.
> It would be better to add a cutoff calculated from operationTimeout instead 
> of depending only on the value of actionsInProgress.
> BTW, this issue is only for 2.x; in 3.x the implementation has been refactored.
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>  Thread.sleep(2000);
>  } catch (InterruptedException e) {
>  e.printStackTrace();
>  }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
>  byte[] cf = Bytes.toBytes("f");
>  byte[] c = Bytes.toBytes("c1");
>  List<Get> gets = new ArrayList<>();
>  for (int i = 0; i < 10; i++) {
>  byte[] rk = Bytes.toBytes("rk-" + i);
>  Get get = new Get(rk);
>  get.addColumn(cf, c);
>  gets.add(get);
>  }
>  Result[] results = new Result[gets.size()];
>  table.batch(gets, results);
> {code}
> The log will looks like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10  actions to 
> finish on table: test
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-07-22 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26027:
--
Fix Version/s: (was: 2.3.6)
   2.3.7

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.4.6, 2.3.7
>
>
> The batch API of HTable takes a param named results to store results or 
> exceptions; its type is Object[].
> If the user passes an array of another type, e.g. 
> org.apache.hadoop.hbase.client.Result[], and we need to put an exception 
> into it for some reason, an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, so 
> AsyncRequestFutureImpl.decActionCounter is skipped, and then 
> AsyncRequestFutureImpl.waitUntilDone gets stuck checking actionsInProgress 
> again and again, forever.
> It would be better to add a cutoff calculated from operationTimeout instead 
> of depending only on the value of actionsInProgress.
> BTW, this issue is only for 2.x; in 3.x the implementation has been refactored.
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>  Thread.sleep(2000);
>  } catch (InterruptedException e) {
>  e.printStackTrace();
>  }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
>  byte[] cf = Bytes.toBytes("f");
>  byte[] c = Bytes.toBytes("c1");
>  List<Get> gets = new ArrayList<>();
>  for (int i = 0; i < 10; i++) {
>  byte[] rk = Bytes.toBytes("rk-" + i);
>  Get get = new Get(rk);
>  get.addColumn(cf, c);
>  gets.add(get);
>  }
>  Result[] results = new Result[gets.size()];
>  table.batch(gets, results);
> {code}
> The log will looks like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10  actions to 
> finish on table: test
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException

2021-07-22 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385922#comment-17385922
 ] 

Michael Stack commented on HBASE-26027:
---

Let me follow [~apurtell]'s lead. I tried the test on 2.3.6 and it fails for 
me on Mac OS X too:
{code:java}
[INFO] Running org.apache.hadoop.hbase.TestClientOperationTimeout
[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 18.488 
s <<< FAILURE! - in org.apache.hadoop.hbase.TestClientOperationTimeout
[ERROR] org.apache.hadoop.hbase.TestClientOperationTimeout.testMultiPutsTimeout 
 Time elapsed: 0.629 s  <<< FAILURE!
java.lang.AssertionError: should not reach here
at 
org.apache.hadoop.hbase.TestClientOperationTimeout.testMultiPutsTimeout(TestClientOperationTimeout.java:176)[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestClientOperationTimeout.testMultiPutsTimeout:176 should not reach 
here
[INFO]
[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0 {code}
Let me revert from branch-2.3 too... and might as well do 2.5 too while we are 
at it.
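
As an aside on the root cause: it is plain Java array-store covariance, nothing 
HBase-specific. A minimal standalone sketch (using String[] as a stand-in for 
the user's Result[]) shows the JVM rejecting the store at runtime:
{code:java}
// Sketch only: an Object[] reference that really points at a narrower array
// type refuses to hold anything but that type. This is exactly how the
// exception escapes in updateResult and decActionCounter gets skipped.
public class ArrayStoreDemo {
  public static void main(String[] args) {
    Object[] results = new String[1]; // stands in for the user's Result[]
    try {
      results[0] = new RuntimeException("boom"); // not a String
    } catch (ArrayStoreException e) {
      System.out.println("caught: " + e); // JVM rejects the covariant store
    }
  }
}
{code}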

> The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone 
> caused by ArrayStoreException
> -
>
> Key: HBASE-26027
> URL: https://issues.apache.org/jira/browse/HBASE-26027
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.2.7, 2.3.5, 2.4.4
>Reporter: Zheng Wang
>Assignee: Zheng Wang
>Priority: Major
> Fix For: 2.5.0, 2.3.6, 2.4.6
>
>
> The batch API of HTable takes a param named results for storing results or 
> exceptions; its type is Object[].
> If the user passes an array of a narrower type, e.g. 
> org.apache.hadoop.hbase.client.Result, and we need to put an exception 
> into it for some reason, then an ArrayStoreException occurs in 
> AsyncRequestFutureImpl.updateResult, 
> AsyncRequestFutureImpl.decActionCounter is skipped, and 
> AsyncRequestFutureImpl.waitUntilDone gets stuck checking 
> actionsInProgress again and again, forever.
> It would be better to add a cutoff calculated from operationTimeout instead 
> of depending only on the value of actionsInProgress.
> BTW, this issue is only for 2.x; in 3.x the implementation has been refactored.
> How to reproduce:
> 1: add sleep in RSRpcServices.multi to mock slow response
> {code:java}
> try {
>  Thread.sleep(2000);
>  } catch (InterruptedException e) {
>  e.printStackTrace();
>  }
> {code}
> 2: set time out in config
> {code:java}
> conf.set("hbase.rpc.timeout","2000");
> conf.set("hbase.client.operation.timeout","6000");
> {code}
> 3: call batch api
> {code:java}
> Table table = HbaseUtil.getTable("test");
>  byte[] cf = Bytes.toBytes("f");
>  byte[] c = Bytes.toBytes("c1");
>  List<Get> gets = new ArrayList<>();
>  for (int i = 0; i < 10; i++) {
>  byte[] rk = Bytes.toBytes("rk-" + i);
>  Get get = new Get(rk);
>  get.addColumn(cf, c);
>  gets.add(get);
>  }
>  Result[] results = new Result[gets.size()];
>  table.batch(gets, results);
> {code}
> The log will look like below:
> {code:java}
> [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - 
> id=1 error for test processing localhost,16020,1624343786295
> java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69)
>   at 
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>   at java.util.concurrent.FutureTask.run(FutureTask.java)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10  actions to 
> finish on table: test
> [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10  actions to 
> finish on 

[jira] [Resolved] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-21 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-26062.
---
Resolution: Duplicate

Thanks [~anoop.hbase]. There is ASYNC_WAL on this cluster after all (when I 
wrote the above, I thought there was none). Resolving as a duplicate of what we 
see over on HBASE-24984.

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. It's happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7fcc2b7d936c [0x7fcc2b7d92c0+0xac]J 17162 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7fcc29bc05e8 

[jira] [Commented] (HBASE-26103) conn.getBufferedMutator(tableName) leaks thread executors and other problems (for master branch)

2021-07-20 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384541#comment-17384541
 ] 

Michael Stack commented on HBASE-26103:
---

[~shahrs87] it looks like a side-effect of our removal of the AsyncProcess stuff. 
In master, we use the async client (which doesn't need the 'fake' AsyncProcess 
that used to underlie the synchronous API). The BufferedMutator is still there but 
was changed majorly by the below:

 

HBASE-21725 Implement BufferedMutator Based on AsyncBufferedMutator

 

If member variable unused, yeah purge I'd say.

 

Below is just a search on getPool usage, comparing master and 
branch-2.3/branch-2.

 
{code:java}
This is MASTER
kalashnikov:hbase.apache.git stack$ grep -r -e 'getPool(' hbase-*/src/main/java
hbase-client/src/main/java/org/apache/hadoop/hbase/client/BufferedMutatorParams.java:
  public ExecutorService getPool() {
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactingMemStore.java:
  getPool().execute(runnable);
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactingMemStore.java:
  private ThreadPoolExecutor getPool() {

THIS Is BRANCH-2.3 (which is like branch-2)
kalashnikov:hbase.apache.git stack$ git checkout 2.3
Switched to branch '2.3'
Your branch is up to date with 'origin/branch-2.3'.
kalashnikov:hbase.apache.git stack$ grep -r -e 'getPool(' hbase-*/src/main/java
hbase-client/src/main/java/org/apache/hadoop/hbase/client/HTable.java:  
ExecutorService getPool() {
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncProcess.java:
Objects.requireNonNull(task.getPool(), "The pool can't be NULL");
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java:
if (params.getPool() == null) {
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncProcessTask.java:
this(task.getPool(), task.getTableName(), task.getRowAccess(),
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncProcessTask.java:
  public ExecutorService getPool() {
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java:
this.pool = task.getPool();
hbase-client/src/main/java/org/apache/hadoop/hbase/client/BufferedMutatorImpl.java:
if (params.getPool() == null) {
hbase-client/src/main/java/org/apache/hadoop/hbase/client/BufferedMutatorImpl.java:
  this.pool = params.getPool();
hbase-client/src/main/java/org/apache/hadoop/hbase/client/BufferedMutatorImpl.java:
  ExecutorService getPool() {
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java:  
protected ExecutorService getPool() {
hbase-client/src/main/java/org/apache/hadoop/hbase/client/BufferedMutatorParams.java:
  public ExecutorService getPool() {
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactingMemStore.java:
  getPool().execute(runnable);
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactingMemStore.java:
  private ThreadPoolExecutor getPool() { {code}

> conn.getBufferedMutator(tableName) leaks thread executors and other problems 
> (for master branch)
> 
>
> Key: HBASE-26103
> URL: https://issues.apache.org/jira/browse/HBASE-26103
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client
>Affects Versions: 3.0.0-alpha-1
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>
> This is the same as HBASE-26088, but this is a separate ticket for the master branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26097) Resolve dependency conflicts of hbase-endpoint third-party libraries

2021-07-19 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383412#comment-17383412
 ] 

Michael Stack commented on HBASE-26097:
---

This is interesting. Please say more about how your endpoint works now. You've 
moved to a shaded endpoint, and you are able to send in pb 3.0 requests and it 
works? Thanks. If so, this looks like a means toward our deprecating 
hbase-endpoint?

> Resolve dependency conflicts of hbase-endpoint third-party libraries
> 
>
> Key: HBASE-26097
> URL: https://issues.apache.org/jira/browse/HBASE-26097
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 2.0.0, 2.0.6
>Reporter: zyxxoo
>Priority: Major
> Fix For: 2.0.6
>
>
> Hi, our project uses "hbase-endpoint", but its protobuf version is 2.5, 
> which conflicts with the 3.0 version in our project. I want to contribute a 
> "hbase-shaded-endpoint" module to resolve the "hbase-endpoint" dependency 
> issue in version 2.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25973) Balancer should explain progress in a better way in log

2021-07-18 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382949#comment-17382949
 ] 

Michael Stack commented on HBASE-25973:
---

[~apurt...@yahoo.com] merged the branch-2 PR.
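
For reference on the readability complaint in the description below: the 
"PT40M0.006S" string is java.time.Duration's ISO-8601 toString form. A minimal 
sketch of a friendlier rendering (just one possible formatting, with a made-up 
class name; not necessarily what the patch does):
{code:java}
// Sketch: parse the ISO-8601 duration the balancer logs and print it in a
// plain minutes/seconds form. Requires Java 9+ for toSecondsPart/toMillisPart.
import java.time.Duration;

public class DurationPretty {
  public static void main(String[] args) {
    Duration d = Duration.parse("PT40M0.006S");
    System.out.printf("%d mins, %d.%03d secs%n",
        d.toMinutes(), d.toSecondsPart(), d.toMillisPart()); // 40 mins, 0.006 secs
  }
}
{code}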

> Balancer should explain progress in a better way in log
> ---
>
> Key: HBASE-25973
> URL: https://issues.apache.org/jira/browse/HBASE-25973
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 2.3.6, 3.0.0-alpha-2, 2.4.5
>
>
> In the log, balancer logs at info level at the beginning of run:
>  {code}
> balancer.StochasticLoadBalancer: start StochasticLoadBalancer.balancer, 
> initCost=277.3479243125063, functionCost=RegionCountSkewCostFunction : 
> (500.0, 0.3749771215224234); ServerLocalityCostFunction : (25.0, 
> 0.5807483226644186); RackLocalityCostFunction : (15.0, 0.0); 
> TableSkewCostFunction : (1000.0, 0.0019704142954972883); 
> StoreFileCostFunction : (200.0, 0.3668512059459341);  computedMaxSteps: 
> 42270438200
> {code}
> The cost is reported without context; it is hard for an operator to 
> understand how unbalanced the cluster looks to the balancer and how much 
> progress we are making.
> For a large cluster, the calculation can take a long time; we also need to 
> let the operator understand that it can take up to the max time to complete 
> the calculation. 
> At the end of computation:
> {code}
> balancer.StochasticLoadBalancer: Finished computing new load balance plan. 
> Computation took PT40M0.006S to try 1036409 different iterations. Found a 
> solution that moves 161926 regions; Going from a computed cost of 
> 118.75715593924485 to a new cost of 1.5509126920967042
> {code}
> The time to compute the plan is also printed in a format that is not 
> human-readable. We also need to let the operator understand that the 
> balancer is just submitting the plan; it is up to execution to complete the 
> move.  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25973) Balancer should explain progress in a better way in log

2021-07-16 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382370#comment-17382370
 ] 

Michael Stack commented on HBASE-25973:
---

Merged to branch-2.3 and branch-2.4. Waiting on branch-2 backport to finish 
test before can close this.

> Balancer should explain progress in a better way in log
> ---
>
> Key: HBASE-25973
> URL: https://issues.apache.org/jira/browse/HBASE-25973
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.3.6, 3.0.0-alpha-2, 2.4.5
>
>
> In the log, balancer logs at info level at the beginning of run:
>  {code}
> balancer.StochasticLoadBalancer: start StochasticLoadBalancer.balancer, 
> initCost=277.3479243125063, functionCost=RegionCountSkewCostFunction : 
> (500.0, 0.3749771215224234); ServerLocalityCostFunction : (25.0, 
> 0.5807483226644186); RackLocalityCostFunction : (15.0, 0.0); 
> TableSkewCostFunction : (1000.0, 0.0019704142954972883); 
> StoreFileCostFunction : (200.0, 0.3668512059459341);  computedMaxSteps: 
> 42270438200
> {code}
> The cost is reported without context; it is hard for an operator to 
> understand how unbalanced the cluster looks to the balancer and how much 
> progress we are making.
> For a large cluster, the calculation can take a long time; we also need to 
> let the operator understand that it can take up to the max time to complete 
> the calculation. 
> At the end of computation:
> {code}
> balancer.StochasticLoadBalancer: Finished computing new load balance plan. 
> Computation took PT40M0.006S to try 1036409 different iterations. Found a 
> solution that moves 161926 regions; Going from a computed cost of 
> 118.75715593924485 to a new cost of 1.5509126920967042
> {code}
> The time to compute the plan is also printed in a format that is not 
> human-readable. We also need to let the operator understand that the 
> balancer is just submitting the plan; it is up to execution to complete the 
> move.  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25973) Balancer should explain progress in a better way in log

2021-07-16 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25973:
--
Fix Version/s: 3.0.0-alpha-2

> Balancer should explain progress in a better way in log
> ---
>
> Key: HBASE-25973
> URL: https://issues.apache.org/jira/browse/HBASE-25973
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.3.6, 3.0.0-alpha-2, 2.4.5
>
>
> In the log, balancer logs at info level at the beginning of run:
>  {code}
> balancer.StochasticLoadBalancer: start StochasticLoadBalancer.balancer, 
> initCost=277.3479243125063, functionCost=RegionCountSkewCostFunction : 
> (500.0, 0.3749771215224234); ServerLocalityCostFunction : (25.0, 
> 0.5807483226644186); RackLocalityCostFunction : (15.0, 0.0); 
> TableSkewCostFunction : (1000.0, 0.0019704142954972883); 
> StoreFileCostFunction : (200.0, 0.3668512059459341);  computedMaxSteps: 
> 42270438200
> {code}
> The cost is reported without context; it is hard for an operator to 
> understand how unbalanced the cluster looks to the balancer and how much 
> progress we are making.
> For a large cluster, the calculation can take a long time; we also need to 
> let the operator understand that it can take up to the max time to complete 
> the calculation. 
> At the end of computation:
> {code}
> balancer.StochasticLoadBalancer: Finished computing new load balance plan. 
> Computation took PT40M0.006S to try 1036409 different iterations. Found a 
> solution that moves 161926 regions; Going from a computed cost of 
> 118.75715593924485 to a new cost of 1.5509126920967042
> {code}
> The time to compute the plan is also printed in a format that is not 
> human-readable. We also need to let the operator understand that the 
> balancer is just submitting the plan; it is up to execution to complete the 
> move.  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25973) Balancer should explain progress in a better way in log

2021-07-16 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25973:
--
Fix Version/s: 2.4.5
   2.3.6

> Balancer should explain progress in a better way in log
> ---
>
> Key: HBASE-25973
> URL: https://issues.apache.org/jira/browse/HBASE-25973
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.3.6, 2.4.5
>
>
> In the log, balancer logs at info level at the beginning of run:
>  {code}
> balancer.StochasticLoadBalancer: start StochasticLoadBalancer.balancer, 
> initCost=277.3479243125063, functionCost=RegionCountSkewCostFunction : 
> (500.0, 0.3749771215224234); ServerLocalityCostFunction : (25.0, 
> 0.5807483226644186); RackLocalityCostFunction : (15.0, 0.0); 
> TableSkewCostFunction : (1000.0, 0.0019704142954972883); 
> StoreFileCostFunction : (200.0, 0.3668512059459341);  computedMaxSteps: 
> 42270438200
> {code}
> The cost is reported without context; it is hard for an operator to 
> understand how unbalanced the cluster looks to the balancer and how much 
> progress we are making.
> For a large cluster, the calculation can take a long time; we also need to 
> let the operator understand that it can take up to the max time to complete 
> the calculation. 
> At the end of computation:
> {code}
> balancer.StochasticLoadBalancer: Finished computing new load balance plan. 
> Computation took PT40M0.006S to try 1036409 different iterations. Found a 
> solution that moves 161926 regions; Going from a computed cost of 
> 118.75715593924485 to a new cost of 1.5509126920967042
> {code}
> The time to compute the plan is also printed in a format that is not 
> human-readable. We also need to let the operator understand that the 
> balancer is just submitting the plan; it is up to execution to complete the 
> move.  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26088) conn.getBufferedMutator(tableName) leaks thread executors and other problems

2021-07-16 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382327#comment-17382327
 ] 

Michael Stack commented on HBASE-26088:
---

I like the two-line removal fix for branch-2. The default BM will create a pool 
and clean it up if passed a null. Not sure how many other BM implementations 
there are, but a fat release note on the 2.5.0 release about the change in the 
param passed to the BM constructor should cover us. For 2.4 and 2.3, the 
workaround suggested above, I'd say. For 3.0.0, we should share the connection 
executor as the javadoc says. Good find [~whitney13].
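
To make the 2.3/2.4 workaround described below concrete, here is a minimal 
sketch of the shared-pool pattern (assuming a long-lived conf, conn and 
tableName as in the description's example; the shared pool should be shut down 
once at application exit, not per mutator):
{code:java}
// Sketch of the workaround below: one application-wide executor shared by
// every BufferedMutator, so no per-mutator pool is created and leaked.
ExecutorService sharedPool = HTable.getDefaultExecutor(conf);
BufferedMutatorParams params = new BufferedMutatorParams(tableName).pool(sharedPool);
try (BufferedMutator mutator = conn.getBufferedMutator(params)) {
  mutator.mutate(new Put(Bytes.toBytes("row1"))
      .addColumn(Bytes.toBytes("f"), Bytes.toBytes("c1"), Bytes.toBytes("v")));
} // close() flushes; the shared pool survives for the next mutator
{code}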

> conn.getBufferedMutator(tableName) leaks thread executors and other problems
> 
>
> Key: HBASE-26088
> URL: https://issues.apache.org/jira/browse/HBASE-26088
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 1.4.13, 2.4.4
>Reporter: Whitney Jackson
>Priority: Critical
>
> TL;DR: {{conn.getBufferedMutator(tableName)}} is dangerous in hbase client 
> 2.4.4 and doesn't match documented behavior in 1.4.13.
> To work around the problems until fixed do this:
> {code:java}
> var mySingletonPool = HTable.getDefaultExecutor(hbaseConf);
> var params = new BufferedMutatorParams(tableName);
> params.pool(mySingletonPool);
> var myMutator = conn.getBufferedMutator(params);
> {code}
> And avoid code like this:
> {code:java}
> var myMutator = conn.getBufferedMutator(tableName);
> {code}
> The full story:
> My application started leaking threads after upgrading from hbase client 
> 1.4.13 to 2.4.4. So much so that after less than a minute of runtime more 
> than 30k threads are leaked and all available virtual memory on the box (> 50 
> GB) is consumed. Other processes on the box start crashing with memory 
> allocation errors. Even running {{ls}} at the shell fails with OS resource 
> allocation failures.
> A thread dump after just a few seconds of runtime shows thousands of threads 
> like this:
> {code:java}
> "htable-pool-0" #8841 prio=5 os_prio=0 cpu=0.15ms elapsed=7.49s 
> tid=0x7efb6d2a1000 nid=0x57d2 waiting on condition [0x7ef8a6c38000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
>  - parking to wait for <0x0007e7cd6188> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.base@11.0.6/SynchronousQueue.java:462)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@11.0.6/SynchronousQueue.java:361)
>  at 
> java.util.concurrent.SynchronousQueue.poll(java.base@11.0.6/SynchronousQueue.java:937)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.6/ThreadPoolExecutor.java:1053)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1114)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628)
>  at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
> {code}
>  
> Note: All the threads are labeled {{htable-pool-0}}. That suggests we're 
> leaking thread executors not just threads. The {{htable-pool}} part indicates 
> the problem is to do with {{HTable.getDefaultExecutor(conf)}} and the only 
> part of my code that interacts with that is a call to 
> {{conn.getBufferedMutator(tableName)}}.
>  
> Looking at the hbase client code shows a few problems:
> 1) Neither 1.4.13 nor 2.4.4's behavior matches the documentation for 
> {{conn.getBufferedMutator(tableName)}} which says:
> {quote}This BufferedMutator will use the Connection's ExecutorService.
> {quote}
> That suggests some singleton thread executor is being used which is not the 
> case.
>  
> 2) Under 1.4.13 you get a new {{ThreadPoolExecutor}} for every 
> {{BufferedMutator}}. That's probably not what you want but you likely won't 
> notice. I didn't. It's a code path I hadn't profiled much.
>  
> 3) Under 2.4.4 you get a new {{ThreadPoolExecutor}} for every 
> {{BufferedMutator}} *and* that {{ThreadPoolExecutor}} *is not* cleaned up 
> after the {{Mutator}} is closed. Each completed {{ThreadPoolExecutor}} 
> carries with it one thread which hangs around until a timeout value which 
> defaults to 60 seconds.
> My application creates one {{BufferedMutator}} for every incoming stream and 
> there are lots of streams, some of them are short lived so my code leaks 
> threads fast under 2.4.4.
> Here's the part where a new executor is created for every {{BufferedMutator}} 
> (it's similar for 1.4.13):
> 

[jira] [Updated] (HBASE-25739) TableSkewCostFunction need to use aggregated deviation

2021-07-16 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25739:
--
Fix Version/s: 2.4.5
   2.3.6

> TableSkewCostFunction need to use aggregated deviation
> --
>
> Key: HBASE-25739
> URL: https://issues.apache.org/jira/browse/HBASE-25739
> Project: HBase
>  Issue Type: Sub-task
>  Components: Balancer, master
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 2.3.6, 3.0.0-alpha-2, 2.4.5
>
> Attachments: 
> TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
>  
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
>
> TableSkewCostFunction uses the sum, over all tables, of the max per-server 
> region count deviation as the measure of unevenness. It doesn't work in a 
> very common operational scenario. Say we have 100 regions on 50 nodes, two 
> on each. We 
> add 50 new nodes and they have 0 each. The max deviation from the mean is 1, 
> compared to 99 in the worst case scenario of 100 regions on a single server. 
> The normalized cost is 1/99 = 0.011 < default threshold of 0.05. Balancer 
> wouldn't move.  The proposal is to use aggregated deviation of the count per 
> region server to detect this scenario, generating a cost of 100/198 = 0.5 in 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-25739) TableSkewCostFunction need to use aggregated deviation

2021-07-16 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-25739.
---
Hadoop Flags: Reviewed
  Resolution: Fixed

Resolving after merging PRs for branch-2.3+. Thanks for the patch 
[~clarax98007] (and reviews [~busbey] and [~zhangduo]).
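
For anyone reading along, the arithmetic in the description below can be checked 
with a toy computation (a sketch of the two normalizations only, not the 
balancer's actual code):
{code:java}
// 100 regions: 50 servers hold 2 each, 50 new servers hold 0. Worst case is
// all 100 regions on one of 100 servers. Mean load is 1 region per server.
public class SkewCostSketch {
  public static void main(String[] args) {
    double mean = 100.0 / 100.0; // 1.0
    // Old measure: max deviation from the mean, normalized by the worst case.
    double maxDev = 2 - mean;            // 1
    double worstMaxDev = 100 - mean;     // 99
    System.out.println("max-dev cost = " + maxDev / worstMaxDev);   // ~0.0101
    // Proposed measure: aggregated (summed) deviation, same normalization.
    double aggDev = 50 * (2 - mean) + 50 * (mean - 0);              // 100
    double worstAggDev = (100 - mean) + 99 * mean;                  // 198
    System.out.println("agg-dev cost = " + aggDev / worstAggDev);   // ~0.505
  }
}
{code}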

> TableSkewCostFunction need to use aggregated deviation
> --
>
> Key: HBASE-25739
> URL: https://issues.apache.org/jira/browse/HBASE-25739
> Project: HBase
>  Issue Type: Sub-task
>  Components: Balancer, master
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 2.3.6, 3.0.0-alpha-2, 2.4.5
>
> Attachments: 
> TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
>  
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
>
> TableSkewCostFunction uses the sum, over all tables, of the max per-server 
> region count deviation as the measure of unevenness. It doesn't work in a 
> very common operational scenario. Say we have 100 regions on 50 nodes, two 
> on each. We 
> add 50 new nodes and they have 0 each. The max deviation from the mean is 1, 
> compared to 99 in the worst case scenario of 100 regions on a single server. 
> The normalized cost is 1/99 = 0.011 < default threshold of 0.05. Balancer 
> wouldn't move.  The proposal is to use aggregated deviation of the count per 
> region server to detect this scenario, generating a cost of 100/198 = 0.5 in 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25739) TableSkewCostFunction need to use aggregated deviation

2021-07-15 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381598#comment-17381598
 ] 

Michael Stack commented on HBASE-25739:
---

Merged the branch-2 PR. 2.4 and 2.3 seem to have related test failures.

> TableSkewCostFunction need to use aggregated deviation
> --
>
> Key: HBASE-25739
> URL: https://issues.apache.org/jira/browse/HBASE-25739
> Project: HBase
>  Issue Type: Sub-task
>  Components: Balancer, master
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2
>
> Attachments: 
> TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
>  
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
>
> TableSkewCostFunction uses the sum, over all tables, of the max per-server 
> region count deviation as the measure of unevenness. It doesn't work in a 
> very common operational scenario. Say we have 100 regions on 50 nodes, two 
> on each. We 
> add 50 new nodes and they have 0 each. The max deviation from the mean is 1, 
> compared to 99 in the worst case scenario of 100 regions on a single server. 
> The normalized cost is 1/99 = 0.011 < default threshold of 0.05. Balancer 
> wouldn't move.  The proposal is to use aggregated deviation of the count per 
> region server to detect this scenario, generating a cost of 100/198 = 0.5 in 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25739) TableSkewCostFunction need to use aggregated deviation

2021-07-15 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-25739:
--
Fix Version/s: 3.0.0-alpha-2
   2.5.0

> TableSkewCostFunction need to use aggregated deviation
> --
>
> Key: HBASE-25739
> URL: https://issues.apache.org/jira/browse/HBASE-25739
> Project: HBase
>  Issue Type: Sub-task
>  Components: Balancer, master
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2
>
> Attachments: 
> TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
>  
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
>
> TableSkewCostFunction uses the sum, over all tables, of the max per-server 
> region count deviation as the measure of unevenness. It doesn't work in a 
> very common operational scenario. Say we have 100 regions on 50 nodes, two 
> on each. We 
> add 50 new nodes and they have 0 each. The max deviation from the mean is 1, 
> compared to 99 in the worst case scenario of 100 regions on a single server. 
> The normalized cost is 1/99 = 0.011 < default threshold of 0.05. Balancer 
> wouldn't move.  The proposal is to use aggregated deviation of the count per 
> region server to detect this scenario, generating a cost of 100/198 = 0.5 in 
> this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25985) ReplicationSourceWALReader#run - Reset sleepMultiplier in loop once out of any IOE

2021-07-15 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381573#comment-17381573
 ] 

Michael Stack commented on HBASE-25985:
---

Thanks [~anoop.hbase] ... I closed out the PR in favor of HBASE-25992

> ReplicationSourceWALReader#run - Reset sleepMultiplier in loop once out of 
> any IOE
> --
>
> Key: HBASE-25985
> URL: https://issues.apache.org/jira/browse/HBASE-25985
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Anoop Sam John
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25985) ReplicationSourceWALReader#run - Reset sleepMultiplier in loop once out of any IOE

2021-07-15 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381536#comment-17381536
 ] 

Michael Stack commented on HBASE-25985:
---

[~anoop.hbase] see the PR. There are a few questions. Looks like the PR has been 
applied but needs to be reverted?

> ReplicationSourceWALReader#run - Reset sleepMultiplier in loop once out of 
> any IOE
> --
>
> Key: HBASE-25985
> URL: https://issues.apache.org/jira/browse/HBASE-25985
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Anoop Sam John
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26092) JVM core dump in the replication path

2021-07-15 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381512#comment-17381512
 ] 

Michael Stack commented on HBASE-26092:
---

With replication enabled on a ~700 node cluster, we'd lose an RS every day or so 
w/ crashes that were variants on the below (building a cellblock):
{code:java}
Stack: [0x7edc2b215000,0x7edc2b316000],  sp=0x7edc2b314480,  free 
space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
C=native code)J 12332 C2 
org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V
 (27 bytes) @ 0x7f065ada3047 [0x7f065ada2c40+0x407]J 16249 C2 
org.apache.hadoop.hbase.ipc.CellBlockBuilder.encodeCellsTo(Ljava/io/OutputStream;Lorg/apache/hadoop/hbase/CellScanner;Lorg/apache/hadoop/hbase/codec/Codec;Lorg/apache/hadoop/io/compress/CompressionCodec;)V
 (138 bytes) @ 0x7f065b716550 [0x7f065b716380+0x1d0]J 6822 C2 
org.apache.hadoop.hbase.ipc.CellBlockBuilder.buildCellBlock(Lorg/apache/hadoop/hbase/codec/Codec;Lorg/apache/hadoop/io/compress/CompressionCodec;Lorg/apache/hadoop/hbase/CellScanner;Lorg/apache/hadoop/hbase/ipc/CellBlockBuilder$OutputStreamSupplier;)Z
 (113 bytes) @ 0x7f0659917424 [0x7f0659916fc0+0x464]J 6824 C2 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.writeRequest(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Lorg/apache/hadoop/hbase/ipc/Call;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
 (370 bytes) @ 0x7f065a4041f4 [0x7f065a403fc0+0x234]J 6823 C2 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
 (30 bytes) @ 0x7f065962d414 [0x7f065962d3e0+0x34]J 5492 C2 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
 (149 bytes) @ 0x7f0659f04f48 [0x7f0659f04c60+0x2e8]J 6996 C2 
org.apache.hadoop.hbase.ipc.NettyRpcConnection$6$1.run()V (22 bytes) @ 
0x7f06599d4eec [0x7f06599d4c80+0x26c]J 27396 C2 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z
 (106 bytes) @ 0x7f065c15e660 [0x7f065c15e400+0x260]J 21998% C2 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (461 
bytes) @ 0x7f0659de9570 [0x7f0659de9000+0x570]j  
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44j
  
org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11j
  
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
 {code}
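
The description below argues the fix is reference counting on buffers that 
outlive the handler. A minimal illustrative sketch of the idea (a hypothetical 
wrapper, not HBase's actual RefCnt/ServerCall code):
{code:java}
// Sketch: the buffer is only returned to the pool when the last of the
// queued asynchronous users releases it, never while a call still holds it.
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

final class RefCountedBuffer {
  private final AtomicInteger refs = new AtomicInteger(1);
  private final ByteBuffer buf;
  private final Runnable returnToPool;

  RefCountedBuffer(ByteBuffer buf, Runnable returnToPool) {
    this.buf = buf;
    this.returnToPool = returnToPool;
  }

  ByteBuffer retain() {  // call before handing the buffer to a queued call
    refs.incrementAndGet();
    return buf;
  }

  void release() {       // each user releases once; the last one recycles
    if (refs.decrementAndGet() == 0) {
      returnToPool.run();
    }
  }
}
{code}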

> JVM core dump in the replication path
> -
>
> Key: HBASE-26092
> URL: https://issues.apache.org/jira/browse/HBASE-26092
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.3.5
>Reporter: Huaxiang Sun
>Priority: Critical
>
> When replication is turned on, we found the following core dump in the region 
> server. 
> I checked the core dump for replication. I think I got some ideas. For 
> replication, when an RS receives walEdits from a remote cluster, it needs to 
> send them out to the final RS. In this case, NettyRpcConnection is used; 
> calls are queued while they still refer to ByteBuffers owned by the 
> replication handler's context (returned to the pool once the handler 
> returns). A core dump will happen since the ByteBuffer has been reused. This 
> asynchronous processing needs ref counting.
>  
> Feel free to take it; otherwise, I will try to work on a patch later.
>  
>  
> {code:java}
> Stack: [0x7fb1bf039000,0x7fb1bf13a000],  sp=0x7fb1bf138560,  free 
> space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> J 28175 C2 
> org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I 
> (21 bytes) @ 0x7fd2663c [0x7fd263c0+0x27c]
> J 14912 C2 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.writeRequest(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Lorg/apache/hadoop/hbase/ipc/Call;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
>  (370 bytes) @ 0x7fdbbb94b590 [0x7fdbbb949c00+0x1990]
> J 14911 C2 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
>  (30 bytes) @ 0x7fdbb972d1d4 [0x7fdbb972d1a0+0x34]
> J 30476 C2 
> 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-15 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381464#comment-17381464
 ] 

Michael Stack commented on HBASE-26042:
---

{quote}if you have a heap dump from that state? Poke around the call back 
its channel ID vs closed channel and the AsyncFSOutput instance state of the 
new WAL?
{quote}
Dang. The heaps here are too big. But let me try (Might be a while given 
we've done some work to undo the provocation – the DN NPE'ing and even crashing 
on 'java.lang.NullPointerException at 
sun.nio.ch.EPollArrayWrapper.isEventsHighKilled(EPollArrayWrapper.java:174)' – 
a JDK/not-enough-fds issue). Thanks B.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and the WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we 
> will roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is 
> stuck w/o a timeout trying to write the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-15 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381416#comment-17381416
 ] 

Michael Stack commented on HBASE-26042:
---

Thanks for taking a look.

I don't have a test. I just have production logs/thread dumps (cited above).
{quote}So theoretically it should have no problem to call it outside the 
consume thread, as they should be no overlap.
{quote}
Agree. The hung WALRoller thread looks like the UT that (in violation) has 
two threads trying to do flush concurrently. Let me see if it can happen in 
practice...

 

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and the WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we 
> will roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is 
> stuck w/o a timeout trying to write the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-15 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381058#comment-17381058
 ] 

Michael Stack commented on HBASE-26042:
---

Played w/ [~bharathv]'s PR. I can manufacture one of these w/ this PR:
{code:java}
"ForkJoinPool.commonPool-worker-19" #219 daemon prio=5 os_prio=31 cpu=20.49ms 
elapsed=48.71s tid=0x7f8bf9ced800 nid=0x21a03 waiting on condition  
[0x700013869000]
   java.lang.Thread.State: WAITING (parking)
  at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
  - parking to wait for  <0x00078c0d0498> (a 
java.util.concurrent.CompletableFuture$Signaller)
  at 
java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)
  at 
java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.11/CompletableFuture.java:1796)
  at 
java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.11/ForkJoinPool.java:3118)
  at 
java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.11/CompletableFuture.java:1823)
  at 
java.util.concurrent.CompletableFuture.get(java.base@11.0.11/CompletableFuture.java:1998)
  at 
org.apache.hadoop.hbase.io.asyncfs.TestFanOutOneBlockAsyncDFSOutput.lambda$testRecover$0(TestFanOutOneBlockAsyncDFSOutput.java:155)
  at 
org.apache.hadoop.hbase.io.asyncfs.TestFanOutOneBlockAsyncDFSOutput$$Lambda$142/0x000800454c40.run(Unknown
 Source)
  at 
java.util.concurrent.CompletableFuture$AsyncRun.run$$$capture(java.base@11.0.11/CompletableFuture.java:1736)
  at 
java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.11/CompletableFuture.java)
  at 
java.util.concurrent.CompletableFuture$AsyncRun.exec(java.base@11.0.11/CompletableFuture.java:1728)
  at 
java.util.concurrent.ForkJoinTask.doExec$$$capture(java.base@11.0.11/ForkJoinTask.java:290)
  at 
java.util.concurrent.ForkJoinTask.doExec(java.base@11.0.11/ForkJoinTask.java)
  at 
java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11.0.11/ForkJoinPool.java:1020)
  at 
java.util.concurrent.ForkJoinPool.scan(java.base@11.0.11/ForkJoinPool.java:1656)
  at 
java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.11/ForkJoinPool.java:1594)
  at 
java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.11/ForkJoinWorkerThread.java:183)
 {code}
This looks like:
{code:java}
 "regionserver/ps1532:16020.logRoller" #395 daemon prio=5 os_prio=0 
tid=0x7f7c4403b800 nid=0xa2a7 waiting on condition [0x7f51f3c4c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x7f682625cc00> (a 
java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:189)
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:202)
at 
org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
at 
org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:651)
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:128)
at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:797)
at 
org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:263)
at 
org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:179){code}
In the test, there are two threads calling flush. This is in violation of the 
FanOutOneBlockAsyncDFSOutput class comment, which states it is not thread-safe – 
it is for use by the single-thread consume executor – so the two threads mess up 
each other's state (if they are sequenced, they both fail properly w/ broken 
stream exceptions), but the lockup looks similar. There is a flush called in 
AsyncProtobufLogWriter#writeMagicAndWALHeader; i.e. not by the consume thread.

Still digging.
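
For clarity on the invariant at play: writers like FanOutOneBlockAsyncDFSOutput 
are documented as not thread-safe and meant to be driven by a single consume 
thread. A small sketch of the confinement pattern (hypothetical types; not the 
HBase implementation):
{code:java}
// Sketch: funnel every flush through one single-threaded executor so two
// threads can never interleave on the non-thread-safe output's state.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ConfinedFlusher {
  interface Output { long flush() throws Exception; } // stand-in for the writer

  private final ExecutorService consumeExecutor = Executors.newSingleThreadExecutor();

  CompletableFuture<Long> flush(Output out) {
    CompletableFuture<Long> f = new CompletableFuture<>();
    consumeExecutor.execute(() -> { // only this thread ever touches 'out'
      try {
        f.complete(out.flush());
      } catch (Exception e) {
        f.completeExceptionally(e);
      }
    });
    return f;
  }
}
{code}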

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
>   

[jira] [Commented] (HBASE-24984) WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used with multi operation

2021-07-14 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380897#comment-17380897
 ] 

Michael Stack commented on HBASE-24984:
---

[~gouravk] did you mean to close the PR?

> WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used 
> with multi operation
> --
>
> Key: HBASE-24984
> URL: https://issues.apache.org/jira/browse/HBASE-24984
> Project: HBase
>  Issue Type: Bug
>  Components: rpc, wal
>Affects Versions: 2.1.6
>Reporter: Liu Junhong
>Assignee: Gaurav Kanade
>Priority: Critical
> Fix For: 2.5.0, 2.3.6, 3.0.0-alpha-2, 2.4.5
>
>
> After bugfix HBASE-22539, when a client uses BufferedMutator or a multi 
> operation, there will be one RpcCall and multiple FSWALEntry instances. When the 
> RpcCall finishes and one FSWALEntry calls release(), the remaining FSWALEntries 
> may trigger a RuntimeException or a segmentation fault.
> Should we use RefCnt instead of AtomicInteger for 
> org.apache.hadoop.hbase.ipc.ServerCall.reference?
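For illustration, the retain/release pattern being proposed might look like the 
toy sketch below (not the actual ServerCall code; RefCnt additionally gives 
netty-style leak detection, which a bare counter does not):
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Toy sketch: the RPC call starts with one reference; each FSWALEntry
// retains it; the backing buffers are recycled only when the count hits
// zero, never on the first release.
class RefCountedCall {
  private final AtomicInteger refCnt = new AtomicInteger(1);

  void retain() {            // called once per FSWALEntry created
    refCnt.incrementAndGet();
  }

  void release() {           // called by the RPC layer and by each entry
    if (refCnt.decrementAndGet() == 0) {
      freeBuffers();         // only now is recycling the DBBs safe
    }
  }

  private void freeBuffers() { /* return the ByteBuffs to the pool */ }
}
{code}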



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26036) DBB released too early and dirty data for some operations

2021-07-14 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380732#comment-17380732
 ] 

Michael Stack commented on HBASE-26036:
---

I backported this beautiful fix to 2.4.5. It wouldn't go back to branch-2.3 
cleanly, unfortunately.

> DBB released too early and dirty data for some operations
> -
>
> Key: HBASE-26036
> URL: https://issues.apache.org/jira/browse/HBASE-26036
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 3.0.0-alpha-1, 2.0.0
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.5
>
>
> Before HBASE-25187, we found there are regionserver JVM crashing problems on 
> our production clusters, the coredump infos are as follows,
> {code:java}
> Stack: [0x7f621ba8d000,0x7f621bb8e000],  sp=0x7f621bb8c0e0,  free 
> space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9 
> bytes) @ 0x7f6a5ee11b2d [0x7f6a5ee11ae0+0x4d]
> J 22844 C2 
> org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z
>  (540 bytes) @ 0x7f6a60bed144 [0x7f6a60beb320+0x1e24]
> J 17972 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z
>  (312 bytes) @ 0x7f6a5f4a7ed0 [0x7f6a5f4a6f40+0xf90]
> J 26197 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse;
>  (644 bytes) @ 0x7f6a61538b0c [0x7f6a61537940+0x11cc]
> J 26332 C2 
> org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair;
>  (566 bytes) @ 0x7f6a615e8228 [0x7f6a615e79c0+0x868]
> J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @ 
> 0x7f6a60711a4c [0x7f6a60711000+0xa4c]
> J 19656% C2 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V
>  (338 bytes) @ 0x7f6a6039a414 [0x7f6a6039a320+0xf4]
> j  org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {code}
> I have made a UT to reproduce this error; it occurs 100% of the time.
> After HBASE-25187, the check result of the checkAndMutate will be false, 
> because it reads wrong/dirty data from the released ByteBuff.
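As a toy illustration of the failure mode (plain NIO here, not HBase's pooled 
ByteBuff), a reader that keeps a reference to a buffer after it has been recycled 
sees whatever was written next:
{code:java}
import java.nio.ByteBuffer;

public class DirtyReadDemo {
  public static void main(String[] args) {
    ByteBuffer pooled = ByteBuffer.allocateDirect(8);
    pooled.putLong(0, 42L);                // the value checkAndMutate compared
    // Buffer is "released" and immediately reused for another request:
    pooled.putLong(0, -1L);
    // A late reader still holding the reference now sees dirty data:
    System.out.println(pooled.getLong(0)); // prints -1, not 42
  }
}
{code}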



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-26036) DBB released too early and dirty data for some operations

2021-07-14 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26036:
--
Fix Version/s: 2.4.5

> DBB released too early and dirty data for some operations
> -
>
> Key: HBASE-26036
> URL: https://issues.apache.org/jira/browse/HBASE-26036
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 3.0.0-alpha-1, 2.0.0
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.5
>
>
> Before HBASE-25187, we found there are regionserver JVM crashing problems on 
> our production clusters, the coredump infos are as follows,
> {code:java}
> Stack: [0x7f621ba8d000,0x7f621bb8e000],  sp=0x7f621bb8c0e0,  free 
> space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9 
> bytes) @ 0x7f6a5ee11b2d [0x7f6a5ee11ae0+0x4d]
> J 22844 C2 
> org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z
>  (540 bytes) @ 0x7f6a60bed144 [0x7f6a60beb320+0x1e24]
> J 17972 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z
>  (312 bytes) @ 0x7f6a5f4a7ed0 [0x7f6a5f4a6f40+0xf90]
> J 26197 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse;
>  (644 bytes) @ 0x7f6a61538b0c [0x7f6a61537940+0x11cc]
> J 26332 C2 
> org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair;
>  (566 bytes) @ 0x7f6a615e8228 [0x7f6a615e79c0+0x868]
> J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @ 
> 0x7f6a60711a4c [0x7f6a60711000+0xa4c]
> J 19656% C2 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V
>  (338 bytes) @ 0x7f6a6039a414 [0x7f6a6039a320+0xf4]
> j  org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {code}
> I have made a UT to reproduce this error; it occurs 100% of the time.
> After HBASE-25187, the check result of the checkAndMutate will be false, 
> because it reads wrong/dirty data from the released ByteBuff.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25739) TableSkewCostFunction need to use aggregated deviation

2021-07-13 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379967#comment-17379967
 ] 

Michael Stack commented on HBASE-25739:
---

Merged to master. Do you want to make PRs for branch-2, [~clarax98007]? (I tried 
a backport but hit failures.)

> TableSkewCostFunction need to use aggregated deviation
> --
>
> Key: HBASE-25739
> URL: https://issues.apache.org/jira/browse/HBASE-25739
> Project: HBase
>  Issue Type: Sub-task
>  Components: Balancer, master
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Attachments: 
> TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
>  
> org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
>
> TableSkewCostFunction uses, summed over all tables, the max per-server region 
> deviation as the measure of unevenness. It doesn't work in a very common 
> operational scenario. Say we have 100 regions on 50 nodes, two on each. We 
> add 50 new nodes, which have 0 each. The max deviation from the mean is 1, 
> compared to 99 in the worst-case scenario of 100 regions on a single server. 
> The normalized cost is 1/99 = 0.011 < the default threshold of 0.05, so the 
> balancer wouldn't move anything. The proposal is to use the aggregated 
> deviation of the count per region server to detect this scenario, generating 
> a cost of 100/198 = 0.5 in this case.
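A back-of-the-envelope check of those numbers (a sketch only, not the real 
TableSkewCostFunction code):
{code:java}
public class SkewCostCheck {
  public static void main(String[] args) {
    int regions = 100, servers = 100;       // 50 old nodes + 50 new, empty ones
    int[] counts = new int[servers];
    for (int i = 0; i < 50; i++) counts[i] = 2;
    double mean = (double) regions / servers; // 1.0

    double maxDev = 0, sumDev = 0;
    for (int c : counts) {
      maxDev = Math.max(maxDev, Math.abs(c - mean));
      sumDev += Math.abs(c - mean);
    }
    // Old measure: max deviation over the worst case (all on one server).
    System.out.println(maxDev / (regions - mean)); // 1/99 ~ 0.01, below 0.05
    // Proposed: aggregated deviation over its worst case, 99 + 99 = 198.
    double worstAggregate = (regions - mean) + (servers - 1) * mean;
    System.out.println(sumDev / worstAggregate);   // 100/198 ~ 0.5
  }
}
{code}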



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25973) Balancer should explain progress in a better way in log

2021-07-12 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379294#comment-17379294
 ] 

Michael Stack commented on HBASE-25973:
---

I merged the PR to the master branch but it does not go back to branch-2 cleanly. 
Do you want to put up a PR, [~clarax98007]?

> Balancer should explain progress in a better way in log
> ---
>
> Key: HBASE-25973
> URL: https://issues.apache.org/jira/browse/HBASE-25973
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>
> In the log, the balancer logs at INFO level at the beginning of a run:
>  {code}
> balancer.StochasticLoadBalancer: start StochasticLoadBalancer.balancer, 
> initCost=277.3479243125063, functionCost=RegionCountSkewCostFunction : 
> (500.0, 0.3749771215224234); ServerLocalityCostFunction : (25.0, 
> 0.5807483226644186); RackLocalityCostFunction : (15.0, 0.0); 
> TableSkewCostFunction : (1000.0, 0.0019704142954972883); 
> StoreFileCostFunction : (200.0, 0.3668512059459341);  computedMaxSteps: 
> 42270438200
> {code}
> the cost is reported without context; it is hard for an operator to understand 
> how unbalanced the cluster looks to the balancer and how much progress we are 
> making.
> For a large cluster the calculation can take a long time; we also need to let 
> the operator understand that it may take up to the max time to complete the 
> calculation. 
> At the end of computation:
> {code}
> balancer.StochasticLoadBalancer: Finished computing new load balance plan. 
> Computation took PT40M0.006S to try 1036409 different iterations. Found a 
> solution that moves 161926 regions; Going from a computed cost of 
> 118.75715593924485 to a new cost of 1.5509126920967042
> {code}
> The time to compute the plan is also printed in a format that is not human 
> readable. We also need to let the operator understand that the balancer is just 
> submitting the plan and it is up to execution to complete the move.  
>  
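For instance, parsing the ISO-8601 duration above and rendering it in a friendlier 
form is a one-liner with java.time (an illustrative sketch, not the actual patch):
{code:java}
import java.time.Duration;

public class HumanDuration {
  public static void main(String[] args) {
    Duration d = Duration.parse("PT40M0.006S");
    // Render as "40m 0.006s" instead of the raw ISO-8601 form.
    System.out.printf("%dm %d.%03ds%n",
        d.toMinutes(), d.getSeconds() % 60, d.toMillis() % 1000);
  }
}
{code}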



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM

2021-07-09 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378168#comment-17378168
 ] 

Michael Stack commented on HBASE-25720:
---

[~Xiaolin Ha] I ask because I'm looking at a related issue around AsyncFSWAL – 
HBASE-26042. Thanks.

> Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
> --
>
> Key: HBASE-25720
> URL: https://issues.apache.org/jira/browse/HBASE-25720
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.4.13
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Attachments: prepare-flush-cache-stuck.png
>
>
> We call HRegion#doSyncOfUnflushedWALChanges when preparing to flush the cache. 
> But this WAL sync may get stuck and abort the flush of the cache. 
> !prepare-flush-cache-stuck.png|width=519,height=246!
> If we are not aware of this problem in time, the RS will be OOM-killed.
> I think we should force-abort the RS when the sync gets stuck in the prepare 
> phase, like in committing snapshots.
>  
>  
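The proposal amounts to bounding the prepare-phase sync and aborting on timeout. 
A hedged sketch (illustrative names such as syncFuture and Abortable, not the 
actual patch):
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PrepareFlushSync {
  interface Abortable { void abort(String why, Throwable cause); }

  // Sketch: bound the WAL sync done during flush prepare and force an
  // abort rather than letting an unflushable region OOM the server.
  static void syncOrAbort(CompletableFuture<Long> syncFuture,
      long timeoutMs, Abortable server) {
    try {
      syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException te) {
      server.abort("WAL sync stuck during flush prepare", te);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
{code}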



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM

2021-07-08 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377532#comment-17377532
 ] 

Michael Stack commented on HBASE-25720:
---

Anything in the log before your png that shows perhaps how or why the WAL 
system is stuck? A jstack? Thanks, [~Xiaolin Ha]

> Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
> --
>
> Key: HBASE-25720
> URL: https://issues.apache.org/jira/browse/HBASE-25720
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.4.13
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Attachments: prepare-flush-cache-stuck.png
>
>
> We call HRegion#doSyncOfUnflushedWALChanges when preparing to flush the cache. 
> But this WAL sync may get stuck and abort the flush of the cache. 
> !prepare-flush-cache-stuck.png|width=519,height=246!
> If we are not aware of this problem in time, the RS will be OOM-killed.
> I think we should force-abort the RS when the sync gets stuck in the prepare 
> phase, like in committing snapshots.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376820#comment-17376820
 ] 

Michael Stack commented on HBASE-26042:
---

[~bharathv] let me try. Sweet.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing the same exception 
> in the above test (on Linux, because epoll is needed to repro) but it all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is 
> stuck w/o a timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 
> 

[jira] [Comment Edited] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376817#comment-17376817
 ] 

Michael Stack edited comment on HBASE-26042 at 7/7/21, 8:20 PM:


Tried w/ 2.3.5 and 2.4.3. Production has 2.3.5. The wal roll 'fixes' the 
pipeline. I can't repro the abort/hang.


was (Author: stack):
Tried w/ 2.3.5 and 2.4.3. The wal roll 'fixes' the pipeline. I can't repro the 
abort/hang.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing the same exception 
> in the above test (on Linux, because epoll is needed to repro) but it all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is 
> stuck w/o a timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376817#comment-17376817
 ] 

Michael Stack commented on HBASE-26042:
---

Tried w/ 2.3.5 and 2.4.3. The wal roll 'fixes' the pipeline. I can't repro the 
abort/hang.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing the same exception 
> in the above test (on Linux, because epoll is needed to repro) but it all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is 
> stuck w/o a timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376749#comment-17376749
 ] 

Michael Stack commented on HBASE-26042:
---

Reproduced by killing a non-local DN:
{code:java}
2021-07-07 16:50:41,361 WARN 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
readAddress(..) failed: Connection reset by peer {code}
It recovers fine though; the sequence that locks up the WAL must be more 
involved. Need more info.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing the same exception 
> in the above test (on Linux, because epoll is needed to repro) but it all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is 
> stuck w/o a timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 

[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-07 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376703#comment-17376703
 ] 

Michael Stack commented on HBASE-25761:
---

Pardon my misunderstanding [~bharathv]

Let's just keep the time posted on the dev list: 5pm our time and 8AM Duo's. 
I'll keep notes and post them after.

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat, but then Francis noticed on a re-read of the BigTable 
> paper that this is how they describe doing 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so we can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2021-07-07 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376684#comment-17376684
 ] 

Michael Stack commented on HBASE-26042:
---

Tried reproducing
{code:java}
2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
wal.AsyncFSWAL: sync 
failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
 readAddress(..) failed: Connection reset by peer  {code}
... by killing DNs. There is a complaint about a broken stream, but then 
everything gets neatly closed up and, after a log roll, all is good again.

Pausing the DN process got me these:
{code:java}
2021-07-07 14:44:38,471 WARN 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
java.io.IOException: Timeout(6ms) waiting for response
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$userEventTriggered$4(FanOutOneBlockAsyncDFSOutput.java:300)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:233)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:98)
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.userEventTriggered(FanOutOneBlockAsyncDFSOutput.java:299)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
at 
org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:365)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
at 
org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelIdle(IdleStateHandler.java:371)
at 
org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler$ReaderIdleTimeoutTask.run(IdleStateHandler.java:504)
at 
org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:476)
at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at 
org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 {code}
...not the same as what I want. Here the exception has a nice stack trace and 
is being 'handled'.

Related, on another cluster, where fds were 64k and I'd loaded up a single RS 
w/ hundreds of Regions, I saw this:
{code:java}
2021-07-07 00:59:27,372 WARN 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline: An 
exceptionCaught() event was fired, and it reached at the tail of the pipeline. 
It usually means the last handler in the pipeline did not handle the 
exception.org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
 accept(..) failed: Too many open files {code}
Indeed, lsof confirmed fds were at the limit. This WARN looks a bit like the one 
I'm trying to deal w/ here, w/ some nice extra info out of netty.

[~huaxiangsun] found CASSANDRA-13649, which has similar-looking WARNs from netty 
and attributes the WARNs showing up in the logs to the lack of an exception 
handler on the netty pipeline.

We don't have an exception handler on our server-side pipeline. Adding one that 
does cleanup might prevent 
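For reference, such a catch-all tail handler might look like this (a sketch 
assuming the hbase-thirdparty netty 4.x shading; not the committed fix):
{code:java}
import org.apache.hbase.thirdparty.io.netty.channel.ChannelHandlerContext;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter;

// Installed last in the pipeline so exceptions no longer fall off the tail:
// pipeline.addLast("tailCatch", new TailExceptionHandler());
public class TailExceptionHandler extends ChannelInboundHandlerAdapter {
  @Override
  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    // Log once instead of netty's "reached at the tail" WARN, then clean up
    // the connection so fds are not leaked.
    System.err.println("Unhandled pipeline exception: " + cause);
    ctx.close();
  }
}
{code}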

[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-07 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376682#comment-17376682
 ] 

Michael Stack commented on HBASE-25761:
---

[~bharathv] If we start at 4:15, can you be online for start of meeting?

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat, but then Francis noticed on a re-read of the BigTable 
> paper that this is how they describe doing 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so we can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-06 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376228#comment-17376228
 ] 

Michael Stack commented on HBASE-25761:
---

[~bharathv] we could start at 4:15? Duo can accommodate. I pinged Francis and 
he is good. Shout if it works for you and I'll update the dev list. Thanks.

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat, but then Francis noticed on a re-read of the BigTable 
> paper that this is how they describe doing 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so we can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25701) RegionServer JVM crash when append wal entry

2021-07-06 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376223#comment-17376223
 ] 

Michael Stack commented on HBASE-25701:
---

Linking HBASE-26062 

> RegionServer JVM crash when append wal entry
> 
>
> Key: HBASE-25701
> URL: https://issues.apache.org/jira/browse/HBASE-25701
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.6
>Reporter: Juanjuan Tian 
>Priority: Major
>
> Region Server JVM crash when append wal entry,  JVM crash log:
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # EXCEPTION_ACCESS_VIOLATION (0xc005) at pc=0x027af93f, 
> pid=17992, tid=0x2d54
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_202-b08) (build 
> 1.8.0_202-b08)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.202-b08 mixed mode 
> windows-amd64 compressed oops)
> # Problematic frame:
> # J 10214 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getFamilyLength()B (9 
> bytes) @ 0x027af93f [0x027af860+0xdf] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-06 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376127#comment-17376127
 ] 

Michael Stack commented on HBASE-25761:
---

[~bharathv] Sorry. Was meant for you (should have checked the auto-complete). 
Yes. Join when you can. Would be great to have your input. Should I try and do 
it an hour later?

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat, but then Francis noticed on a re-read of the BigTable 
> paper that this is how they describe doing 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so we can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-06 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375777#comment-17375777
 ] 

Michael Stack commented on HBASE-25761:
---

Let's do Thursday in China – 8AM your time and 5PM PST. Let me put up a notice 
on the dev list.

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat, but then Francis noticed on a re-read of the BigTable 
> paper that this is how they describe doing 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so we can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25761) POC: hbase:meta,,1 as ROOT

2021-07-05 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374961#comment-17374961
 ] 

Michael Stack commented on HBASE-25761:
---

[~toffer] Thanks. Let's do a call Thursday if it works for folks – around 5pm, so 
it works for [~zhangduo]. [~toffer], [~zhangduo], [~bhrd], [~vjasani]? If OK by 
you lot, I'll put up a calendar invite and announce on dev. The topic would be the 
Francis one-pager as the way to go forward or not (if so, work hard to land the 
basics before a 3.0.0 beta). Thanks.

> POC: hbase:meta,,1 as ROOT
> --
>
> Key: HBASE-25761
> URL: https://issues.apache.org/jira/browse/HBASE-25761
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Michael Stack
>Assignee: Francis Christopher Liu
>Priority: Major
>
> One of the proposals up in the split-meta design doc suggests a 
> sleight-of-hand where the current hard-coded hbase:meta,,1 Region is 
> leveraged to serve as first Region of a split hbase:meta but also does 
> double-duty as 'ROOT'. This suggestion was put aside as a complicating 
> recursion in chat, but then Francis noticed on a re-read of the BigTable 
> paper that this is how they describe doing 'ROOT': "The root tablet is 
> just the first tablet in the METADATA table, but is treated specially -- it 
> is never split..."
> This issue is for playing around with this notion to see what the problems 
> are so we can do a better description of this approach here, in the design:
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit?ts=606c120f#heading=h.ikbhxlcthjle



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26036) DBB released too early and dirty data for some operations

2021-07-05 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374959#comment-17374959
 ] 

Michael Stack commented on HBASE-26036:
---

I was going to ask how we guarantee integrity between the return from HRegion#get 
and the copy onto the wire back to the client over RPC – and we can't, per 
[~anoop.hbase]'s helpful explanation. As suggested here and in the PR, it needs a 
big fat WARNING. Good stuff.

> DBB released too early and dirty data for some operations
> -
>
> Key: HBASE-26036
> URL: https://issues.apache.org/jira/browse/HBASE-26036
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 3.0.0-alpha-1, 2.0.0
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Critical
>
> Before HBASE-25187, we found there are regionserver JVM crashing problems on 
> our production clusters, the coredump infos are as follows,
> {code:java}
> Stack: [0x7f621ba8d000,0x7f621bb8e000],  sp=0x7f621bb8c0e0,  free 
> space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9 
> bytes) @ 0x7f6a5ee11b2d [0x7f6a5ee11ae0+0x4d]
> J 22844 C2 
> org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z
>  (540 bytes) @ 0x7f6a60bed144 [0x7f6a60beb320+0x1e24]
> J 17972 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z
>  (312 bytes) @ 0x7f6a5f4a7ed0 [0x7f6a5f4a6f40+0xf90]
> J 26197 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse;
>  (644 bytes) @ 0x7f6a61538b0c [0x7f6a61537940+0x11cc]
> J 26332 C2 
> org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair;
>  (566 bytes) @ 0x7f6a615e8228 [0x7f6a615e79c0+0x868]
> J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @ 
> 0x7f6a60711a4c [0x7f6a60711000+0xa4c]
> J 19656% C2 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V
>  (338 bytes) @ 0x7f6a6039a414 [0x7f6a6039a320+0xf4]
> j  org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {code}
> I have made a UT to reproduce this error; it occurs 100% of the time.
> After HBASE-25187, the check result of the checkAndMutate will be false, 
> because it reads wrong/dirty data from the released ByteBuff.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-05 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374957#comment-17374957
 ] 

Michael Stack commented on HBASE-26062:
---

{quote}...so should not be related to Scan/Lease expiry right?
{quote}
[~anoop.hbase] makes sense... thanks

It must be DBB compares; byte array compares are not going to be backed w/ 
someone else's memory.

I think heap usage is high here. I see some 1-3 second GC pausing. I'm pretty 
sure there's no async_wal, but let me check (lots of clients...). Thanks for 
taking a look.

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. It's happened a few times on one of our 
> clusters here. Below are two examples. Need more detail, but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7fcc2b7d936c 

[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-02 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373857#comment-17373857
 ] 

Michael Stack commented on HBASE-26062:
---

The only things around the crash are these:

2021-07-01 02:31:57,075 INFO [regionserver/ps1366:16020.leaseChecker] 
regionserver.RSRpcServices: Scanner 6368110781038228097 lease expired on region 
X,a@\x00\x1F\xEC"\xC1\xD2\xA8nFrq\xE3P\xAD,1615345092618.8637491069310348cd6667c89af4b16b.

Close of the scanner releases buffer later used by AsyncFSWAL#consume?
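
If that theory holds, the shape of the bug would be something like the sketch
below (made-up names, not HBase code); it just shows a reader that never
retained a refcounted buffer racing with the release:
{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the suspected race: the lease checker releases a
// refcounted buffer while the WAL consume thread still reads Cells backed
// by it. None of these names are real HBase classes.
public class LeaseReleaseRace {
  static class RefCountedBuffer {
    final AtomicInteger refCnt = new AtomicInteger(1); // the scanner's ref
    final ByteBuffer data = ByteBuffer.allocateDirect(64);

    void release() {
      if (refCnt.decrementAndGet() == 0) {
        // A real allocator recycles the direct memory here; a late reader
        // then dereferences freed memory -> SIGSEGV in native code.
        System.out.println("buffer recycled");
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    RefCountedBuffer buf = new RefCountedBuffer();
    // WAL consume side: reads the buffer without ever calling retain().
    Thread consume = new Thread(() -> buf.data.getLong(0));
    // Lease checker side: scanner lease expired; close() drops the last ref.
    Thread leaseChecker = new Thread(buf::release);
    leaseChecker.start();
    consume.start();
    leaseChecker.join();
    consume.join();
  }
}
{code}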

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. It's happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7fcc2b7d936c 

[jira] [Comment Edited] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-02 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373806#comment-17373806
 ] 

Michael Stack edited comment on HBASE-26062 at 7/3/21, 12:14 AM:
-

Made this an issue (was a sub-issue of HBASE-26042). I don't think it's related 
to HBASE-26042 now. I think it's something else.

TODO: See if there were rpc timeouts around the time of this crash. If so, try 
to see if the inbound rpc had a trailing cellblock sidecar. If so, perhaps this 
code is suspect in ServerRpcConnection:
{code:java}
if (header.hasCellBlockMeta()) {
  // The cellblock sidecar starts at offset within the request buffer.
  buf.position(offset);
  // duplicate() shares the backing memory; only position/limit are copied.
  ByteBuff dup = buf.duplicate();
  dup.limit(offset + header.getCellBlockMeta().getLength());
  // The CellScanner reuses the request buffer rather than copying it, so the
  // decoded Cells stay valid only as long as that buffer stays unreleased.
  cellScanner = this.rpcServer.cellBlockBuilder.createCellScannerReusingBuffers(
    this.codec, this.compressionCodec, dup);
} {code}
 

Update: took a look back at one of the crashes. In the server-side logs at 
least, just a struggling server... lots of slow syncs and, a couple of minutes 
back, a GC pause of three seconds; else nothing untoward.


was (Author: stack):
Made this an issue. I don't think it's related to HBASE-26042. I think it's 
something else.

TODO: See if there were rpc timeouts around the time of this crash. If so, try 
to see if the inbound rpc had a trailing cellblock sidecar. If so, perhaps this 
code is suspect in ServerRpcConnection:
{code:java}
if (header.hasCellBlockMeta()) {
  buf.position(offset);
  ByteBuff dup = buf.duplicate();
  dup.limit(offset + header.getCellBlockMeta().getLength());
  cellScanner = this.rpcServer.cellBlockBuilder.createCellScannerReusingBuffers(
  this.codec, this.compressionCodec, dup);
} {code}

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. It's happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  

[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-02 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373806#comment-17373806
 ] 

Michael Stack commented on HBASE-26062:
---

Made this an issue. I don't think it's related to HBASE-26042. I think it's 
something else.

TODO: See if there were rpc timeouts around the time of this crash. If so, try 
to see if the inbound rpc had a trailing cellblock sidecar. If so, perhaps this 
code is suspect in ServerRpcConnection:
{code:java}
if (header.hasCellBlockMeta()) {
  buf.position(offset);
  ByteBuff dup = buf.duplicate();
  dup.limit(offset + header.getCellBlockMeta().getLength());
  cellScanner = this.rpcServer.cellBlockBuilder.createCellScannerReusingBuffers(
  this.codec, this.compressionCodec, dup);
} {code}

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. It's happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> 

[jira] [Updated] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-02 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-26062:
--
Parent: (was: HBASE-26036)
Issue Type: Bug  (was: Sub-task)

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. It's happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7fcc2b7d936c [0x7fcc2b7d92c0+0xac]J 17162 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7fcc29bc05e8 [0x7fcc29bbfe80+0x768]J 16934 C2 
> 

[jira] [Created] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-02 Thread Michael Stack (Jira)
Michael Stack created HBASE-26062:
-

 Summary: SIGSEGV in AsyncFSWAL consume
 Key: HBASE-26062
 URL: https://issues.apache.org/jira/browse/HBASE-26062
 Project: HBase
  Issue Type: Sub-task
Reporter: Michael Stack


Seems related to the parent issue. It's happened a few times on one of our 
clusters here. Below are two examples. Need more detail but perhaps the call 
has timed out, the buffer has thus been freed, but the late consume on the 
other side of the ringbuffer doesn't know that and goes ahead (Just 
speculation).
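
Before the two crash examples, a minimal sketch of that speculated sequence, 
with a plain queue standing in for the ring buffer and made-up names (not 
HBase code):
{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the producer publishes an append, the rpc call then
// times out and frees the payload, and the consumer drains it later.
public class LateConsume {
  static class Append {
    final ByteBuffer payload = ByteBuffer.allocateDirect(32);
    final AtomicBoolean freed = new AtomicBoolean(false);
  }

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<Append> ringBuffer = new ArrayBlockingQueue<>(16);
    Append append = new Append();
    ringBuffer.put(append);      // producer publishes the append

    append.freed.set(true);      // call timed out; stand-in for a real free()

    // Consumer side: drains later, unaware the payload is already gone.
    Append late = ringBuffer.take();
    if (late.freed.get()) {
      // In the real crash there is no such flag to check: consume() just
      // reads the recycled direct memory and the JVM dies with SIGSEGV.
      System.out.println("consumed an already-freed append");
    }
  }
}
{code}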

 
{code:java}
#  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700

RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
oopjava.nio.DirectByteBuffer - klass: 
'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
during error reporting (printing register info), id 0xb]

Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
C=native code)J 23901 C2 
java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
 (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
 (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
 (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 bytes) 
@ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
 (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
[libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, 
Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
[libpthread.so.0+0x76ba]  start_thread+0xca {code}
 

This one is from a month previous and has a deeper stack... we're trying to 
read a Cell...

 
{code:java}
Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
C=native code)J 30665 C2 
org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
 (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
 (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
 (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
 (79 bytes) @ 0x7fcc2b7d936c [0x7fcc2b7d92c0+0xac]J 17162 C2 
java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
 (7 bytes) @ 0x7fcc29bc05e8 [0x7fcc29bbfe80+0x768]J 16934 C2 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
 (10 bytes) @ 0x7fcc2bb313f8 [0x7fcc2bb30c60+0x798]J 30732 C2 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
bytes) @ 0x7fcc2ae5a420 [0x7fcc2ae59d60+0x6c0]J 22203 C2 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 

[jira] [Commented] (HBASE-26036) DBB released too early in HRegion.get() and dirty data for some operations

2021-07-01 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373158#comment-17373158
 ] 

Michael Stack commented on HBASE-26036:
---

Sweet. Thanks [~Xiaolin Ha] for the explanation.

I tried [https://github.com/apache/hbase/pull/3449]. Is it supposed to fail? It 
doesn't for me on linux/mac hbase-2.3. I hacked the 
[#3436|https://github.com/apache/hbase/pull/3436] PR down to only the test and 
the support for an alternate BYTEBUFF_ALLOCATOR_CLASS, and it fails. Nice.
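
For illustration, the dirty read the description below reports could look like 
this sketch (hypothetical names, not the actual HBase code path): a Cell keeps 
pointing into a pooled buffer after the region operation has already returned 
the buffer to the pool, so the Cell reads the next tenant's bytes.
{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical sketch, not HBase code: the "cell" is just a reference into
// a pooled buffer that gets released too early and then reused.
public class DirtyRead {
  public static void main(String[] args) {
    ArrayDeque<ByteBuffer> pool = new ArrayDeque<>();
    ByteBuffer cellBacking = ByteBuffer.allocateDirect(8);
    cellBacking.putLong(0, 42L);     // the value checkAndMutate should see

    pool.push(cellBacking);          // released too early, before the check

    ByteBuffer reused = pool.pop();  // the next request gets the same buffer
    reused.putLong(0, -1L);          // ...and overwrites the bytes

    // The old cell still reads through its stale reference: dirty data, so
    // the check compares against -1 instead of 42 and returns false.
    System.out.println(cellBacking.getLong(0)); // prints -1
  }
}
{code}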

> DBB released too early in HRegion.get() and dirty data for some operations
> --
>
> Key: HBASE-26036
> URL: https://issues.apache.org/jira/browse/HBASE-26036
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 3.0.0-alpha-1, 2.0.0
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Critical
>
> Before HBASE-25187, we found regionserver JVM crash problems on our 
> production clusters; the coredump info is as follows:
> {code:java}
> Stack: [0x7f621ba8d000,0x7f621bb8e000],  sp=0x7f621bb8c0e0,  free 
> space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9 
> bytes) @ 0x7f6a5ee11b2d [0x7f6a5ee11ae0+0x4d]
> J 22844 C2 
> org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z
>  (540 bytes) @ 0x7f6a60bed144 [0x7f6a60beb320+0x1e24]
> J 17972 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z
>  (312 bytes) @ 0x7f6a5f4a7ed0 [0x7f6a5f4a6f40+0xf90]
> J 26197 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse;
>  (644 bytes) @ 0x7f6a61538b0c [0x7f6a61537940+0x11cc]
> J 26332 C2 
> org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair;
>  (566 bytes) @ 0x7f6a615e8228 [0x7f6a615e79c0+0x868]
> J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @ 
> 0x7f6a60711a4c [0x7f6a60711000+0xa4c]
> J 19656% C2 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V
>  (338 bytes) @ 0x7f6a6039a414 [0x7f6a6039a320+0xf4]
> j  org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {code}
> I have made a UT to reproduce this error; it occurs 100% of the time.
> After HBASE-25187, the check result of the checkAndMutate will be false, 
> because it reads wrong/dirty data from the released ByteBuff.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

