[jira] [Commented] (HBASE-27126) Support multi-threads cleaner for MOB files

2024-03-11 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825210#comment-17825210
 ] 

Xiaolin Ha commented on HBASE-27126:


Hi [~chandrasekhar.k] , please feel free to take it. 

> Support multi-threads cleaner for MOB files
> ---
>
> Key: HBASE-27126
> URL: https://issues.apache.org/jira/browse/HBASE-27126
> Project: HBase
>  Issue Type: Improvement
>  Components: mob
>Affects Versions: 2.4.12
>Reporter: Xiaolin Ha
>Priority: Major
> Fix For: 3.0.0-beta-2
>
>
> Just like the multi-threaded hfile cleaner.
> When there are many tables that have MOB files, a single thread for cleaning 
> them is not enough. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-27126) Support multi-threads cleaner for MOB files

2024-03-11 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha reassigned HBASE-27126:
--

Assignee: Chandra Sekhar K

> Support multi-threads cleaner for MOB files
> ---
>
> Key: HBASE-27126
> URL: https://issues.apache.org/jira/browse/HBASE-27126
> Project: HBase
>  Issue Type: Improvement
>  Components: mob
>Affects Versions: 2.4.12
>Reporter: Xiaolin Ha
>Assignee: Chandra Sekhar K
>Priority: Major
> Fix For: 3.0.0-beta-2
>
>
> Just like the multi-threaded hfile cleaner.
> When there are many tables that have MOB files, a single thread for cleaning 
> them is not enough. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28028) Read all compressed bytes to a byte array before submitting them to decompressor

2024-03-10 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825149#comment-17825149
 ] 

Xiaolin Ha commented on HBASE-28028:


Hi [~bbeaudreault], it has been running stably in our online clusters for 
months. I think it's much more stable than before for WAL compression + 
replication. Thanks to [~zhangduo] and [~apurtell] for their contributions!

> Read all compressed bytes to a byte array before submitting them to 
> decompressor
> 
>
> Key: HBASE-28028
> URL: https://issues.apache.org/jira/browse/HBASE-28028
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-26816) Fix CME in ReplicationSourceManager

2024-01-29 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812126#comment-17812126
 ] 

Xiaolin Ha commented on HBASE-26816:


OK, got it. Thanks, [~bbeaudreault] .

> Fix CME in ReplicationSourceManager
> ---
>
> Key: HBASE-26816
> URL: https://issues.apache.org/jira/browse/HBASE-26816
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.4.10
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-3, 2.4.11, 2.5.8
>
>
> Exception in thread "regionserver/hostname/ip:port" 
> java.util.ConcurrentModificationException
>         at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>         at java.util.ArrayList$Itr.next(ArrayList.java:851)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:832)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:162)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:155)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2623)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1175)
>         at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28227) Tables to which Stripe Compaction policy is applied cannot be forced to trigger Major Compaction.

2024-01-28 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811730#comment-17811730
 ] 

Xiaolin Ha commented on HBASE-28227:


I think it is by the original design that the stripe store engine does not 
support a major compaction of entire files, because it performs major 
compactions within each stripe. More info is in the design doc of HBASE-7667.

Do you want to trigger a major compaction that erases all the existing stripe 
info, or just run majors within the existing stripes? If the problem is 
bulkloads or large files being skipped by the ExploringCompactionPolicy, there 
are some compaction configs that can be adjusted to resolve it. But if you want 
to erase the existing stripe info, you can provide a new compaction request 
type instead of reusing `selectSingleStripeCompaction` in the PR. Just some 
advice, thanks.
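For reference, the large-file check discussed here works roughly like the sketch below. It is illustrative only, not the actual ExploringCompactionPolicy code, and the helper name and the 5G/8G example numbers are made up: a candidate file larger than the ratio-weighted size of the other candidates is dropped from selection, so a stripe dominated by one big file never gets picked for the requested major compaction.
{code:java}
// Illustrative check mirroring the condition quoted in the issue description,
// not the actual ExploringCompactionPolicy implementation.
public class TooLargeFileFilterSketch {
  static boolean skippedAsTooLarge(long fileSize, long totalFileSize, double compactionRatio) {
    // a file is excluded when fileSize > (totalFileSize - fileSize) * ratio
    return fileSize > (totalFileSize - fileSize) * compactionRatio;
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // One 5 GB file among candidates totalling 8 GB, default ratio 1.2:
    // 5 GB > (8 GB - 5 GB) * 1.2 = 3.6 GB, so it never participates.
    System.out.println(skippedAsTooLarge(5 * gb, 8 * gb, 1.2)); // true
  }
}
{code}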

> Tables to which Stripe Compaction policy is applied cannot be forced to 
> trigger Major Compaction.
> -
>
> Key: HBASE-28227
> URL: https://issues.apache.org/jira/browse/HBASE-28227
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 2.2.6
>Reporter: longping_jie
>Priority: Major
>
> There is a table to which the Stripe Compaction strategy is applied. Each 
> region is about 40G on average and is divided into 8 stripes of 5G each. The 
> business deletes a large amount of data, but manually triggering a major 
> compaction on the entire table or on a single region does not work: nothing 
> gets selected.
> After reading the source code, the compaction policy applied within each 
> stripe is ExploringCompactionPolicy. This policy has a key point: it filters 
> the store file list of a single stripe, and any candidate file that is too 
> large, i.e. that meets the condition fileSize > (totalFileSize - fileSize) * 
> hbase.hstore.compaction.ratio (default value 1.2), is filtered out and will 
> not participate in the major compaction.
> It is necessary to support a forced compaction mechanism. For scenarios where 
> a large amount of data is deleted, or where bulkloaded files exist, one could 
> explicitly pass a parameter such as forceMajor when manually triggering the 
> major compaction, and then perform a forced major compaction per stripe to 
> support the data clean-up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28197) The configs for using meta replica can conflict

2023-11-12 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-28197:
--

 Summary: The configs for using meta replica can conflict
 Key: HBASE-28197
 URL: https://issues.apache.org/jira/browse/HBASE-28197
 Project: HBase
  Issue Type: Bug
  Components: meta replicas
Affects Versions: 2.5.6, 3.0.0-alpha-4
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


The config of "hbase.locator.meta.replicas.mode" can surpass switching off 
reading from meta replica by "hbase.meta.replicas.use".

UTs by setting hbase.meta.replicas.use=false and 
hbase.locator.meta.replicas.mode=LoadBalance can recur the problem.
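A minimal sketch of the conflicting settings, using only the two configuration keys named above; the rest is ordinary HBase configuration plumbing and not the actual UT:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MetaReplicaConfigConflictSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Reading from meta replicas is switched off here...
    conf.setBoolean("hbase.meta.replicas.use", false);
    // ...but the locator mode still turns it on, which is the reported conflict.
    conf.set("hbase.locator.meta.replicas.mode", "LoadBalance");
  }
}
{code}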



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27766) Support steal job queue mode for read RPC queues of RWQueueRpcExecutor

2023-10-09 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773227#comment-17773227
 ] 

Xiaolin Ha commented on HBASE-27766:


Using YCSB for a performance test with params: -p maxexecutiontime=600 -p 
clientbuffering=false -p readproportion=0.9 -p scanproportion=0.1 -p 
readallfields=true -threads 400 -s, on two regionservers, against a table of 
100 regions with a per-region size of about 2GB (uncompressed), the results 
are as follows,
with steal queue:
[OVERALL], RunTime(ms), 600196
[OVERALL], Throughput(ops/sec), 10980.414731187811
[TOTAL_GCS_PS_Scavenge], Count, 327
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 9728
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 1.6208038707355599
[TOTAL_GCS_PS_MarkSweep], Count, 2
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 243
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.04048677432038867
[TOTAL_GCs], Count, 329
[TOTAL_GC_TIME], Time(ms), 9971
[TOTAL_GC_TIME_%], Time(%), 1.6612906450559481
[READ], Operations, 5932076
[READ], AverageLatency(us), 19541.46970419799
[READ], MinLatency(us), 126
[READ], MaxLatency(us), 1187839
[READ], 95thPercentileLatency(us), 50463
[READ], 99thPercentileLatency(us), 130111
[READ], Return=OK, 5932076
[CLEANUP], Operations, 800
[CLEANUP], AverageLatency(us), 39.74125
[CLEANUP], MinLatency(us), 0
[CLEANUP], MaxLatency(us), 27807
[CLEANUP], 95thPercentileLatency(us), 13
[CLEANUP], 99thPercentileLatency(us), 32
[SCAN], Operations, 658325
[SCAN], AverageLatency(us), 187782.97140622034
[SCAN], MinLatency(us), 526
[SCAN], MaxLatency(us), 1017855
[SCAN], 95thPercentileLatency(us), 333055
[SCAN], 99thPercentileLatency(us), 404991
[SCAN], Return=OK, 658325
without steal queue:
[OVERALL], RunTime(ms), 600232
[OVERALL], Throughput(ops/sec), 10738.177904543576
[TOTAL_GCS_PS_Scavenge], Count, 316
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 10203
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 1.6998427274787082
[TOTAL_GCS_PS_MarkSweep], Count, 2
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 286
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.04764824267949726
[TOTAL_GCs], Count, 318
[TOTAL_GC_TIME], Time(ms), 10489
[TOTAL_GC_TIME_%], Time(%), 1.7474909701582058
[READ], Operations, 5799511
[READ], AverageLatency(us), 20715.723915516326
[READ], MinLatency(us), 115
[READ], MaxLatency(us), 793087
[READ], 95thPercentileLatency(us), 54847
[READ], 99thPercentileLatency(us), 130047
[READ], Return=OK, 5799511
[CLEANUP], Operations, 800
[CLEANUP], AverageLatency(us), 43.0025
[CLEANUP], MinLatency(us), 0
[CLEANUP], MaxLatency(us), 29631
[CLEANUP], 95thPercentileLatency(us), 16
[CLEANUP], 99thPercentileLatency(us), 29
[SCAN], Operations, 645887
[SCAN], AverageLatency(us), 185176.07673942964
[SCAN], MinLatency(us), 534
[SCAN], MaxLatency(us), 923135
[SCAN], 95thPercentileLatency(us), 348927
[SCAN], 99thPercentileLatency(us), 432383
[SCAN], Return=OK, 645887
 

> Support steal job queue mode for read RPC queues of RWQueueRpcExecutor
> --
>
> Key: HBASE-27766
> URL: https://issues.apache.org/jira/browse/HBASE-27766
> Project: HBase
>  Issue Type: Improvement
>  Components: rpc
>Affects Versions: 3.0.0-alpha-3, 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> Currently, the RPC queues are distinguished by request type. Under most 
> circumstances of RWQueueRpcExecutor there are write queues and read queues, 
> and the read queues are further divided between get requests and scan 
> requests. The reason why we isolate the scan requests from the get requests 
> is that we do not want large scans to block small gets.
> Since the handler resources of a regionserver are limited and we can't 
> dynamically change the handler ratio with the ratio of requests, we should 
> both keep large scans and small gets isolated, and let the idle handlers of 
> the smaller-ratio scans handle some gets when the get handlers are busy.
> This steal-queue idea can also be used in other circumstances, e.g. idle read 
> handlers stealing jobs from the write queues. 
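A toy sketch of the steal-queue behaviour described above, using plain BlockingQueues rather than the real RWQueueRpcExecutor dispatch (all names here are made up for illustration): a scan handler prefers its own queue and only pulls a get call when it would otherwise sit idle.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class StealQueueSketch {
  // A scan handler polls its own queue first and only steals a get call
  // when its own queue is empty.
  static Runnable nextCall(BlockingQueue<Runnable> scanQueue,
      BlockingQueue<Runnable> getQueue) throws InterruptedException {
    Runnable call = scanQueue.poll(100, TimeUnit.MILLISECONDS);
    if (call != null) {
      return call;
    }
    call = getQueue.poll();          // idle: help out with a pending get
    return call != null ? call : scanQueue.take();
  }

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<Runnable> scanQueue = new LinkedBlockingQueue<>();
    BlockingQueue<Runnable> getQueue = new LinkedBlockingQueue<>();
    getQueue.add(() -> System.out.println("get handled by an idle scan handler"));
    nextCall(scanQueue, getQueue).run();
  }
}
{code}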



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28037) Replication stuck after switching to new WAL but the queue is empty

2023-09-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-28037.

Resolution: Fixed

Merged to branch-2.4 and branch-2.5, thanks [~zhangduo] for reviewing.

> Replication stuck after switching to new WAL but the queue is empty
> ---
>
> Key: HBASE-28037
> URL: https://issues.apache.org/jira/browse/HBASE-28037
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-4, 2.5.5
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Blocker
> Fix For: 2.4.18, 2.5.6
>
>
> When the speed of consuming replication WALs is high and something goes wrong 
> while creating the new WAL, the switch of the replication source reader to 
> the new WAL in the queue may happen before the new WAL is created. The 
> replication then gets stuck since it can no longer consume the new WALs 
> afterwards. Restarting the RS where replication is stuck makes the 
> replication recover.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-28037) Replication stuck after switching to new WAL but the queue is empty

2023-09-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha reassigned HBASE-28037:
--

Assignee: Xiaolin Ha

> Replication stuck after switching to new WAL but the queue is empty
> ---
>
> Key: HBASE-28037
> URL: https://issues.apache.org/jira/browse/HBASE-28037
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-4, 2.5.5
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Blocker
> Fix For: 2.4.18, 2.5.6
>
>
> When the speed of consuming replication WALs is high and something goes wrong 
> while creating the new WAL, the switch of the replication source reader to 
> the new WAL in the queue may happen before the new WAL is created. The 
> replication then gets stuck since it can no longer consume the new WALs 
> afterwards. Restarting the RS where replication is stuck makes the 
> replication recover.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28047) Deadlock when opening mob files

2023-09-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-28047.

Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   3.0.0-beta-1
   Resolution: Fixed

Merged to master and branch-2+, thanks [~zhangduo] for reviewing.

> Deadlock when opening mob files
> ---
>
> Key: HBASE-28047
> URL: https://issues.apache.org/jira/browse/HBASE-28047
> Project: HBase
>  Issue Type: Bug
>  Components: mob
>Affects Versions: 3.0.0-alpha-4, 2.5.5
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
> Attachments: mobdeadlock.js
>
>
> The hashcode of the mob file name is used in MobFileCache to lock the cached 
> mob files, but hashcodes may collide and the IdLock is not reentrant. So when 
> opening a file that is not yet cached while evicting an opened one via LRU, 
> files with colliding hashcodes can cause a deadlock.
> [^mobdeadlock.js]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-28047) Deadlock when opening mob files

2023-09-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha reassigned HBASE-28047:
--

Assignee: Xiaolin Ha

> Deadlock when opening mob files
> ---
>
> Key: HBASE-28047
> URL: https://issues.apache.org/jira/browse/HBASE-28047
> Project: HBase
>  Issue Type: Bug
>  Components: mob
>Affects Versions: 3.0.0-alpha-4, 2.5.5
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Attachments: mobdeadlock.js
>
>
> The hashcode of the mob file name is used in MobFileCache to lock the cached 
> mob files, but hashcodes may collide and the IdLock is not reentrant. So when 
> opening a file that is not yet cached while evicting an opened one via LRU, 
> files with colliding hashcodes can cause a deadlock.
> [^mobdeadlock.js]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28037) Replication stuck after switching to new WAL but the queue is empty

2023-08-29 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759861#comment-17759861
 ] 

Xiaolin Ha commented on HBASE-28037:


Seems to have been introduced by HBASE-25596; we should stop the source reader 
when it is a recovered queue and is empty.

> Replication stuck after switching to new WAL but the queue is empty
> ---
>
> Key: HBASE-28037
> URL: https://issues.apache.org/jira/browse/HBASE-28037
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-4, 2.5.5
>Reporter: Xiaolin Ha
>Priority: Major
>
> When the speed of consuming replication WALs is high and something goes wrong 
> while creating the new WAL, the switch of the replication source reader to 
> the new WAL in the queue may happen before the new WAL is created. The 
> replication then gets stuck since it can no longer consume the new WALs 
> afterwards. Restarting the RS where replication is stuck makes the 
> replication recover.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28028) Read all compressed bytes to a byte array before submitting them to decompressor

2023-08-28 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759826#comment-17759826
 ] 

Xiaolin Ha commented on HBASE-28028:


(y) It brings hope of an end to the impressively unstable WAL compression + 
replication. I'll try it on our clusters. Thanks.

> Read all compressed bytes to a byte array before submitting them to 
> decompressor
> 
>
> Key: HBASE-28028
> URL: https://issues.apache.org/jira/browse/HBASE-28028
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28047) Deadlock when opening mob files

2023-08-27 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-28047:
--

 Summary: Deadlock when opening mob files
 Key: HBASE-28047
 URL: https://issues.apache.org/jira/browse/HBASE-28047
 Project: HBase
  Issue Type: Bug
  Components: mob
Affects Versions: 2.5.5, 3.0.0-alpha-4
Reporter: Xiaolin Ha
 Attachments: mobdeadlock.js

The hashcode of the mob file name is used in MobFileCache to lock the cached 
mob files, but hashcodes may collide and the IdLock is not reentrant. So when 
opening a file that is not yet cached while evicting an opened one via LRU, 
files with colliding hashcodes can cause a deadlock.

[^mobdeadlock.js]
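To illustrate why a non-reentrant lock keyed by hashcode can deadlock on colliding file names, here is a toy sketch (not the real IdLock or MobFileCache code); "Aa" and "BB" are the classic pair of strings with an equal hashCode():
{code:java}
import java.util.concurrent.ConcurrentHashMap;

public class HashIdLockSketch {
  // Non-reentrant lock keyed by an int id, like a simplified IdLock.
  private final ConcurrentHashMap<Integer, Thread> holders = new ConcurrentHashMap<>();

  void lock(int id) throws InterruptedException {
    while (holders.putIfAbsent(id, Thread.currentThread()) != null) {
      Thread.sleep(10);   // no reentrancy check: the same thread also waits here
    }
  }

  void unlock(int id) {
    holders.remove(id);
  }

  public static void main(String[] args) throws Exception {
    String openingFile = "Aa", evictedFile = "BB";   // same hashCode (2112)
    HashIdLockSketch idLock = new HashIdLockSketch();
    idLock.lock(openingFile.hashCode());    // opening a not-yet-cached mob file
    // Evicting another opened file whose name collides on hashcode inside the
    // same call path blocks forever: self-deadlock.
    idLock.lock(evictedFile.hashCode());
    System.out.println("never reached");
  }
}
{code}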



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28037) Replication stuck after switching to new WAL but the queue is empty

2023-08-21 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-28037:
--

 Summary: Replication stuck after switching to new WAL but the 
queue is empty
 Key: HBASE-28037
 URL: https://issues.apache.org/jira/browse/HBASE-28037
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 2.5.5, 3.0.0-alpha-4
Reporter: Xiaolin Ha


When the speed of consuming replication WALs is high and something goes wrong 
while creating the new WAL, the switch of the replication source reader to the 
new WAL in the queue may happen before the new WAL is created. The replication 
then gets stuck since it can no longer consume the new WALs afterwards. 
Restarting the RS where replication is stuck makes the replication recover.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28036) Record the replication start timestamp for tables/peers

2023-08-21 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-28036:
--

 Summary: Record the replication start timestamp for tables/peers
 Key: HBASE-28036
 URL: https://issues.apache.org/jira/browse/HBASE-28036
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 2.5.5, 3.0.0-alpha-4
Reporter: Xiaolin Ha


Currently, the peer info shown on the UI is missing the start timestamp of the 
replication. We need the start timestamp (e.g. the creation time of the peer, 
or the time when tables were added to the peer) to distinguish whether the 
replication is continuous or discontinuous.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27962) Introduce an AdaptiveFastPathRWRpcExecutor to make the W/R/S separations fit various workloads

2023-07-14 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743115#comment-17743115
 ] 

Xiaolin Ha commented on HBASE-27962:


I think this idea resolves the fixed ratio of handlers, but the shared ratio is 
still fixed. What's more, shared handlers break the isolated processing model 
of reads and writes, since a 'shared handler could run all the three kinds of 
requests'. This is a very serious problem when there are huge requests: shared 
handlers can let all types of requests get blocked, even though the shared part 
is only a fraction and not the whole, which makes slow-request problems hard to 
debug.

HBASE-27766 is not the same as HBASE-27962, though they are somewhat similar. 
HBASE-27766 keeps the isolation model of reads and writes; it keeps the 
original scan capacity and only steals jobs from the get queues when the scan 
handlers are idle. As a result, you do not need to worry about a bad handler 
ratio making the resources unbounded; all that needs care is the ratio of 
reads (scan + get) to writes.

Thanks.

> Introduce an AdaptiveFastPathRWRpcExecutor to make the W/R/S separations fit 
> various workloads 
> ---
>
> Key: HBASE-27962
> URL: https://issues.apache.org/jira/browse/HBASE-27962
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Major
>
> We currently use the FastPathRWQueueRpcExecutor, but the numbers of 
> read/write handlers are fixed, which makes the RegionServer performance not 
> so good in our prod env.
> The logic is described below:
>  * The basic architecture is the same as FastPathRWRpcExecutor.
>  * Introduce a float shared_ratio in (0, 1.0) to indicate the ratio of 
> shared handlers. (For example, with the ratio set to 0.2 and 100 handlers, 
> 50 for write, 25 for get, 25 for scan, there will be 10 + 5 + 5 shared 
> handlers and 40 isolated handlers for write, 20 for get and 20 for scan.)
>  * A shared handler can run all three kinds of requests.
>  * A shared handler is shared only when it is idle.
>  * A shared handler is also bound to one kind of RPC queue and processes the 
> requests in that queue first.
> This improvement will improve resource utilization under various workloads 
> while guaranteeing a level of R/W/S isolation for request processing at the 
> same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27963) Replication stuck when switch to new reader

2023-07-05 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27963:
--

 Summary: Replication stuck when switch to new reader
 Key: HBASE-27963
 URL: https://issues.apache.org/jira/browse/HBASE-27963
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 2.5.5, 2.4.17, 3.0.0-alpha-4
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


After creating a new reader for the next WAL, it immediately calls seek() to 
the currentPositionOfEntry, but this position may exceed the length of the 
current WAL.
{code:java}
WARN  
[RpcServer.default.FPRWQ.Fifo.read.handler=101,queue=1,port=16020.replicationSource.wal-reader.XXX]
 regionserver.ReplicationSourceWALReader: Failed to read stream of replication 
entries
java.io.EOFException: Cannot seek after EOF
        at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1488)
        at 
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.seekOnFs(ProtobufLogReader.java:495)
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.seek(ReaderBase.java:138)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.seek(WALEntryStream.java:399)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:341)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:328)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:347)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:310)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:300)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:176)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.tryAdvanceStreamAndCreateWALBatch(ReplicationSourceWALReader.java:260)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:142)
 {code}
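A minimal illustration of the guard that avoids this failure mode, using a plain RandomAccessFile instead of the real FSDataInputStream/WALEntryStream path (so this is only a sketch of the idea, not the actual fix): only seek to the remembered position when it still fits inside the newly opened file.
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

public class SafeSeekSketch {
  static void openAtPosition(String walPath, long currentPositionOfEntry) throws IOException {
    try (RandomAccessFile wal = new RandomAccessFile(walPath, "r")) {
      if (currentPositionOfEntry > 0 && currentPositionOfEntry <= wal.length()) {
        wal.seek(currentPositionOfEntry);  // resume inside the same WAL
      } else {
        wal.seek(0);                       // position belonged to the previous WAL
      }
      // ... continue reading entries from here
    }
  }
}
{code}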



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27927) Skip the length bytes of WALKey and WALEntry cells to avoid replication stuck with WAL compression

2023-06-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27927:
---
Summary: Skip the length bytes of WALKey and WALEntry cells to avoid 
replication stuck with WAL compression  (was: Zero length WALKey and WALEntry 
cells may cause replication with WAL compression stuck)

> Skip the length bytes of WALKey and WALEntry cells to avoid replication stuck 
> with WAL compression
> --
>
> Key: HBASE-27927
> URL: https://issues.apache.org/jira/browse/HBASE-27927
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> We found both of the two replication-stuck circumstances on our production 
> clusters, which enable 'hbase.regionserver.wal.enablecompression' 
> and 'hbase.regionserver.wal.value.compression.type'. 
> Zero-length WALKey and WALEntry cells should be skipped.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27927) Skip zero length bytes of WALKey and WALEntry cells to avoid replication stuck with WAL compression

2023-06-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27927:
---
Summary: Skip zero length bytes of WALKey and WALEntry cells to avoid 
replication stuck with WAL compression  (was: Skip the length bytes of WALKey 
and WALEntry cells to avoid replication stuck with WAL compression)

> Skip zero length bytes of WALKey and WALEntry cells to avoid replication 
> stuck with WAL compression
> ---
>
> Key: HBASE-27927
> URL: https://issues.apache.org/jira/browse/HBASE-27927
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> We found both of the two replication-stuck circumstances on our production 
> clusters, which enable 'hbase.regionserver.wal.enablecompression' 
> and 'hbase.regionserver.wal.value.compression.type'. 
> Zero-length WALKey and WALEntry cells should be skipped.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27927) Zero length WALKey and WALEntry cells may cause replication with WAL compression stuck

2023-06-13 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27927:
--

 Summary: Zero length WALKey and WALEntry cells may cause 
replication with WAL compression stuck
 Key: HBASE-27927
 URL: https://issues.apache.org/jira/browse/HBASE-27927
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 2.5.5, 2.4.17, 3.0.0-alpha-4
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


We found both of the two replication-stuck circumstances on our production 
clusters, which enable 'hbase.regionserver.wal.enablecompression' and 
'hbase.regionserver.wal.value.compression.type'. 

Zero-length WALKey and WALEntry cells should be skipped.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27926) DBB release too early for replication

2023-06-13 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27926:
--

 Summary: DBB release too early for replication
 Key: HBASE-27926
 URL: https://issues.apache.org/jira/browse/HBASE-27926
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 2.5.5, 2.4.17, 3.0.0-alpha-4
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


When the RS of the destination cluster acts as a client to forward the 
replicated entries and encounters an exception, the DBB will be released too 
early by calling RpcResponse#done() in NettyRpcServerResponseEncoder.

The coredump and log details are as follows,
{code:java}
Stack: [0x7f92d9e6d000,0x7f92d9f6e000],  sp=0x7f92d9f6be18,  free 
space=1019kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
C=native code)C  [libc.so.6+0x89db4]  _wordcopy_fwd_dest_aligned+0xd4
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)J 3297  
sun.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V (0 bytes) 
@ 0x7fad7d9aa267 [0x7fad7d9aa200+0x67]j  
org.apache.hadoop.hbase.util.UnsafeAccess.unsafeCopy(Ljava/lang/Object;JLjava/lang/Object;JJ)V+36j
  
org.apache.hadoop.hbase.util.UnsafeAccess.copy(Ljava/nio/ByteBuffer;I[BII)V+69j 
 
org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray([BLjava/nio/ByteBuffer;III)V+39j
  
org.apache.hadoop.hbase.CellUtil.copyQualifierTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+31J
 15658 C1 
org.apache.hadoop.hbase.CellUtil.cloneQualifier(Lorg/apache/hadoop/hbase/Cell;)[B
 (18 bytes) @ 0x7fad7e9a6c2c [0x7fad7e9a6aa0+0x18c]j  
org.apache.hadoop.hbase.ByteBufferKeyValue.getQualifierArray()[B+1j  
org.apache.hadoop.hbase.client.Mutation.cellToStringMap(Lorg/apache/hadoop/hbase/Cell;)Ljava/util/Map;+12j
  org.apache.hadoop.hbase.client.Mutation.toMap(I)Ljava/util/Map;+189j  
org.apache.hadoop.hbase.client.Operation.toJSON(I)Ljava/lang/String;+2j  
org.apache.hadoop.hbase.client.Operation.toString(I)Ljava/lang/String;+2j  
org.apache.hadoop.hbase.client.Operation.toString()Ljava/lang/String;+2J 8353 
C2 java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; 
(9 bytes) @ 0x7fad7ea0a1bc [0x7fad7ea0a180+0x3c]j  
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageLocationError(Lorg/apache/hadoop/hbase/client/Action;Ljava/lang/Exception;)V+28j
  
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.groupAndSendMultiAction(Ljava/util/List;I)V+163J
 23463 C2 
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.resubmit(Lorg/apache/hadoop/hbase/ServerName;Ljava/util/List;IILjava/lang/Throwable;)V
 (214 bytes) @ 0x7fad80effb54 [0x7fad80eff7a0+0x3b4]J 19097 C2 
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(Lorg/apache/hadoop/hbase/client/MultiAction;Lorg/apache/hadoop/hbase/ServerName;ILjava/lang/Throwable;Z)V
 (312 bytes) @ 0x7fad7ff53370 [0x7fad7ff52fa0+0x3d0]J 20201 C1 
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1600(Lorg/apache/hadoop/hbase/client/AsyncRequestFutureImpl;Lorg/apache/hadoop/hbase/client/MultiAction;Lorg/apache/hadoop/hbase/ServerName;ILjava/lang/Throwable;Z)V
 (12 bytes) @ 0x7fad803f31dc [0x7fad803f3180+0x5c]J 18619 C2 
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run()V
 (677 bytes) @ 0x7fad7f40a8b4 [0x7fad7f409160+0x1754]J 13220 C2 
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
 (225 bytes) @ 0x7fad7f3b3a28 [0x7fad7f3b38a0+0x188]J 10884 C1 
java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 
0x7fad7db53c44 [0x7fad7db53b40+0x104]J 7961 C1 java.lang.Thread.run()V 
(17 bytes) @ 0x7fad7d61bbfc [0x7fad7d61bac0+0x13c]v  
~StubRoutines::call_stubStack: [0x7f92d9e6d000,0x7f92d9f6e000],  
sp=0x7f92d9f6be18,  free space=1019kNative frames: (J=compiled Java code, 
j=interpreted, Vv=VM code, C=native code)C  [libc.so.6+0x89db4]  
_wordcopy_fwd_dest_aligned+0xd4
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)J 3297  
sun.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V (0 bytes) 
@ 0x7fad7d9aa267 [0x7fad7d9aa200+0x67]j  
org.apache.hadoop.hbase.util.UnsafeAccess.unsafeCopy(Ljava/lang/Object;JLjava/lang/Object;JJ)V+36j
  
org.apache.hadoop.hbase.util.UnsafeAccess.copy(Ljava/nio/ByteBuffer;I[BII)V+69j 
 
org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray([BLjava/nio/ByteBuffer;III)V+39j
  
org.apache.hadoop.hbase.CellUtil.copyQualifierTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+31J
 15658 C1 
org.apache.hadoop.hbase.CellUtil.cloneQualifier(Lorg/apache/hadoop/hbase/Cell;)[B
 (18 bytes) @ 0x7fad7e9a6c2c [0x7fad7e9a6aa0+0x18c]j  
org.apache.hadoop.hbase.ByteBufferKeyValue.getQualifierArray()[B+1j  

[jira] [Resolved] (HBASE-27897) ConnectionImplementation#locateRegionInMeta should pause and retry when taking user region lock failed

2023-06-07 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-27897.

Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   Resolution: Fixed

Merged to branch-2, branch-2.4 and branch-2.5, thanks [~wchevreuil] for 
reviewing!

> ConnectionImplementation#locateRegionInMeta should pause and retry when 
> taking user region lock failed
> --
>
> Key: HBASE-27897
> URL: https://issues.apache.org/jira/browse/HBASE-27897
> Project: HBase
>  Issue Type: Improvement
>  Components: Client
>Affects Versions: 2.4.17, 2.5.4
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6
>
>
> It just throws an exception and skips the pause-and-retry logic when 
> ConnectionImplementation#takeUserRegionLock fails. In some circumstances, 
> without a pause and retry in the outer logic, the next 
> ConnectionImplementation#takeUserRegionLock will still fail, since all the 
> threads grab the lock simultaneously.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27712) Remove unused params in region metrics

2023-06-01 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27712:
---
Affects Version/s: 3.0.0-alpha-3

> Remove unused params in region metrics
> --
>
> Key: HBASE-27712
> URL: https://issues.apache.org/jira/browse/HBASE-27712
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-3
>Reporter: tianhang tang
>Assignee: tianhang tang
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>
> Histogram metrics in region have been removed in HBASE-17017, but some 
> time-cost params are still left.
> They need to be removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27712) Remove unused params in region metrics

2023-06-01 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-27712.

Fix Version/s: 3.0.0-alpha-4
   Resolution: Fixed

Merged to master, thanks [~tangtianhang] for contributing, and thanks 
[~zhangduo] for reviewing.

> Remove unused params in region metrics
> --
>
> Key: HBASE-27712
> URL: https://issues.apache.org/jira/browse/HBASE-27712
> Project: HBase
>  Issue Type: Bug
>Reporter: tianhang tang
>Assignee: tianhang tang
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>
> Histogram metrics in region have been removed in HBASE-17017, but some 
> time-cost params are still left.
> They need to be removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27897) ConnectionImplementation#locateRegionInMeta should pause and retry when taking user region lock failed

2023-05-29 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27897:
--

 Summary: ConnectionImplementation#locateRegionInMeta should pause 
and retry when taking user region lock failed
 Key: HBASE-27897
 URL: https://issues.apache.org/jira/browse/HBASE-27897
 Project: HBase
  Issue Type: Improvement
  Components: Client
Affects Versions: 2.5.4, 2.4.17
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


It just throws an exception and skips the pause-and-retry logic when 
ConnectionImplementation#takeUserRegionLock fails. In some circumstances, 
without a pause and retry in the outer logic, the next 
ConnectionImplementation#takeUserRegionLock will still fail, since all the 
threads grab the lock simultaneously.
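A minimal sketch of the pause-and-retry behaviour this issue asks for, with a plain ReentrantLock standing in for the user region lock and made-up retry/pause parameters (illustrative only, not the ConnectionImplementation code):
{code:java}
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class PauseAndRetryLockSketch {
  static void takeUserRegionLock(ReentrantLock userRegionLock, int maxRetries, long pauseMs)
      throws IOException, InterruptedException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      if (userRegionLock.tryLock(2, TimeUnit.SECONDS)) {
        return;                              // got the lock, go locate the region in meta
      }
      Thread.sleep(pauseMs * (attempt + 1)); // pause before retrying instead of failing at once
    }
    throw new IOException("Failed to take user region lock after " + maxRetries + " retries");
  }
}
{code}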



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27683) Should support single call queue mode for RPC handlers while separating by request type

2023-05-29 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727054#comment-17727054
 ] 

Xiaolin Ha commented on HBASE-27683:


The YCSB test results are as follows,

common configs are: hbase.regionserver.handler.count=200, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.scan.ratio=0.5

while the config for multi queue is: hbase.ipc.server.callqueue.handler.factor=0.1,

and the config for single queue (one read queue and one write queue, i.e. two 
call queues in total for the RS) is: 
hbase.ipc.server.callqueue.handler.factor=0.01

!image-2023-05-29-17-16-55-133.png|width=625,height=357!

> Should support single call queue mode for RPC handlers while separating by 
> request type
> ---
>
> Key: HBASE-27683
> URL: https://issues.apache.org/jira/browse/HBASE-27683
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, rpc
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Attachments: image-2023-05-29-17-16-55-133.png
>
>
> Currently we not only separate call queues by request type, e.g. read, write, 
> scan, but also distinguish queues for handlers by the config 
> `hbase.ipc.server.callqueue.handler.factor`, whose description is as follows,
> {code:java}
> Factor to determine the number of call queues.
>   A value of 0 means a single queue shared between all the handlers.
>   A value of 1 means that each handler has its own queue. {code}
> But I think what we want is not only one queue for all the requests, or each 
> handler having its own queue. We also want each request type to have its own 
> queue.
> Distinguishing queues within the same type of requests makes some handlers 
> too idle and some too busy under the current balanced/random RPC executor 
> framework. In the extreme case where each handler has its own queue, if a 
> large request arrives for a handler, then, since the executor dispatches 
> calls without considering the queue size or the state of the handler, the 
> subsequently arriving requests are queued until the handler completes the 
> large slow request. Other handlers may process small requests quickly, but 
> they cannot help by grabbing calls from the busy queue; they must wait for 
> jobs to arrive in their own queue. So we can see that the queue time of some 
> requests is long even though there are idle handlers.
> We can also see circumstances where the queue time of calls is much larger 
> than the process time, sometimes twice or more. Restarting the slow RS makes 
> these problems disappear. 
> By using a single call queue for each request type, we can fully use the 
> handler resources.
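For reference, a rough sketch of how the handler factor maps to a call-queue count; the max(1, round(handlers * factor)) formula below is an assumption about RpcExecutor-style dispatchers, not a quote of the HBase code:
{code:java}
public class CallQueueCountSketch {
  // factor 0 collapses to one shared queue, factor 1 approaches one queue per handler
  static int numCallQueues(int handlerCount, float callQueuesHandlersFactor) {
    return Math.max(1, Math.round(handlerCount * callQueuesHandlersFactor));
  }

  public static void main(String[] args) {
    System.out.println(numCallQueues(100, 0.1f));  // 10 queues
    System.out.println(numCallQueues(100, 0.01f)); // 1 shared queue per request type
    System.out.println(numCallQueues(100, 1.0f));  // one queue per handler
  }
}
{code}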



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27683) Should support single call queue mode for RPC handlers while separating by request type

2023-05-29 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27683:
---
Attachment: image-2023-05-29-17-16-55-133.png

> Should support single call queue mode for RPC handlers while separating by 
> request type
> ---
>
> Key: HBASE-27683
> URL: https://issues.apache.org/jira/browse/HBASE-27683
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, rpc
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Attachments: image-2023-05-29-17-16-55-133.png
>
>
> Currently we not only separate call queues by request type, e.g. read, write, 
> scan, but also distinguish queues for handlers by the config 
> `hbase.ipc.server.callqueue.handler.factor`, whose description is as follows,
> {code:java}
> Factor to determine the number of call queues.
>   A value of 0 means a single queue shared between all the handlers.
>   A value of 1 means that each handler has its own queue. {code}
> But I think what we want is not only one queue for all the requests, or each 
> handler having its own queue. We also want each request type to have its own 
> queue.
> Distinguishing queues within the same type of requests makes some handlers 
> too idle and some too busy under the current balanced/random RPC executor 
> framework. In the extreme case where each handler has its own queue, if a 
> large request arrives for a handler, then, since the executor dispatches 
> calls without considering the queue size or the state of the handler, the 
> subsequently arriving requests are queued until the handler completes the 
> large slow request. Other handlers may process small requests quickly, but 
> they cannot help by grabbing calls from the busy queue; they must wait for 
> jobs to arrive in their own queue. So we can see that the queue time of some 
> requests is long even though there are idle handlers.
> We can also see circumstances where the queue time of calls is much larger 
> than the process time, sometimes twice or more. Restarting the slow RS makes 
> these problems disappear. 
> By using a single call queue for each request type, we can fully use the 
> handler resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27881) The sleep time in checkQuota of replication WAL reader should be controlled independently

2023-05-23 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27881:
---
Description: In theory the sleep time when checking quota failed in 
replication WAL reader should match the consume ability of the memory. But at 
the very least we should isolate the configuration here with the sleep time for 
common circumstances when replicating, e.g. sleep before reading from the head 
of the WAL, to avoid a little bit larger but reasonable sleep time(e.g. 3s) can 
make the consume speed always blocked by checking quota or cannot recover the 
consume speed in a very long time(except there exists long time of low source 
WAL production peak).  (was: In theory the sleep time when checking quota 
failed in replication WAL reader should match the consume ability of the 
memory. But at the very least we should isolate the configuration here with the 
sleep time for common circumstances when replicating, e.g. sleep before reading 
from the head of the WAL, to avoid a little bit larger but reasonable sleep 
time(e.g. 3s) making consume speed always blocked by checking quota.)

> The sleep time in checkQuota of replication WAL reader should be controlled 
> independently 
> --
>
> Key: HBASE-27881
> URL: https://issues.apache.org/jira/browse/HBASE-27881
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 3.0.0-alpha-3, 2.5.4
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
>
> In theory the sleep time when checking quota failed in replication WAL reader 
> should match the consume ability of the memory. But at the very least we 
> should isolate the configuration here with the sleep time for common 
> circumstances when replicating, e.g. sleep before reading from the head of 
> the WAL, to avoid a little bit larger but reasonable sleep time(e.g. 3s) can 
> make the consume speed always blocked by checking quota or cannot recover the 
> consume speed in a very long time(except there exists long time of low source 
> WAL production peak).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27881) The sleep time in checkQuota of replication WAL reader should be controlled independently

2023-05-23 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27881:
---
Description: In theory the sleep time when checking quota failed in 
replication WAL reader should match the consume ability of the memory. But at 
the very least we should isolate the configuration here with the sleep time for 
common circumstances when replicating, e.g. sleep before reading from the head 
of the WAL, to avoid a little bit larger but reasonable sleep time(e.g. 3s) 
making consume speed always blocked by checking quota.  (was: In theory the 
sleep time when checking quota failed in replication WAL reader should match 
the consume ability of the memory. But at the very least we should isolate the 
configuration here with the sleep time for common circumstances when 
replicating, e.g. sleep before reading from the head of the WAL, to avoid a 
little bit bigger sleep time(e.g. 3s) making consume speed always blocked by 
checking quota.)

> The sleep time in checkQuota of replication WAL reader should be controlled 
> independently 
> --
>
> Key: HBASE-27881
> URL: https://issues.apache.org/jira/browse/HBASE-27881
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 3.0.0-alpha-3, 2.5.4
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
>
> In theory the sleep time when checking quota failed in replication WAL reader 
> should match the consume ability of the memory. But at the very least we 
> should isolate the configuration here with the sleep time for common 
> circumstances when replicating, e.g. sleep before reading from the head of 
> the WAL, to avoid a little bit larger but reasonable sleep time(e.g. 3s) 
> making consume speed always blocked by checking quota.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27882) Avoid always reinit the decompressor in the hot read path

2023-05-23 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27882:
--

 Summary: Avoid always reinit the decompressor in the hot read path
 Key: HBASE-27882
 URL: https://issues.apache.org/jira/browse/HBASE-27882
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 2.5.4, 3.0.0-alpha-3
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha
 Attachments: image-2023-05-24-11-06-48-569.png

When seting "hbase.block.data.cachecompressed=true", the cached blocks are 
decompressed when reading. But we are using pooled decompressors here, which 
means the decompressor configs should be refreshed as a prepare job before each 
decompressing, see the line here 

[https://github.com/apache/hbase/blob/22526a6339afa230679bcf08fa1c917b04cdac6d/hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/HFileBlockDefaultDecodingContext.java#L99]

I have pointed out the lock of Configuration.get problem in HBASE-27672, it 
should be avoid when reiniting in the hot read path either. 

!image-2023-05-24-11-06-48-569.png|width=668,height=286!
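A generic sketch of the pattern this issue argues for, with a made-up example key and placeholder decompression logic (not the HFileBlockDefaultDecodingContext code): read the Configuration once and reuse the parsed value, instead of going through Configuration.get on every block decompression.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class CachedDecompressorConfigSketch {
  private final int bufferSize;   // derived from the Configuration exactly once

  CachedDecompressorConfigSketch(Configuration conf) {
    // "io.file.buffer.size" is only an example key; the point is the single lookup.
    this.bufferSize = conf.getInt("io.file.buffer.size", 4096);
  }

  int decompressBlock(byte[] compressed) {
    // hot path: no Configuration access here, only the cached value is consulted;
    // the return value is a placeholder for the real decompression work
    return Math.min(compressed.length, bufferSize);
  }
}
{code}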

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27672) Read RPC threads may BLOCKED at the Configuration.get when using java compression

2023-05-23 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27672:
---
Description: 
As in the jstack info, we can see some RPC threads or compaction threads 
BLOCKED,

!image-2023-02-27-19-22-52-704.png|width=976,height=355!

  was:
As in the jstack info, we can see some RPC threads or compaction threads BLOCK,

!image-2023-02-27-19-22-52-704.png|width=976,height=355!


> Read RPC threads may BLOCKED at the Configuration.get when using java 
> compression
> -
>
> Key: HBASE-27672
> URL: https://issues.apache.org/jira/browse/HBASE-27672
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: image-2023-02-27-19-22-52-704.png
>
>
> As in the jstack info, we can see some RPC threads or compaction threads 
> BLOCKED,
> !image-2023-02-27-19-22-52-704.png|width=976,height=355!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27881) The sleep time in checkQuota of replication WAL reader should be controlled independently

2023-05-23 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27881:
--

 Summary: The sleep time in checkQuota of replication WAL reader 
should be controlled independently 
 Key: HBASE-27881
 URL: https://issues.apache.org/jira/browse/HBASE-27881
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 2.5.4, 3.0.0-alpha-3
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


In theory, the sleep time used when the quota check fails in the replication 
WAL reader should match how fast the memory can be consumed. But at the very 
least we should separate this configuration from the sleep time used in common 
replication circumstances, e.g. sleeping before reading from the head of the 
WAL, to avoid a slightly bigger sleep time (e.g. 3s) keeping the consume speed 
always blocked by the quota check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27766) Support steal job queue mode for read RPC queues of RWQueueRpcExecutor

2023-03-29 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27766:
---
Description: 
Currently, the RPC queues are distinguished by request type, under most 
circumstances of RWQueueRpcExecutor, there are write queues and read queues, 
while reads queues are always divided by get requests and scan requests. The 
reason why we isolate the scan requests and get requests is that we do not want 
large scans block small gets.

Since the handler resources for a regionserver is limited and we can't 
dynamicly change the handler ratio by the ratio of requests. We should both 
keep large scan and the small gets be isolated, and let the idle handlers for 
the samller ratio scans to handle some gets when the gets handlers are busy.

This steal queue idea can also be used in other circumstances, e.g. idle read 
handler steal jobs from write queus. 

  was:
Currently, the RPC queues are distinguished by request type, under most 
circumstances of RWQueueRpcExecutor, there are write queues and read queues, 
while reads queues are always divided by get requests and scan requests. The 
reason why we isolate the scan requests and get requests is that we do not want 
large scans block small gets.

Since the handler resources for a regionserver is limited and we can't 
dynamicly change the handler ratio by the ratio of requests. We should both 
keep large scan and the small gets be isolated, and let the idle handlers for 
the samller ratio scans to handle some gets when the gets handlers are busy.

This steal queue idea can also used in other circumstances, e.g. idle read 
handler steal jobs from write queus. 


> Support steal job queue mode for read RPC queues of RWQueueRpcExecutor
> --
>
> Key: HBASE-27766
> URL: https://issues.apache.org/jira/browse/HBASE-27766
> Project: HBase
>  Issue Type: Improvement
>  Components: rpc
>Affects Versions: 3.0.0-alpha-3, 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> Currently, the RPC queues are distinguished by request type, under most 
> circumstances of RWQueueRpcExecutor, there are write queues and read queues, 
> while reads queues are always divided by get requests and scan requests. The 
> reason why we isolate the scan requests and get requests is that we do not 
> want large scans block small gets.
> Since the handler resources for a regionserver is limited and we can't 
> dynamicly change the handler ratio by the ratio of requests. We should both 
> keep large scan and the small gets be isolated, and let the idle handlers for 
> the samller ratio scans to handle some gets when the gets handlers are busy.
> This steal queue idea can also be used in other circumstances, e.g. idle read 
> handler steal jobs from write queus. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27766) Support steal job queue mode for read RPC queues of RWQueueRpcExecutor

2023-03-29 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27766:
---
Description: 
Currently, the RPC queues are distinguished by request type, under most 
circumstances of RWQueueRpcExecutor, there are write queues and read queues, 
while reads queues are always divided by get requests and scan requests. The 
reason why we isolate the scan requests and get requests is that we do not want 
large scans block small gets.

Since the handler resources for a regionserver is limited and we can't 
dynamicly change the handler ratio by the ratio of requests. We should both 
keep large scan and the small gets be isolated, and let the idle handlers for 
the samller ratio scans to handle some gets when the gets handlers are busy.

This steal queue idea can also used in other circumstances, e.g. idle read 
handler steal jobs from write queus. 

  was:
Currently, the RPC queues are distinguished by request type, under most 
circumstances of RWQueueRpcExecutor, there are write queues and read queues, 
while reads queues are always divided by get requests and scan requests. The 
reason why we isolate the scan requests and get requests is that we do not want 
large scans block small gets.

Since the handler resources for a regionserver is limited and we can't 
dynamicly change the handler ratio by the ratio of requests. We should both 
keep large scan the small gets be isolated, and let the idle handlers for the 
samller ratio scans to handle some gets when the gets handlers are busy.

This steal queue idea can also used in other circumstances, e.g. idle read 
handler steal jobs from write queus. 


> Support steal job queue mode for read RPC queues of RWQueueRpcExecutor
> --
>
> Key: HBASE-27766
> URL: https://issues.apache.org/jira/browse/HBASE-27766
> Project: HBase
>  Issue Type: Improvement
>  Components: rpc
>Affects Versions: 3.0.0-alpha-3, 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> Currently, the RPC queues are distinguished by request type, under most 
> circumstances of RWQueueRpcExecutor, there are write queues and read queues, 
> while reads queues are always divided by get requests and scan requests. The 
> reason why we isolate the scan requests and get requests is that we do not 
> want large scans block small gets.
> Since the handler resources for a regionserver is limited and we can't 
> dynamicly change the handler ratio by the ratio of requests. We should both 
> keep large scan and the small gets be isolated, and let the idle handlers for 
> the samller ratio scans to handle some gets when the gets handlers are busy.
> This steal queue idea can also used in other circumstances, e.g. idle read 
> handler steal jobs from write queus. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27766) Support steal job queue mode for read RPC queues of RWQueueRpcExecutor

2023-03-29 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27766:
--

 Summary: Support steal job queue mode for read RPC queues of 
RWQueueRpcExecutor
 Key: HBASE-27766
 URL: https://issues.apache.org/jira/browse/HBASE-27766
 Project: HBase
  Issue Type: Improvement
  Components: rpc
Affects Versions: 2.5.3, 3.0.0-alpha-3
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


Currently, the RPC queues are distinguished by request type. Under most 
circumstances of RWQueueRpcExecutor there are write queues and read queues, 
and the read queues are further divided between get requests and scan requests. The 
reason why we isolate the scan requests and get requests is that we do not want 
large scans to block small gets.

Since the handler resources of a regionserver are limited and we can't 
dynamically change the handler ratio according to the ratio of requests, we should 
keep large scans and small gets isolated, and let the idle handlers for 
the smaller-share scan queues handle some gets when the get handlers are busy.

This steal-queue idea can also be used in other circumstances, e.g. idle read 
handlers stealing jobs from write queues. 
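
As a purely illustrative sketch of the steal-queue idea above (this is not HBase's 
RpcExecutor code; the class name, queue wiring and poll timeout are hypothetical), a 
scan handler could poll its own queue first and only steal a pending call from the 
get queue when it is idle:
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class StealingHandler implements Runnable {
  private final BlockingQueue<Runnable> ownQueue;   // e.g. the scan call queue
  private final BlockingQueue<Runnable> stealQueue; // e.g. the get call queue

  public StealingHandler(BlockingQueue<Runnable> ownQueue, BlockingQueue<Runnable> stealQueue) {
    this.ownQueue = ownQueue;
    this.stealQueue = stealQueue;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Prefer calls from the handler's own queue.
        Runnable call = ownQueue.poll(100, TimeUnit.MILLISECONDS);
        if (call == null) {
          // Own queue is idle: help out by stealing a pending call from the other queue.
          call = stealQueue.poll();
        }
        if (call != null) {
          call.run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}
The same wiring works in the write-steals-from-read direction; only the queue handed 
in as stealQueue changes.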



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27333) Abort RS when the hostname is different from master seen

2023-03-29 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27333:
---
Affects Version/s: 2.5.3
   (was: 2.4.13)

> Abort RS when the hostname is different from master seen
> 
>
> Key: HBASE-27333
> URL: https://issues.apache.org/jira/browse/HBASE-27333
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 3.0.0-alpha-3, 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> For HRegionServer#handleReportForDutyResponse, when the hostname on the 
> regionserver side is different from the one the master sees, both of the two 
> conditions should abort the RS.
> {code:java}
> if (
>   !StringUtils.isBlank(useThisHostnameInstead)
> && !hostnameFromMasterPOV.equals(useThisHostnameInstead)
> ) {
>   String msg = "Master passed us a different hostname to use; was="
> + this.useThisHostnameInstead + ", but now=" + hostnameFromMasterPOV;
>   LOG.error(msg);
>   throw new IOException(msg);
> }
> if (
>   StringUtils.isBlank(useThisHostnameInstead)
> && 
> !hostnameFromMasterPOV.equals(rpcServices.getSocketAddress().getHostName())
> ) {
>   String msg = "Master passed us a different hostname to use; was="
> + rpcServices.getSocketAddress().getHostName() + ", but now=" + 
> hostnameFromMasterPOV;
>   LOG.error(msg);
> } {code}
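
A hedged sketch of the likely shape of the fix (it reuses the fields and imports from 
the snippet above and is not necessarily the committed patch): the second branch 
should also throw, so the regionserver aborts instead of only logging the mismatch.
{code:java}
if (
  StringUtils.isBlank(useThisHostnameInstead)
    && !hostnameFromMasterPOV.equals(rpcServices.getSocketAddress().getHostName())
) {
  String msg = "Master passed us a different hostname to use; was="
    + rpcServices.getSocketAddress().getHostName() + ", but now=" + hostnameFromMasterPOV;
  LOG.error(msg);
  // Previously only logged; throwing makes handleReportForDutyResponse abort the RS.
  throw new IOException(msg);
}
{code}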



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27333) Abort RS when the hostname is different from master seen

2023-03-29 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27333:
---
Fix Version/s: (was: 2.5.4)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Merged to master and branch-2, thanks [~apurtell] for reviewing.

> Abort RS when the hostname is different from master seen
> 
>
> Key: HBASE-27333
> URL: https://issues.apache.org/jira/browse/HBASE-27333
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 3.0.0-alpha-3, 2.4.13
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> For HRegionServer#handleReportForDutyResponse, when the hostname on the 
> regionserver side is different from the one the master sees, both of the two 
> conditions should abort the RS.
> {code:java}
> if (
>   !StringUtils.isBlank(useThisHostnameInstead)
> && !hostnameFromMasterPOV.equals(useThisHostnameInstead)
> ) {
>   String msg = "Master passed us a different hostname to use; was="
> + this.useThisHostnameInstead + ", but now=" + hostnameFromMasterPOV;
>   LOG.error(msg);
>   throw new IOException(msg);
> }
> if (
>   StringUtils.isBlank(useThisHostnameInstead)
> && 
> !hostnameFromMasterPOV.equals(rpcServices.getSocketAddress().getHostName())
> ) {
>   String msg = "Master passed us a different hostname to use; was="
> + rpcServices.getSocketAddress().getHostName() + ", but now=" + 
> hostnameFromMasterPOV;
>   LOG.error(msg);
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27322) The process and queue time metrics of read and write calls on the server side should be separated

2023-03-21 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27322:
---
Summary: The process and queue time metrics of read and write calls on the 
server side should be separated  (was: The processing call and dequeue time 
metrics from the regionserver side should be separated)

> The process and queue time metrics of read and write calls on the server side 
> should be separated
> -
>
> Key: HBASE-27322
> URL: https://issues.apache.org/jira/browse/HBASE-27322
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 3.0.0-alpha-3, 2.4.13
>Reporter: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> The process time and queue time vary widely between read and write requests. 
> We should separate them so that the metrics are more accurate for each request type.
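
As a minimal illustration of the idea (this is not HBase's MetricsHBaseServer API; the 
class and field names are made up), the server could keep separate accumulators per 
request kind instead of one shared process/queue pair:
{code:java}
import java.util.concurrent.atomic.LongAdder;

public class CallTimeMetricsSketch {
  public enum Kind { READ, WRITE }

  private final LongAdder readQueueTimeMs = new LongAdder();
  private final LongAdder readProcessTimeMs = new LongAdder();
  private final LongAdder writeQueueTimeMs = new LongAdder();
  private final LongAdder writeProcessTimeMs = new LongAdder();

  /** Record one finished call under the accumulators of its kind. */
  public void update(Kind kind, long queueTimeMs, long processTimeMs) {
    if (kind == Kind.READ) {
      readQueueTimeMs.add(queueTimeMs);
      readProcessTimeMs.add(processTimeMs);
    } else {
      writeQueueTimeMs.add(queueTimeMs);
      writeProcessTimeMs.add(processTimeMs);
    }
  }

  public long readProcessTimeMs() { return readProcessTimeMs.sum(); }
  public long writeProcessTimeMs() { return writeProcessTimeMs.sum(); }
}
{code}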



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27322) The process and queue time metrics of read and write calls on the server side should be separated

2023-03-21 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27322:
---
Affects Version/s: 2.5.3
   (was: 2.4.13)

> The process and queue time metrics of read and write calls on the server side 
> should be separated
> -
>
> Key: HBASE-27322
> URL: https://issues.apache.org/jira/browse/HBASE-27322
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 3.0.0-alpha-3, 2.5.3
>Reporter: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> The process time and queue time vary widely between read and write requests. 
> We should separate them so that the metrics are more accurate for each request type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-27322) The process and queue time metrics of read and write calls on the server side should be separated

2023-03-21 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha reassigned HBASE-27322:
--

Assignee: Xiaolin Ha

> The process and queue time metrics of read and write calls on the server side 
> should be separated
> -
>
> Key: HBASE-27322
> URL: https://issues.apache.org/jira/browse/HBASE-27322
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 3.0.0-alpha-3, 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> The process time and queue time vary widely between read and write requests. 
> We should separate them so that the metrics are more accurate for each request type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-21 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-27676.

Resolution: Fixed

Merged to branch-2+ and master, thanks [~bbeaudreault] for reviewing.

> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 150 * 0.5 * 
> 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
> queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> We can see from the codes,
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.
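
To make the arithmetic above concrete, here is a small, purely illustrative sketch 
(not the committed HBase patch; the guard at the end is only one possible fix) that 
reproduces the split and forces at least one scan queue whenever scan handlers exist:
{code:java}
public class ScanQueueSplitSketch {
  public static void main(String[] args) {
    int handlerCount = 150;     // hbase.regionserver.handler.count
    double handlerFactor = 0.1; // hbase.ipc.server.callqueue.handler.factor
    double readShare = 0.5;     // hbase.ipc.server.callqueue.read.ratio
    double scanShare = 0.1;     // hbase.ipc.server.callqueue.scan.ratio

    int numQueues = (int) Math.round(handlerCount * handlerFactor);   // 15
    int readQueues = (int) Math.floor(numQueues * readShare);         // 7
    int readHandlers = (int) Math.floor(handlerCount * readShare);    // 75

    int scanQueues = (int) Math.floor(readQueues * scanShare);        // 0 <- the problem
    int scanHandlers = (int) Math.floor(readHandlers * scanShare);    // 7

    // Hypothetical guard: scan handlers should always get at least one queue.
    if (scanHandlers > 0 && scanQueues == 0 && readQueues > 1) {
      scanQueues = 1;
    }

    System.out.println("readQueues=" + (readQueues - scanQueues)
      + " scanQueues=" + scanQueues
      + " readHandlers=" + (readHandlers - scanHandlers)
      + " scanHandlers=" + scanHandlers);
  }
}
{code}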



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-21 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Fix Version/s: 2.6.0
   3.0.0-alpha-4
   2.4.17
   2.5.4

> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 150 * 0.5 * 
> 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
> queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> We can see from the codes,
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27717) Add rsgroup name for dead region servers on master UI

2023-03-20 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703041#comment-17703041
 ] 

Xiaolin Ha commented on HBASE-27717:


[~zhangduo] I can't assign this issue to [~xieyupei] , could you help to take a 
look?

> Add rsgroup name for dead region servers on master UI
> -
>
> Key: HBASE-27717
> URL: https://issues.apache.org/jira/browse/HBASE-27717
> Project: HBase
>  Issue Type: Improvement
>  Components: UI
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Priority: Minor
>
> We also want to know the rsgroup name of dead region servers, which are 
> shown in the `Dead Region Servers` section of the master UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27717) Add rsgroup name for dead region servers on master UI

2023-03-20 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703022#comment-17703022
 ] 

Xiaolin Ha commented on HBASE-27717:


[~xieyupei] OK

> Add rsgroup name for dead region servers on master UI
> -
>
> Key: HBASE-27717
> URL: https://issues.apache.org/jira/browse/HBASE-27717
> Project: HBase
>  Issue Type: Improvement
>  Components: UI
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Priority: Minor
>
> We also want to know the rsgroup name of dead region servers, which are 
> shown in the `Dead Region Servers` section of the master UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-27717) Add rsgroup name for dead region servers on master UI

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha reassigned HBASE-27717:
--

Assignee: (was: Xiaolin Ha)

> Add rsgroup name for dead region servers on master UI
> -
>
> Key: HBASE-27717
> URL: https://issues.apache.org/jira/browse/HBASE-27717
> Project: HBase
>  Issue Type: Improvement
>  Components: UI
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Priority: Minor
>
> We also want to know the rsgroup name of dead region servers, which are 
> shown in the `Dead Region Servers` section of the master UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Description: 
This issue tries to avoid leaving some scan handlers with NO scan queues.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 150 * 0.5 * 
0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC queues.

When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.

We can see from the codes,
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some idle scan handlers with NO scan queues.

  was:
This issue is try to avoid NO scan queues for some scan handlers.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 150 * 0.5 * 
{*}{*}0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
queues.

When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.

We can see from the codes,
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some iddle scan handlers with NO scan queues.


> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 150 * 0.5 * 
> 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
> queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> We can see from the codes,
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Description: 
This issue tries to avoid leaving some scan handlers with NO scan queues.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
150*0.5{_}*{_}0.1=7 scan handlers, but there are 150*0.1{_}*{_}0.5*0.1=0 scan 
RPC queues.

When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.

We can see from the codes,
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some idle scan handlers with NO scan queues.

  was:
This issue is try to avoid NO scan queues for some scan handlers.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
150{_}0.5{_}0.1=7 scan handlers, but there are 150{_}0.1{_}0.5*0.1=0 scan RPC 
queues.

When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.

We can see from the codes,
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some iddle scan handlers with NO scan queues.


> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
> 150 * 0.5 * 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan 
> RPC queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> We can see from the codes,
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Description: 
This issue tries to avoid leaving some scan handlers with NO scan queues.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 150 * 0.5 * 
0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
queues.

When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.

We can see from the codes,
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some idle scan handlers with NO scan queues.

  was:
This issue is try to avoid NO scan queues for some scan handlers.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
150*0.5{_}*{_}0.1=7 scan handlers, but there are 150*0.1{_}*{_}0.5*0.1=0 scan 
RPC queues.

When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.

We can see from the codes,
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some iddle scan handlers with NO scan queues.


> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 150 * 0.5 * 
> 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
> queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> We can see from the codes,
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Description: 
This issue tries to avoid leaving some scan handlers with NO scan queues.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
150 * 0.5 * 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
queues.

When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.

We can see from the codes,
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some idle scan handlers with NO scan queues.

  was:
This issue is try to avoid NO scan queues for some scan handlers.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
150{_}0.5{_}0.1=7 scan handlers, but there are 150{_}0.1{_}0.5*0.1=0 scan RPC 
queues.


When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some iddle scan handlers with NO scan queues.


> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
> 150 * 0.5 * 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
> queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> We can see from the codes,
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> when readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Description: 
This issue tries to avoid leaving some scan handlers with NO scan queues.
For example, if we set hbase.regionserver.handler.count=150, 
hbase.ipc.server.callqueue.scan.ratio=0.1, 
hbase.ipc.server.callqueue.read.ratio=0.5, 
hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
150 * 0.5 * 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
queues.


When there are no scan rpc queues, all the scan and get requests will be 
dispatched to the read rpc queues, while we thought they had been dealt with 
separately, since the scan handler count is not 0. When there are not enough 
handlers for large scan requests under this circumstance, the small get 
requests will be blocked in the rpc queues.
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some idle scan handlers with NO scan queues.

  was:
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some iddle scan handlers with NO scan queues.


> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
> 150 * 0.5 * 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
> queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Priority: Major  (was: Minor)

> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> This issue tries to avoid leaving some scan handlers with NO scan queues.
> For example, if we set hbase.regionserver.handler.count=150, 
> hbase.ipc.server.callqueue.scan.ratio=0.1, 
> hbase.ipc.server.callqueue.read.ratio=0.5, 
> hbase.ipc.server.callqueue.handler.factor=0.1, then there will be 
> 150 * 0.5 * 0.1 = 7 scan handlers, but there are 150 * 0.1 * 0.5 * 0.1 = 0 scan RPC 
> queues.
> When there are no scan rpc queues, all the scan and get requests will be 
> dispatched to the read rpc queues, while we thought they had been dealt 
> with separately, since the scan handler count is not 0. When there are not 
> enough handlers for large scan requests under this circumstance, the small 
> get requests will be blocked in the rpc queues.
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27646) Should not use pread when prefetching in HFilePreadReader

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-27646.

Resolution: Fixed

Merged to branch-2+ and master, thanks [~bbeaudreault] for reviewing.

> Should not use pread when prefetching in HFilePreadReader
> -
>
> Key: HBASE-27646
> URL: https://issues.apache.org/jira/browse/HBASE-27646
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
>
> Since prefetchOnOpen reads through the whole hfile, we should use stream read 
> for it.
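
For context, this sketch shows the difference between the two read modes using the 
plain Hadoop FileSystem API (illustrative only, not the HFilePreadReader change 
itself): pread passes an explicit position on every call and suits scattered point 
lookups, while a prefetch that walks the whole file is better served by one 
sequential stream.
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadModesSketch {
  // Positional read: each call carries its own offset; good for random point reads.
  static int pread(FSDataInputStream in, long pos, byte[] buf) throws IOException {
    return in.read(pos, buf, 0, buf.length);
  }

  // Stream read: seek once, then read sequentially; good for reading the whole file.
  static void streamReadAll(FileSystem fs, Path file, byte[] buf) throws IOException {
    try (FSDataInputStream in = fs.open(file)) {
      in.seek(0);
      int n;
      while ((n = in.read(buf)) > 0) {
        // a prefetch would hand these n bytes to the block cache here
      }
    }
  }
}
{code}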



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27646) Should not use pread when prefetching in HFilePreadReader

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27646:
---
Fix Version/s: 2.6.0
   3.0.0-alpha-4
   2.4.17
   2.5.4

> Should not use pread when prefetching in HFilePreadReader
> -
>
> Key: HBASE-27646
> URL: https://issues.apache.org/jira/browse/HBASE-27646
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> Since prefetchOnOpen reads through the whole hfile, we should use stream read 
> for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27646) Should not use pread when prefetching in HFilePreadReader

2023-03-20 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27646:
---
Affects Version/s: 2.5.3

> Should not use pread when prefetching in HFilePreadReader
> -
>
> Key: HBASE-27646
> URL: https://issues.apache.org/jira/browse/HBASE-27646
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
>
> Since prefetchOnOpen reads through the whole hfile, we should use stream read 
> for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-27717) Add rsgroup name for dead region servers on master UI

2023-03-17 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha reassigned HBASE-27717:
--

Assignee: Xiaolin Ha

> Add rsgroup name for dead region servers on master UI
> -
>
> Key: HBASE-27717
> URL: https://issues.apache.org/jira/browse/HBASE-27717
> Project: HBase
>  Issue Type: Improvement
>  Components: UI
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
>
> We also want to know the rsgroup name of dead region servers, which are 
> shown in the `Dead Region Servers` section of the master UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27717) Add rsgroup name for dead region servers on master UI

2023-03-14 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27717:
--

 Summary: Add rsgroup name for dead region servers on master UI
 Key: HBASE-27717
 URL: https://issues.apache.org/jira/browse/HBASE-27717
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 2.5.3
Reporter: Xiaolin Ha


We also want to know the rsgroup name of dead region servers, which are shown 
in the `Dead Region Servers` section of the master UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27615) Add group of regionServer on Master webUI

2023-03-14 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-27615.

Fix Version/s: 2.6.0
   3.0.0-alpha-4
   Resolution: Fixed

Merged to branch-2 and master, thanks [~tangtianhang] for contributing.

> Add group of regionServer on Master webUI
> -
>
> Key: HBASE-27615
> URL: https://issues.apache.org/jira/browse/HBASE-27615
> Project: HBase
>  Issue Type: Improvement
>Reporter: tianhang tang
>Assignee: tianhang tang
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
> Attachments: image-2023-02-06-12-04-03-503.png
>
>
> We do have a RSGroupList on webUI now, but it is still a little inconvenient 
> if I just want to know which group a specific regionServer belongs to.
> So add this info on webUI:
> !image-2023-02-06-12-04-03-503.png|width=889,height=174!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27615) Add group of regionServer on Master webUI

2023-03-09 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698276#comment-17698276
 ] 

Xiaolin Ha commented on HBASE-27615:


Nice work. Please propose PRs for branch-2 and branch-2.4+.

> Add group of regionServer on Master webUI
> -
>
> Key: HBASE-27615
> URL: https://issues.apache.org/jira/browse/HBASE-27615
> Project: HBase
>  Issue Type: Improvement
>Reporter: tianhang tang
>Assignee: tianhang tang
>Priority: Major
> Attachments: image-2023-02-06-12-04-03-503.png
>
>
> We do have a RSGroupList on webUI now, but it is still a little inconvenient 
> if I just want to know which group a specific regionServer belongs to.
> So add this info on webUI:
> !image-2023-02-06-12-04-03-503.png|width=889,height=174!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-25709) Close region may stuck when region is compacting and skipped most cells read

2023-03-06 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696708#comment-17696708
 ] 

Xiaolin Ha edited comment on HBASE-25709 at 3/6/23 10:05 AM:
-

Merged to branch-2 and master, thanks [~bbeaudreault] for reviewing, and thanks for 
all the feedback!


was (Author: xiaolin ha):
Merged to branch-2.5+, thanks [~bbeaudreault] for reviewing, and thanks all the 
feedbacks!

> Close region may stuck when region is compacting and skipped most cells read
> 
>
> Key: HBASE-25709
> URL: https://issues.apache.org/jira/browse/HBASE-25709
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
> Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> We found a close-region request stuck in our cluster. The region was compacting, 
> and its store files had many TTL-expired cells. The close-region state 
> marker (HRegion#writestate.writesEnabled) is not checked during compaction, 
> because most cells were skipped. 
> !RS-region-state.png|width=698,height=310!
>  
> !Master-UI-RIT.png|width=693,height=157!
>  
> HBASE-23968 encountered a similar problem, but its solution is outside the method
> InternalScanner#next(List result, ScannerContext scannerContext), which 
> will not return for a long time if many cells are skipped under the current 
> compaction scanner context. As a result, we need to return from the next method 
> in time, and then check the stop marker.
>  
>  
>  
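
A hedged sketch of the idea only (not the committed patch; the interval and names are 
made up): even when every cell is skipped, the compaction loop should periodically 
check the close marker so a close-pending region cannot spin inside next() 
indefinitely.
{code:java}
import java.io.IOException;
import java.util.Iterator;
import java.util.function.BooleanSupplier;

public class CompactionStopCheckSketch {
  private static final int CHECK_INTERVAL = 10_000; // hypothetical: cells between checks

  static void compact(Iterator<Object> cells, BooleanSupplier writesEnabled) throws IOException {
    int seen = 0;
    while (cells.hasNext()) {
      Object cell = cells.next();
      // ... surviving cells would be written to the new store file here;
      // TTL-expired cells are simply skipped ...
      if (++seen % CHECK_INTERVAL == 0 && !writesEnabled.getAsBoolean()) {
        // The region is closing: give up promptly instead of finishing the compaction.
        throw new IOException("Aborting compaction because the region is closing");
      }
    }
  }
}
{code}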



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-25709) Close region may stuck when region is compacting and skipped most cells read

2023-03-06 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-25709:
---
Fix Version/s: (was: 2.5.4)

> Close region may stuck when region is compacting and skipped most cells read
> 
>
> Key: HBASE-25709
> URL: https://issues.apache.org/jira/browse/HBASE-25709
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
> Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> We found a close-region request stuck in our cluster. The region was compacting, 
> and its store files had many TTL-expired cells. The close-region state 
> marker (HRegion#writestate.writesEnabled) is not checked during compaction, 
> because most cells were skipped. 
> !RS-region-state.png|width=698,height=310!
>  
> !Master-UI-RIT.png|width=693,height=157!
>  
> HBASE-23968 encountered a similar problem, but its solution is outside the method
> InternalScanner#next(List result, ScannerContext scannerContext), which 
> will not return for a long time if many cells are skipped under the current 
> compaction scanner context. As a result, we need to return from the next method 
> in time, and then check the stop marker.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-25709) Close region may stuck when region is compacting and skipped most cells read

2023-03-05 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-25709.

Release Note:   (was: Both compacting scanners and user scanners should 
return promptly, when  there are many skipped cells.)
  Resolution: Fixed

Merged to branch-2.5+, thanks [~bbeaudreault] for reviewing, and thanks for all the 
feedback!

> Close region may stuck when region is compacting and skipped most cells read
> 
>
> Key: HBASE-25709
> URL: https://issues.apache.org/jira/browse/HBASE-25709
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> We found a close-region request stuck in our cluster. The region was compacting, 
> and its store files had many TTL-expired cells. The close-region state 
> marker (HRegion#writestate.writesEnabled) is not checked during compaction, 
> because most cells were skipped. 
> !RS-region-state.png|width=698,height=310!
>  
> !Master-UI-RIT.png|width=693,height=157!
>  
> HBASE-23968 encountered a similar problem, but its solution is outside the method
> InternalScanner#next(List result, ScannerContext scannerContext), which 
> will not return for a long time if many cells are skipped under the current 
> compaction scanner context. As a result, we need to return from the next method 
> in time, and then check the stop marker.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-25709) Close region may stuck when region is compacting and skipped most cells read

2023-03-05 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-25709:
---
Fix Version/s: (was: 2.4.11)

> Close region may stuck when region is compacting and skipped most cells read
> 
>
> Key: HBASE-25709
> URL: https://issues.apache.org/jira/browse/HBASE-25709
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> We found a close-region request stuck in our cluster. The region was compacting, 
> and its store files had many TTL-expired cells. The close-region state 
> marker (HRegion#writestate.writesEnabled) is not checked during compaction, 
> because most cells were skipped. 
> !RS-region-state.png|width=698,height=310!
>  
> !Master-UI-RIT.png|width=693,height=157!
>  
> HBASE-23968 encountered a similar problem, but its solution is outside the method
> InternalScanner#next(List result, ScannerContext scannerContext), which 
> will not return for a long time if many cells are skipped under the current 
> compaction scanner context. As a result, we need to return from the next method 
> in time, and then check the stop marker.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27672) Read RPC threads may BLOCKED at the Configuration.get when using java compression

2023-03-05 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-27672.

Resolution: Fixed

Merged to branch-2.5+, thanks [~bbeaudreault] for reviewing!

> Read RPC threads may BLOCKED at the Configuration.get when using java 
> compression
> -
>
> Key: HBASE-27672
> URL: https://issues.apache.org/jira/browse/HBASE-27672
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: image-2023-02-27-19-22-52-704.png
>
>
> As shown in the jstack info, we can see some RPC threads or compaction threads 
> BLOCKED,
> !image-2023-02-27-19-22-52-704.png|width=976,height=355!
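
One common mitigation for this pattern, shown here only as a hedged sketch (the 
config key is hypothetical and this is not necessarily the committed patch), is to 
resolve the value once instead of calling Configuration.get on every compression 
operation, since Configuration's property lookup synchronizes on shared state:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class CachedCompressionSettings {
  private final int bufferSize;

  public CachedCompressionSettings(Configuration conf) {
    // Resolve once, outside the per-call hot path.
    this.bufferSize = conf.getInt("example.compression.buffer.size", 64 * 1024);
  }

  public int bufferSize() {
    return bufferSize; // plain field read, no Configuration lock involved
  }
}
{code}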



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27672) Read RPC threads may BLOCKED at the Configuration.get when using java compression

2023-03-05 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27672:
---
Fix Version/s: 2.6.0
   3.0.0-alpha-4
   2.5.4

> Read RPC threads may BLOCKED at the Configuration.get when using java 
> compression
> -
>
> Key: HBASE-27672
> URL: https://issues.apache.org/jira/browse/HBASE-27672
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: image-2023-02-27-19-22-52-704.png
>
>
> As shown in the jstack info, we can see some RPC threads or compaction threads 
> BLOCKED,
> !image-2023-02-27-19-22-52-704.png|width=976,height=355!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27683) Should support single call queue mode for RPC handlers while separating by request type

2023-03-02 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27683:
---
Description: 
Currently we not only separate call queues by request type, e.g. read, write, 
scan, but also split queues among handlers via the config 
`hbase.ipc.server.callqueue.handler.factor`, whose description is as follows,
{code:java}
Factor to determine the number of call queues.
  A value of 0 means a single queue shared between all the handlers.
  A value of 1 means that each handler has its own queue. {code}
But what we want is not only one queue for all the requests, or one queue per 
handler. We also want a mode where each request type has its own queue.

Splitting queues within the same request type can leave some handlers too 
idle and some handlers too busy under the current balanced/random RPC executor 
framework. In the extreme case where each handler has its own queue, if a large 
request arrives for a handler, then, because the executor dispatches calls without 
considering the queue size or the state of the handler, the requests that arrive 
afterwards stay queued until the handler completes the large slow request. 
Other handlers may process small requests quickly, but they cannot help 
or grab calls from the busy queue; they must wait for jobs in their own queues. 
Then we can see that the queue time of some requests is long even though there 
are idle handlers.

We can also see cases where the queue time of calls is much larger 
than the process time, sometimes twice or more. Restarting the slow RS makes 
these problems disappear. 

By using a single call queue for each request type, we can fully use the handler 
resources.

  was:
Currently we not only seperate call queues by request type, e.g. read, write, 
scan, but also distinguish queues for handlers by the config 
`hbase.ipc.server.callqueue.handler.factor`, whose description is as follows,
{code:java}
Factor to determine the number of call queues.
  A value of 0 means a single queue shared between all the handlers.
  A value of 1 means that each handler has its own queue. {code}
But I think what we want is not only one queue for all the requests, or each 
handler has its own queue. We also want each request type has one queue.

Distinguish queues in the same type of requests will make some handlers too 
iddle but some handlers too busy under current balanced/random RPC executor 
framework. For the extrem case, each handler has its own queue, then if a large 
request comes for a handler, duing to he executor dispath calls without 
considering the queue size or the state of the handler, the afterwards coming 
requests will be queued until the handler complete the large slow request. 
While other handlers may process small requests quickly, but they can not help 
or grab calls from the busy queue, they must stay and wait it own queue jobs 
coming. Then we can see the queue time of some requests are long but there are 
iddle handlers.

We can also see these circumstances, that the queue time of calls is too larger 
than the process time, sometimes twice or more. Restarting the slow RS will 
make these problems disappear. 

By using single call queue for each request type, we can fully use the handler 
resources.


> Should support single call queue mode for RPC handlers while separating by 
> request type
> ---
>
> Key: HBASE-27683
> URL: https://issues.apache.org/jira/browse/HBASE-27683
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance, rpc
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> Currently we not only separate call queues by request type, e.g. read, write, 
> scan, but also distinguish queues for handlers by the config 
> `hbase.ipc.server.callqueue.handler.factor`, whose description is as follows,
> {code:java}
> Factor to determine the number of call queues.
>   A value of 0 means a single queue shared between all the handlers.
>   A value of 1 means that each handler has its own queue. {code}
> But what we want is not only one queue for all the requests, or one queue per 
> handler. We also want a mode where each request type has its own queue.
> Splitting queues within the same request type can leave some handlers 
> too idle and some handlers too busy under the current balanced/random RPC 
> executor framework. In the extreme case where each handler has its own queue, 
> if a large request arrives for a handler, then, because the executor dispatches calls 
> without considering the queue size or the state of the handler, the 
> requests that arrive afterwards stay queued until the handler completes the 
> large slow request. While other handlers may process small requests 

[jira] [Created] (HBASE-27683) Should support single call queue mode for RPC handlers while separating by request type

2023-03-02 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27683:
--

 Summary: Should support single call queue mode for RPC handlers 
while separating by request type
 Key: HBASE-27683
 URL: https://issues.apache.org/jira/browse/HBASE-27683
 Project: HBase
  Issue Type: Improvement
  Components: Performance, rpc
Affects Versions: 2.5.3
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


Currently we not only separate call queues by request type, e.g. read, write, 
scan, but also distinguish queues for handlers by the config 
`hbase.ipc.server.callqueue.handler.factor`, whose description is as follows,
{code:java}
Factor to determine the number of call queues.
  A value of 0 means a single queue shared between all the handlers.
  A value of 1 means that each handler has its own queue. {code}
But I think what we want is not only one queue for all the requests, or each 
handler having its own queue. We also want each request type to have one queue.

Distinguishing queues within the same type of requests will make some handlers too 
idle and some handlers too busy under the current balanced/random RPC executor 
framework. In the extreme case where each handler has its own queue, if a large 
request arrives for a handler, then because the executor dispatches calls without 
considering the queue size or the state of the handler, the subsequent requests 
will be queued until the handler completes the large slow request. Meanwhile other 
handlers may process small requests quickly, but they cannot help or grab calls 
from the busy queue; they must wait for the jobs in their own queues. Then we can 
see that the queue time of some requests is long even though there are idle 
handlers.

We can also see circumstances where the queue time of calls is much larger 
than the process time, sometimes twice or more. Restarting the slow RS will 
make these problems disappear. 

By using a single call queue for each request type, we can fully utilize the 
handler resources.
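
For illustration, a minimal, self-contained sketch of the single-queue-per-type idea (this is not HBase code; the class and names are made up): all read handlers poll one shared read queue, so a single slow call only occupies one handler instead of blocking the calls queued behind it in a per-handler queue.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch only: one shared queue per request type, polled by all handlers of that type.
public class SingleQueuePerTypeSketch {
  private final BlockingQueue<Runnable> readQueue = new LinkedBlockingQueue<>();

  public void startReadHandlers(int handlerCount) {
    for (int i = 0; i < handlerCount; i++) {
      Thread handler = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            // Any idle handler grabs the next call, so one slow call cannot
            // starve the calls queued behind it.
            Runnable call = readQueue.poll(100, TimeUnit.MILLISECONDS);
            if (call != null) {
              call.run();
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }, "read-handler-" + i);
      handler.setDaemon(true);
      handler.start();
    }
  }

  public void dispatchRead(Runnable call) {
    readQueue.add(call);
  }
}
{code}
The write and scan types would each get the same layout: one queue, many handlers.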



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-03-02 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27676:
---
Description: 
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some idle scan handlers with NO scan queues.

  was:
{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some active scan handlers with NO scan queues.


> Scan handlers in the RPC executor should match at least one scan queues
> ---
>
> Key: HBASE-27676
> URL: https://issues.apache.org/jira/browse/HBASE-27676
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
>
> {code:java}
> int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
> int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * 
> callqScanShare));
> if ((readQueues - scanQueues) > 0) {
>   readQueues -= scanQueues;
>   readHandlers -= scanHandlers;
> } else {
>   scanQueues = 0;
>   scanHandlers = 0;
> } {code}
> When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
> there will be some idle scan handlers with NO scan queues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27682) Metrics for RPC handlers should provide more info, e.g. avg,max

2023-03-02 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27682:
---
Summary: Metrics for RPC handlers should provide more info, e.g. avg,max  
(was: Metrics for RPC handlers should provide more info, e.g. avg/max)

> Metrics for RPC handlers should provide more info, e.g. avg,max
> ---
>
> Key: HBASE-27682
> URL: https://issues.apache.org/jira/browse/HBASE-27682
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics, rpc
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Priority: Minor
>
> We need the distribution info of the handler count to address performance 
> problems, e.g. the maximum value and the average value of the handlers in use. 
> It is extremely important for systems serving low-latency requests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27682) Metrics for RPC handlers should provide more info, e.g. avg/max

2023-03-02 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27682:
---
Summary: Metrics for RPC handlers should provide more info, e.g. avg/max  
(was: Metrics for RPC handlers should provide the distribution info)

> Metrics for RPC handlers should provide more info, e.g. avg/max
> ---
>
> Key: HBASE-27682
> URL: https://issues.apache.org/jira/browse/HBASE-27682
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics, rpc
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Priority: Minor
>
> We need the distribution info of the handler count to address performance 
> problems, e.g. the maximum value and the average value of the handlers in use. 
> It is extremely important for systems serving low-latency requests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27682) Metrics for RPC handlers should provide the distribution info

2023-03-02 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27682:
--

 Summary: Metrics for RPC handlers should provide the distribution 
info
 Key: HBASE-27682
 URL: https://issues.apache.org/jira/browse/HBASE-27682
 Project: HBase
  Issue Type: Improvement
  Components: metrics, rpc
Affects Versions: 2.5.3
Reporter: Xiaolin Ha


We need the distribution info of the handler count to address performance 
problems, e.g. the maximum value and the average value of the handlers in use. 

It is extremely important for systems serving low-latency requests.
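
As a rough illustration only (this is not the HBase metrics API; all names here are hypothetical), the kind of max/average tracking meant above could look like:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: sample the active handler count on every start/finish and keep
// the max and average over the reporting window.
public class HandlerUsageStatsSketch {
  private final AtomicInteger activeHandlers = new AtomicInteger();
  private final AtomicLong sampleSum = new AtomicLong();
  private final AtomicLong sampleCount = new AtomicLong();
  private volatile int max;

  public void handlerStarted() { record(activeHandlers.incrementAndGet()); }
  public void handlerFinished() { record(activeHandlers.decrementAndGet()); }

  private void record(int current) {
    sampleSum.addAndGet(current);
    sampleCount.incrementAndGet();
    if (current > max) {
      max = current; // benign race, good enough for monitoring
    }
  }

  public double average() {
    long n = sampleCount.get();
    return n == 0 ? 0 : (double) sampleSum.get() / n;
  }

  public int max() {
    return max;
  }
}
{code}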



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27458) Use ReadWriteLock for region scanner readpoint map

2023-03-01 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-27458.

Resolution: Fixed

Merged to master and branch-2+, thanks [~frostruan] for contributing, and 
thanks [~zhangduo] for reviewing.

> Use ReadWriteLock for region scanner readpoint map 
> ---
>
> Key: HBASE-27458
> URL: https://issues.apache.org/jira/browse/HBASE-27458
> Project: HBase
>  Issue Type: Improvement
>  Components: Scanners
>Affects Versions: 3.0.0-alpha-3
>Reporter: ruanhui
>Assignee: ruanhui
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
> Attachments: jstack-2.png
>
>
> Currently we manage the concurrency between the RegionScanner and 
> getSmallestReadPoint by synchronizing on the scannerReadPoints object. In our 
> production, we find that many read threads are blocked by this when we have a 
> heavy read load. 
> we need to get smallest read point when 
> a. flush a memstore 
> b. compact memstore/storefile 
> c. do delta operation like increment/append
> Usually the frequency of these operations is much less than read requests. 
> It's a little expensive to use an exclusive lock here because for region 
> scanners, what they need to do is just calculating the readpoint and putting the 
> readpoint in the scanner readpoint map, which is thread-safe. Multiple read 
> threads can do this in parallel without synchronization.
> Based on the above consideration, maybe we can replace the synchronized lock 
> with readwrite lock. It will help improve the read performance if the 
> bottleneck is on the synchronization here.
> !jstack.png!
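
A toy sketch of the idea (not the actual HRegion/RegionScanner code): scanners register their readpoints under the read lock, while the rare callers that need a consistent smallest readpoint (flush, compaction, increment/append) take the write lock.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadPointMapSketch {
  private final ConcurrentHashMap<Long, Long> scannerReadPoints = new ConcurrentHashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Many scanners can register concurrently; the map itself is thread-safe.
  public void registerScanner(long scannerId, long readPoint) {
    lock.readLock().lock();
    try {
      scannerReadPoints.put(scannerId, readPoint);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Rare callers take the exclusive lock so no new readpoint can slip in
  // while they compute the minimum.
  public long smallestReadPoint(long currentReadPoint) {
    lock.writeLock().lock();
    try {
      long min = currentReadPoint;
      for (long rp : scannerReadPoints.values()) {
        min = Math.min(min, rp);
      }
      return min;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}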



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27458) Use ReadWriteLock for region scanner readpoint map

2023-03-01 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27458:
---
Fix Version/s: 2.6.0
   2.4.17
   2.5.4

> Use ReadWriteLock for region scanner readpoint map 
> ---
>
> Key: HBASE-27458
> URL: https://issues.apache.org/jira/browse/HBASE-27458
> Project: HBase
>  Issue Type: Improvement
>  Components: Scanners
>Affects Versions: 3.0.0-alpha-3
>Reporter: ruanhui
>Assignee: ruanhui
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
> Attachments: jstack-2.png
>
>
> Currently we manage the concurrency between the RegionScanner and 
> getSmallestReadPoint by synchronizing on the scannerReadPoints object. In our 
> production, we find that many read threads are blocked by this when we have a 
> heavy read load. 
> we need to get smallest read point when 
> a. flush a memstore 
> b. compact memstore/storefile 
> c. do delta operation like increment/append
> Usually the frequency of these operations is much less than read requests. 
> It's a little expensive to use an exclusive lock here because for region 
> scanners, what they need to do is just calculating the readpoint and putting the 
> readpoint in the scanner readpoint map, which is thread-safe. Multiple read 
> threads can do this in parallel without synchronization.
> Based on the above consideration, maybe we can replace the synchronized lock 
> with readwrite lock. It will help improve the read performance if the 
> bottleneck is on the synchronization here.
> !jstack.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27676) Scan handlers in the RPC executor should match at least one scan queues

2023-02-28 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27676:
--

 Summary: Scan handlers in the RPC executor should match at least 
one scan queues
 Key: HBASE-27676
 URL: https://issues.apache.org/jira/browse/HBASE-27676
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.3
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


{code:java}
int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

if ((readQueues - scanQueues) > 0) {
  readQueues -= scanQueues;
  readHandlers -= scanHandlers;
} else {
  scanQueues = 0;
  scanHandlers = 0;
} {code}
When readQueues * callqScanShare < 1 but readHandlers * callqScanShare > 1, 
there will be some active scan handlers with NO scan queues.
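
A sketch of one possible adjustment (an assumption for illustration, not necessarily the committed fix): whenever some read handlers are reserved for scans, back them with at least one scan queue.
{code:java}
// Sketch only: hypothetical allocation with a guard against
// scanHandlers > 0 while scanQueues == 0.
public class ScanQueueAllocationSketch {
  static int[] allocate(int readQueues, int readHandlers, float callqScanShare) {
    int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
    int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));
    if (scanHandlers > 0 && scanQueues == 0) {
      scanQueues = 1; // give the reserved scan handlers one queue to poll
    }
    if ((readQueues - scanQueues) > 0) {
      readQueues -= scanQueues;
      readHandlers -= scanHandlers;
    } else {
      scanQueues = 0;
      scanHandlers = 0;
    }
    return new int[] { readQueues, readHandlers, scanQueues, scanHandlers };
  }

  public static void main(String[] args) {
    // readQueues=2, readHandlers=20, scanShare=0.1: without the guard the two
    // scan handlers would be left with no scan queue at all.
    int[] r = allocate(2, 20, 0.1f);
    System.out.println("readQueues=" + r[0] + " readHandlers=" + r[1]
      + " scanQueues=" + r[2] + " scanHandlers=" + r[3]);
  }
}
{code}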



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27672) Read RPC threads may BLOCKED at the Configuration.get when using java compression

2023-02-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27672:
---
Summary: Read RPC threads may BLOCKED at the Configuration.get when using 
java compression  (was: Read RPC threads may BLOCK at the Configuration.get 
when using java compression)

> Read RPC threads may BLOCKED at the Configuration.get when using java 
> compression
> -
>
> Key: HBASE-27672
> URL: https://issues.apache.org/jira/browse/HBASE-27672
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Attachments: image-2023-02-27-19-22-52-704.png
>
>
> As in the jstack info, we can see that some RPC threads or compaction threads 
> are BLOCKED,
> !image-2023-02-27-19-22-52-704.png|width=976,height=355!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27647) Fix the cached block differ (which should not have happened) issue with cacheDataOnWrite on

2023-02-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27647:
---
Description: 
When caching a block read from an hfile, the cached block will contain a 
header with the next block's info.

But when caching a block on write, with cacheDataOnWrite on, the block is 
cached without the header of the next block.

Then when comparing the old and new blocks in 
BlockCacheUtil#validateBlockAddition, it will print error logs,
{code:java}
hfile.BlockCacheUtil: Cached block contents differ, which should not have 
happened. cacheKey:XXX {code}
This will harm the actual replacement of the cached block, since we should 
still decide whether to replace when the cached block contents differ only by 
nextBlockOnDiskSize. 

  was:
When caching a block read from an hfile, the cached block will contain a 
header with the next block's info.

But when caching a block on write, with cacheDataOnWrite on, the block is 
cached without the header of the next block.

Then when comparing the old and new blocks in 
BlockCacheUtil#validateBlockAddition, it will print error logs,
{code:java}
hfile.BlockCacheUtil: Cached block contents differ, which should not have 
happened. cacheKey:XXX {code}
This will harm the actual replacement of the cached block, since we should 
still decide whether to replace when the cached block contents differ only by 
nextBlockOnDiskSize. 


> Fix the cached block differ (which should not have happened) issue with 
> cacheDataOnWrite on
> ---
>
> Key: HBASE-27647
> URL: https://issues.apache.org/jira/browse/HBASE-27647
> Project: HBase
>  Issue Type: Bug
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> When caching a block read from an hfile, the cached block will contain 
> a header with the next block's info.
> But when caching a block on write, with cacheDataOnWrite on, the block is 
> cached without the header of the next block.
> Then when comparing the old and new blocks in 
> BlockCacheUtil#validateBlockAddition, it will print error logs,
> {code:java}
> hfile.BlockCacheUtil: Cached block contents differ, which should not have 
> happened. cacheKey:XXX {code}
> This will harm the actual replacement of the cached block, since we should 
> still decide whether to replace when the cached block contents differ only by 
> nextBlockOnDiskSize. 
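
One possible comparison approach, sketched here purely as an assumption (this is not BlockCacheUtil's actual logic, and the header size below is illustrative): treat two cached blocks as equal when they only differ by the optional trailing header of the next block.
{code:java}
import java.util.Arrays;

// Sketch only: compare two serialized blocks while ignoring a trailing
// next-block header that only one of them may carry.
public class BlockCompareSketch {
  // Hypothetical size of the extra header of the following block.
  static final int NEXT_BLOCK_HEADER_SIZE = 33;

  static boolean sameBlockContent(byte[] existing, byte[] candidate) {
    int diff = Math.abs(existing.length - candidate.length);
    if (diff != 0 && diff != NEXT_BLOCK_HEADER_SIZE) {
      return false; // lengths differ by more than the optional header
    }
    int len = Math.min(existing.length, candidate.length);
    return Arrays.equals(Arrays.copyOf(existing, len), Arrays.copyOf(candidate, len));
  }
}
{code}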



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27647) Fix the cached block differ (which should not have happened) issue with cacheDataOnWrite on

2023-02-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27647:
---
Affects Version/s: 2.5.3

> Fix the cached block differ (which should not have happened) issue with 
> cacheDataOnWrite on
> ---
>
> Key: HBASE-27647
> URL: https://issues.apache.org/jira/browse/HBASE-27647
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> When caching a block read from an hfile, the cached block will contain 
> a header with the next block's info.
> But when caching a block on write, with cacheDataOnWrite on, the block is 
> cached without the header of the next block.
> Then when comparing the old and new blocks in 
> BlockCacheUtil#validateBlockAddition, it will print error logs,
> {code:java}
> hfile.BlockCacheUtil: Cached block contents differ, which should not have 
> happened. cacheKey:XXX {code}
> This will harm the actual replacement of the cached block, since we should 
> still decide whether to replace when the cached block contents differ only by 
> nextBlockOnDiskSize. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27672) Read RPC threads may BLOCK at the Configuration.get when using java compression

2023-02-27 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27672:
---
Description: 
As in the jstack info, we can see that some RPC threads or compaction threads are BLOCKED,

!image-2023-02-27-19-22-52-704.png|width=976,height=355!

  was:
As in the jstack info, we can see that some RPC threads or compaction threads are BLOCKED,

!image-2023-02-27-19-22-52-704.png!


> Read RPC threads may BLOCK at the Configuration.get when using java 
> compression
> ---
>
> Key: HBASE-27672
> URL: https://issues.apache.org/jira/browse/HBASE-27672
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.3
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Attachments: image-2023-02-27-19-22-52-704.png
>
>
> As in the jstack info, we can see that some RPC threads or compaction threads 
> are BLOCKED,
> !image-2023-02-27-19-22-52-704.png|width=976,height=355!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27672) Read RPC threads may BLOCK at the Configuration.get when using java compression

2023-02-27 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27672:
--

 Summary: Read RPC threads may BLOCK at the Configuration.get when 
using java compression
 Key: HBASE-27672
 URL: https://issues.apache.org/jira/browse/HBASE-27672
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.3
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha
 Attachments: image-2023-02-27-19-22-52-704.png

As in the jstack info, we can see that some RPC threads or compaction threads are BLOCKED,

!image-2023-02-27-19-22-52-704.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27647) Fix the cached block differ (which should not have happened) issue with cacheDataOnWrite on

2023-02-16 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27647:
---
Description: 
When caching a block read from an hfile, the cached block will contain a 
header with the next block's info.

But when caching a block on write, with cacheDataOnWrite on, the block is 
cached without the header of the next block.

Then when comparing the old and new blocks in 
BlockCacheUtil#validateBlockAddition, it will print error logs,
{code:java}
hfile.BlockCacheUtil: Cached block contents differ, which should not have 
happened. cacheKey:XXX {code}
This will harm the actual replacement of the cached block, since we should 
still decide whether to replace when the cached block contents differ only by 
nextBlockOnDiskSize. 

  was:
When caching a block read from an hfile, the cached block will contain a 
header with the next block's info.

But when caching a block on write, with cacheDataOnWrite on, the block is 
cached without the header of the next block.

Then when comparing the old and new blocks in 
BlockCacheUtil#validateBlockAddition, it will print error logs,
{code:java}
hfile.BlockCacheUtil: Cached block contents differ, which should not have 
happened. cacheKey:XXX {code}
This will harm the actual replacement of the cached block, since we should 
still decide whether to replace when the cached block contents differ only by 
nextBlockOnDiskSize. 


> Fix the cached block differ (which should not have happened) issue with 
> cacheDataOnWrite on
> ---
>
> Key: HBASE-27647
> URL: https://issues.apache.org/jira/browse/HBASE-27647
> Project: HBase
>  Issue Type: Bug
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> When caching a block read from an hfile, the cached block will contain 
> a header with the next block's info.
> But when caching a block on write, with cacheDataOnWrite on, the block is 
> cached without the header of the next block.
> Then when comparing the old and new blocks in 
> BlockCacheUtil#validateBlockAddition, it will print error logs,
> {code:java}
> hfile.BlockCacheUtil: Cached block contents differ, which should not have 
> happened. cacheKey:XXX {code}
> This will harm the actual replacement of the cached block, since we should 
> still decide whether to replace when the cached block contents differ only by 
> nextBlockOnDiskSize. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27647) Fix the cached block differ (which should not have happened) issue with cacheDataOnWrite on

2023-02-16 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27647:
--

 Summary: Fix the cached block differ (which should not have 
happened) issue with cacheDataOnWrite on
 Key: HBASE-27647
 URL: https://issues.apache.org/jira/browse/HBASE-27647
 Project: HBase
  Issue Type: Bug
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


When caching a block read from an hfile, the cached block will contain a 
header with the next block's info.

But when caching a block on write, with cacheDataOnWrite on, the block is 
cached without the header of the next block.

Then when comparing the old and new blocks in 
BlockCacheUtil#validateBlockAddition, it will print error logs,
{code:java}
hfile.BlockCacheUtil: Cached block contents differ, which should not have 
happened. cacheKey:XXX {code}
This will harm the actual replacement of the cached block, since we should 
still decide whether to replace when the cached block contents differ only by 
nextBlockOnDiskSize. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27646) Should not use pread when prefetching in HFilePreadReader

2023-02-16 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27646:
--

 Summary: Should not use pread when prefetching in HFilePreadReader
 Key: HBASE-27646
 URL: https://issues.apache.org/jira/browse/HBASE-27646
 Project: HBase
  Issue Type: Improvement
  Components: Performance
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


Since prefetchOnOpen reads all the way through the hfile, we should use stream read 
for it.
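
A hedged sketch using the plain Hadoop FileSystem API (not HFilePreadReader itself): since the prefetch walks the whole file sequentially, keep reading from the opened stream instead of issuing a positional read (pread) per block.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialPrefetchSketch {
  public static void prefetch(Path hfile, long fileSize) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[64 * 1024];
    try (FSDataInputStream in = fs.open(hfile)) {
      long remaining = fileSize;
      while (remaining > 0) {
        // Stream read advances the stream position and lets the filesystem do
        // sequential readahead, unlike in.read(pos, buf, 0, len) (pread).
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) {
          break;
        }
        remaining -= n;
      }
    }
  }
}
{code}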



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-21521) Expose master startup status via web UI

2023-02-15 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha resolved HBASE-21521.

Resolution: Fixed

Merged to branch-2+, thanks [~bbeaudreault] for reviewing.

> Expose master startup status via web UI
> ---
>
> Key: HBASE-21521
> URL: https://issues.apache.org/jira/browse/HBASE-21521
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Andrew Kyle Purtell
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
> Attachments: hbase-21521-1.png, hbase-21521-2.png, hbase-21521-3.png, 
> hbase-21521-4.png, hbase-21521-revised-1.png, hbase-21521-revised-2.png
>
>
> Add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.
> Modify the master to bring the web UI up sooner. Will require tweaks to 
> various views to prevent attempts to retrieve state before the master fully 
> up (or else expect NPEs). Currently, before the master has fully initialized 
> an attempt to use the web UI will return a 500 error code and display an 
> error page.
> Finally, update the web UI to display startup progress, like HDFS-4249. 
> Filing this for branch-1. Need to check what if anything is available or 
> improved in branch-2 and master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-21521) Expose master startup status via web UI

2023-02-15 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-21521:
---
Fix Version/s: 2.4.17

> Expose master startup status via web UI
> ---
>
> Key: HBASE-21521
> URL: https://issues.apache.org/jira/browse/HBASE-21521
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Andrew Kyle Purtell
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
> Attachments: hbase-21521-1.png, hbase-21521-2.png, hbase-21521-3.png, 
> hbase-21521-4.png, hbase-21521-revised-1.png, hbase-21521-revised-2.png
>
>
> Add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.
> Modify the master to bring the web UI up sooner. Will require tweaks to 
> various views to prevent attempts to retrieve state before the master fully 
> up (or else expect NPEs). Currently, before the master has fully initialized 
> an attempt to use the web UI will return a 500 error code and display an 
> error page.
> Finally, update the web UI to display startup progress, like HDFS-4249. 
> Filing this for branch-1. Need to check what if anything is available or 
> improved in branch-2 and master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27073) TestReplicationValueCompressedWAL.testMultiplePuts is flaky

2023-02-15 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689058#comment-17689058
 ] 

Xiaolin Ha commented on HBASE-27073:


Why did they all fail at position 65536...

2023-02-15T07:48:13,027 DEBUG 
[RS_REFRESH_PEER-regionserver/jenkins-hbase10:0-0.replicationSource,2.replicationSource.wal-reader.jenkins-hbase10.apache.org%2C34825%2C1676447248661,2]
 wal.ProtobufLogReader(464): Encountered a malformed edit, seeking back to last 
good position in file, from 65538 to 65536

2023-02-02T16:53:34,165 DEBUG 
[RS_REFRESH_PEER-regionserver/zhangduo-VirtualBox:0-0.replicationSource,2.replicationSource.wal-reader.zhangduo-virtualbox%2C33915%2C1675327981383,2]
 wal.ProtobufLogReader(448): Encountered a malformed edit, seeking back to last 
good position in file, from 65558 to 65536 java.io.EOFException: Invalid PB, 
EOF? Ignoring; originalPosition=65536, currentPosition=65558, messageSize=21, 
currentAvailable=434

> TestReplicationValueCompressedWAL.testMultiplePuts is flaky
> ---
>
> Key: HBASE-27073
> URL: https://issues.apache.org/jira/browse/HBASE-27073
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.0
> Environment: Java version: 1.8.0_322
> OS name: "linux", version: "5.10.0-13-arm64", arch: "aarch64", family: "unix"
>Reporter: Andrew Kyle Purtell
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
>
> org.apache.hadoop.hbase.replication.regionserver.TestReplicationValueCompressedWAL.testMultiplePuts
>   
Run 1: TestReplicationValueCompressedWAL.testMultiplePuts:56 Waited too 
> much time for replication
>   Run 2: PASS



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-21521) Expose master startup status via web UI

2023-02-14 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688820#comment-17688820
 ] 

Xiaolin Ha commented on HBASE-21521:


Yes, I'll prepare a PR for branch-2, thanks. [~zhangduo] 

> Expose master startup status via web UI
> ---
>
> Key: HBASE-21521
> URL: https://issues.apache.org/jira/browse/HBASE-21521
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Andrew Kyle Purtell
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: hbase-21521-1.png, hbase-21521-2.png, hbase-21521-3.png, 
> hbase-21521-4.png, hbase-21521-revised-1.png, hbase-21521-revised-2.png
>
>
> Add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.
> Modify the master to bring the web UI up sooner. Will require tweaks to 
> various views to prevent attempts to retrieve state before the master fully 
> up (or else expect NPEs). Currently, before the master has fully initialized 
> an attempt to use the web UI will return a 500 error code and display an 
> error page.
> Finally, update the web UI to display startup progress, like HDFS-4249. 
> Filing this for branch-1. Need to check what if anything is available or 
> improved in branch-2 and master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27642) Expose master startup status via JMX

2023-02-14 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-27642:
---
Parent: HBASE-21521
Issue Type: Sub-task  (was: Improvement)

> Expose master startup status via JMX
> 
>
> Key: HBASE-27642
> URL: https://issues.apache.org/jira/browse/HBASE-27642
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Reporter: Xiaolin Ha
>Priority: Minor
>
> As described in HBASE-21521 by [~apurtell] , 
> add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27642) Expose master startup status via JMX

2023-02-14 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27642:
--

 Summary: Expose master startup status via JMX
 Key: HBASE-27642
 URL: https://issues.apache.org/jira/browse/HBASE-27642
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Xiaolin Ha


As described in HBASE-21521 by [~apurtell] , 

add an internal API to the master for tracking startup progress. Expose this 
information via JMX.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27458) Use ReadWriteLock for region scanner readpoint map

2023-02-14 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688446#comment-17688446
 ] 

Xiaolin Ha commented on HBASE-27458:


[~frostruan] Could you please prepare PRs for branch-2.4+?

> Use ReadWriteLock for region scanner readpoint map 
> ---
>
> Key: HBASE-27458
> URL: https://issues.apache.org/jira/browse/HBASE-27458
> Project: HBase
>  Issue Type: Improvement
>  Components: Scanners
>Affects Versions: 3.0.0-alpha-3
>Reporter: ruanhui
>Assignee: ruanhui
>Priority: Minor
> Fix For: 3.0.0-alpha-4
>
> Attachments: jstack-2.png
>
>
> Currently we manage the concurrency between the RegionScanner and 
> getSmallestReadPoint by synchronizing on the scannerReadPoints object. In our 
> production, we find that many read threads are blocked by this when we have a 
> heavy read load. 
> we need to get smallest read point when 
> a. flush a memstore 
> b. compact memstore/storefile 
> c. do delta operation like increment/append
> Usually the frequency of these operations is much less than read requests. 
> It's a little expensive to use an exclusive lock here because for region 
> scanners, what they need to do is just calculating the readpoint and putting the 
> readpoint in the scanner readpoint map, which is thread-safe. Multiple read 
> threads can do this in parallel without synchronization.
> Based on the above consideration, maybe we can replace the synchronized lock 
> with readwrite lock. It will help improve the read performance if the 
> bottleneck is on the synchronization here.
> !jstack.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-21521) Expose master startup status via web UI

2023-02-13 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688343#comment-17688343
 ] 

Xiaolin Ha commented on HBASE-21521:


Merged the PR for 'Expose master startup status via web UI' to branch-2.4+. 
Thanks [~bbeaudreault] for reviewing. 

[~apurtell] Do we still need to expose the startup information via JMX? If 
so, I'll add a new umbrella issue to cover both the web UI and JMX parts. 
Thanks. 

> Expose master startup status via web UI
> ---
>
> Key: HBASE-21521
> URL: https://issues.apache.org/jira/browse/HBASE-21521
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Andrew Kyle Purtell
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: hbase-21521-1.png, hbase-21521-2.png, hbase-21521-3.png, 
> hbase-21521-4.png, hbase-21521-revised-1.png, hbase-21521-revised-2.png
>
>
> Add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.
> Modify the master to bring the web UI up sooner. Will require tweaks to 
> various views to prevent attempts to retrieve state before the master fully 
> up (or else expect NPEs). Currently, before the master has fully initialized 
> an attempt to use the web UI will return a 500 error code and display an 
> error page.
> Finally, update the web UI to display startup progress, like HDFS-4249. 
> Filing this for branch-1. Need to check what if anything is available or 
> improved in branch-2 and master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-21521) Expose master startup status via web UI

2023-02-13 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-21521:
---
Fix Version/s: 2.5.4

> Expose master startup status via web UI
> ---
>
> Key: HBASE-21521
> URL: https://issues.apache.org/jira/browse/HBASE-21521
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Andrew Kyle Purtell
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: hbase-21521-1.png, hbase-21521-2.png, hbase-21521-3.png, 
> hbase-21521-4.png, hbase-21521-revised-1.png, hbase-21521-revised-2.png
>
>
> Add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.
> Modify the master to bring the web UI up sooner. Will require tweaks to 
> various views to prevent attempts to retrieve state before the master fully 
> up (or else expect NPEs). Currently, before the master has fully initialized 
> an attempt to use the web UI will return a 500 error code and display an 
> error page.
> Finally, update the web UI to display startup progress, like HDFS-4249. 
> Filing this for branch-1. Need to check what if anything is available or 
> improved in branch-2 and master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-21521) Expose master startup status via web UI

2023-02-13 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-21521:
---
Summary: Expose master startup status via web UI  (was: Expose master 
startup status via JMX and web UI)

> Expose master startup status via web UI
> ---
>
> Key: HBASE-21521
> URL: https://issues.apache.org/jira/browse/HBASE-21521
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Andrew Kyle Purtell
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
> Attachments: hbase-21521-1.png, hbase-21521-2.png, hbase-21521-3.png, 
> hbase-21521-4.png, hbase-21521-revised-1.png, hbase-21521-revised-2.png
>
>
> Add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.
> Modify the master to bring the web UI up sooner. Will require tweaks to 
> various views to prevent attempts to retrieve state before the master fully 
> up (or else expect NPEs). Currently, before the master has fully initialized 
> an attempt to use the web UI will return a 500 error code and display an 
> error page.
> Finally, update the web UI to display startup progress, like HDFS-4249. 
> Filing this for branch-1. Need to check what if anything is available or 
> improved in branch-2 and master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27637) Zero length value would cause value compressor read nothing and not advance the position of the InputStream

2023-02-13 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688244#comment-17688244
 ] 

Xiaolin Ha commented on HBASE-27637:


Great! Thank you for your attention to this tricky issue. I think this can 
resolve the issue on our cluster.

> Zero length value would cause value compressor read nothing and not advance 
> the position of the InputStream
> ---
>
> Key: HBASE-27637
> URL: https://issues.apache.org/jira/browse/HBASE-27637
> Project: HBase
>  Issue Type: Bug
>  Components: dataloss, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
>
> This is a code sniff from the discussion of HBASE-27073
> {code}
>   public static void main(String[] args) throws Exception {
> CompressionContext ctx =
>   new CompressionContext(LRUDictionary.class, false, false, true, 
> Compression.Algorithm.GZ);
> ValueCompressor compressor = ctx.getValueCompressor();
> byte[] compressed = compressor.compress(new byte[0], 0, 0);
> System.out.println("compressed length: " + compressed.length);
> ByteArrayInputStream bis = new ByteArrayInputStream(compressed);
> int read = compressor.decompress(bis, compressed.length, new byte[0], 0, 
> 0);
> System.out.println("read length: " + read);
> System.out.println("position: " + (compressed.length - bis.available()));
> {code}
> And the output is
> {noformat}
> compressed length: 20
> read length: 0
> position: 0
> {noformat}
> So it turns out that, when compressing, an empty array will still generate 
> some output bytes but while reading, we will skip reading anything if we find 
> the output length is zero, so next time when we read from the stream, we will 
> start at a wrong position...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27621) Also clear the Dictionary when resetting when reading compressed WAL file

2023-02-10 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687330#comment-17687330
 ] 

Xiaolin Ha commented on HBASE-27621:


Our cluster works well with this patch and value compression disabled; the 
replication stuck issue now seems related to WAL value compression. That would 
explain why the stall happens after many delete-row operations. But checking the 
WAL file using WALPrettyPrinter shows the stop position is in the middle of the 
file, so we should watch out for the data loss issue with WAL value compression. 
Great job, thanks.

> Also clear the Dictionary when resetting when reading compressed WAL file
> -
>
> Key: HBASE-27621
> URL: https://issues.apache.org/jira/browse/HBASE-27621
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> After trying several times, now I can reproduce a critical problem when 
> reading compressed WAL file in replication.
> The problem is about how we construct the LRUDictionary when resetting the 
> WALEntryStream. In the current design, we will not reconstruct the 
> LRUDictionary when resetting, but when reading again, we will call addEntry 
> directly to add 'new' word into the dict, which will mess up the dict and 
> cause data corruption.
> I've implemented a UT to simulate reading a partial WAL entry in replication; 
> with the current code base, after resetting and reading again, we will be stuck 
> there forever.
> -The fix is to always use findEntry when constructing the dict when reading, 
> so we will not mess things up.-
> It turns out that the above solution does not work.
> Another possible fix is to always reconstruct the dict after resetting: we 
> will also clear the dict and reconstruct it again. But it is less efficient 
> as we need to read from the beginning to the position we want to seek to, 
> instead of seek to the position directly, especially when tailing the WAL 
> file which is currently being written.
> And notice that, the UT can only reproduce the problem in local file system, 
> on HDFS, the available method is implemented so if there is not enough data, 
> we will throw EOFException earlier before parsing cells with the compression 
> decoder, so we will not add  duplicated word to dict. But in real world, it 
> is possible that even if there are enough data to read, we could hit an 
> IOException while reading and lead to the same problem described above.
> And while fixing, I also found another problem: in TagCompressionContext 
> and CompressionContext, we use the result of InputStream incorrectly, as we 
> just cast it to byte and test whether it is -1 to determine whether the field 
> is in the dict. The return value of InputStream.read is an int, and it will 
> return -1 if it reaches EOF, but here we will consider it as not in dict... We 
> should throw EOFException instead.
> I'm not sure whether fix this can also fix HBASE-27073 but let's have a try 
> first.
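
A toy illustration of why replaying addEntry corrupts the mapping (this is not HBase's LRUDictionary, just a minimal stand-in): re-adding an already-seen word assigns it a new index, while findEntry keeps the reader's word-to-index mapping in sync with the writer's.
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryReplaySketch {
  private final List<String> entries = new ArrayList<>();
  private final Map<String, Short> indexByWord = new HashMap<>();

  // Blindly appends, even if the word is already present.
  public short addEntry(String word) {
    entries.add(word);
    short idx = (short) (entries.size() - 1);
    indexByWord.put(word, idx);
    return idx;
  }

  // Reuses the existing index when the word is already known.
  public short findEntry(String word) {
    Short idx = indexByWord.get(word);
    return idx != null ? idx : addEntry(word);
  }

  public static void main(String[] args) {
    DictionaryReplaySketch writer = new DictionaryReplaySketch();
    short w = writer.findEntry("region-a");   // writer side: index 0

    DictionaryReplaySketch reader = new DictionaryReplaySketch();
    reader.addEntry("region-a");              // first read: index 0, matches the writer
    short r = reader.addEntry("region-a");    // replay after a reset: index 1, drifts
    System.out.println("writer index=" + w + ", reader index after replay=" + r);
  }
}
{code}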



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27621) Always use findEntry to fill the Dictionary when reading compressed WAL file

2023-02-09 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686382#comment-17686382
 ] 

Xiaolin Ha commented on HBASE-27621:


I think we should also remove moveToHead() at line 
[https://github.com/apache/hbase/blob/master/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/LRUDictionary.java#L154],
 to ensure that compression and decompression construct the same dictionary. 

> Always use findEntry to fill the Dictionary when reading compressed WAL file
> 
>
> Key: HBASE-27621
> URL: https://issues.apache.org/jira/browse/HBASE-27621
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> After trying several times, now I can reproduce a critical problem when 
> reading compressed WAL file in replication.
> The problem is about how we construct the LRUDictionary when resetting the 
> WALEntryStream. In the current design, we will not reconstruct the 
> LRUDictionary when resetting, but when reading again, we will call addEntry 
> directly to add 'new' word into the dict, which will mess up the dict and 
> cause data corruption.
> I've implemented a UT to simulate reading a partial WAL entry in replication; 
> with the current code base, after resetting and reading again, we will be stuck 
> there forever.
> The fix is to always use findEntry when constructing the dict when reading, 
> so we will not mess things up.
> Another possible fix is to always reconstruct the dict after resetting: we 
> will also clear the dict and reconstruct it again. But it is less efficient 
> as we need to read from the beginning to the position we want to seek to, 
> instead of seek to the position directly, especially when tailing the WAL 
> file which is currently being written.
> And notice that, the UT can only reproduce the problem in local file system, 
> on HDFS, the available method is implemented so if there is not enough data, 
> we will throw EOFException earlier before parsing cells with the compression 
> decoder, so we will not add  duplicated word to dict. But in real world, it 
> is possible that even if there are enough data to read, we could hit an 
> IOException while reading and lead to the same problem described above.
> And while fixing, I also found another problem: in TagCompressionContext 
> and CompressionContext, we use the result of InputStream incorrectly, as we 
> just cast it to byte and test whether it is -1 to determine whether the field 
> is in the dict. The return value of InputStream.read is an int, and it will 
> return -1 if it reaches EOF, but here we will consider it as not in dict... We 
> should throw EOFException instead.
> I'm not sure whether fix this can also fix HBASE-27073 but let's have a try 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27621) Always use findEntry to fill the Dictionary when reading compressed WAL file

2023-02-08 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686143#comment-17686143
 ] 

Xiaolin Ha commented on HBASE-27621:


Tried this on our cluster; it works well in most circumstances.

But for Delete + replication + WAL compression, it still gets stuck...

> Always use findEntry to fill the Dictionary when reading compressed WAL file
> 
>
> Key: HBASE-27621
> URL: https://issues.apache.org/jira/browse/HBASE-27621
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> After trying several times, now I can reproduce a critical problem when 
> reading compressed WAL file in replication.
> The problem is about how we construct the LRUDictionary when resetting the 
> WALEntryStream. In the current design, we will not reconstruct the 
> LRUDictionary when resetting, but when reading again, we will call addEntry 
> directly to add 'new' word into the dict, which will mess up the dict and 
> cause data corruption.
> I've implemented a UT to simulate reading a partial WAL entry in replication; 
> with the current code base, after resetting and reading again, we will be stuck 
> there forever.
> The fix is to always use findEntry when constructing the dict when reading, 
> so we will not mess things up.
> Another possible fix is to always reconstruct the dict after resetting: we 
> will also clear the dict and reconstruct it again. But it is less efficient 
> as we need to read from the beginning to the position we want to seek to, 
> instead of seek to the position directly, especially when tailing the WAL 
> file which is currently being written.
> And notice that, the UT can only reproduce the problem in local file system, 
> on HDFS, the available method is implemented so if there is not enough data, 
> we will throw EOFException earlier before parsing cells with the compression 
> decoder, so we will not add  duplicated word to dict. But in real world, it 
> is possible that even if there are enough data to read, we could hit an 
> IOException while reading and lead to the same problem described above.
> And while fixing, I also found another problem: in TagCompressionContext 
> and CompressionContext, we use the result of InputStream incorrectly, as we 
> just cast it to byte and test whether it is -1 to determine whether the field 
> is in the dict. The return value of InputStream.read is an int, and it will 
> return -1 if it reaches EOF, but here we will consider it as not in dict... We 
> should throw EOFException instead.
> I'm not sure whether fix this can also fix HBASE-27073 but let's have a try 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27621) Always use findEntry to fill the Dictionary when reading compressed WAL file

2023-02-08 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685864#comment-17685864
 ] 

Xiaolin Ha commented on HBASE-27621:


It's timely and great! I'll try this on our cluster. 

But we have already tried to fix the stall by always reconstructing the dict 
after resetting, and it didn't work.

> Always use findEntry to fill the Dictionary when reading compressed WAL file
> 
>
> Key: HBASE-27621
> URL: https://issues.apache.org/jira/browse/HBASE-27621
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> After trying several times, now I can reproduce a critical problem when 
> reading compressed WAL file in replication.
> The problem is about how we construct the LRUDictionary when resetting the 
> WALEntryStream. In the current design, we will not reconstruct the 
> LRUDictionary when resetting, but when reading again, we will call addEntry 
> directly to add 'new' word into the dict, which will mess up the dict and 
> cause data corruption.
> I've implemented a UT to simulate reading a partial WAL entry in replication; 
> with the current code base, after resetting and reading again, we will be stuck 
> there forever.
> The fix is to always use findEntry when constructing the dict when reading, 
> so we will not mess things up.
> Another possible fix is to always reconstruct the dict after resetting: we 
> will also clear the dict and reconstruct it again. But it is less efficient 
> as we need to read from the beginning to the position we want to seek to, 
> instead of seek to the position directly, especially when tailing the WAL 
> file which is currently being written.
> And notice that, the UT can only reproduce the problem in local file system, 
> on HDFS, the available method is implemented so if there is not enough data, 
> we will throw EOFException earlier before parsing cells with the compression 
> decoder, so we will not add  duplicated word to dict. But in real world, it 
> is possible that even if there are enough data to read, we could hit an 
> IOException while reading and lead to the same problem described above.
> And while fixing, I also found another problem: in TagCompressionContext 
> and CompressionContext, we use the result of InputStream incorrectly, as we 
> just cast it to byte and test whether it is -1 to determine whether the field 
> is in the dict. The return value of InputStream.read is an int, and it will 
> return -1 if it reaches EOF, but here we will consider it as not in dict... We 
> should throw EOFException instead.
> I'm not sure whether fix this can also fix HBASE-27073 but let's have a try 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27544) RegionServer JVM crash when the RPC request size is too big

2022-12-22 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651184#comment-17651184
 ] 

Xiaolin Ha commented on HBASE-27544:


There are similar and duplicate issues; please see HBASE-26170 and HBASE-25997.

> RegionServer JVM crash when the RPC request size is too big
> ---
>
> Key: HBASE-27544
> URL: https://issues.apache.org/jira/browse/HBASE-27544
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.4
>Reporter: Yiran Wu
>Priority: Major
>
> In our cluster,  JVM crash when the request size is too big
> hs_err.log
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f6c4bbb7b31, pid=5619, tid=0x7f3dc57b4700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # C  [libc.so.6+0x15bb31]  __memmove_ssse3_back+0x1ba1
> #
> # Core dump written. Default location: /home/user/core or core.5619 (max size 
> 1048576 kB). To ensure a full core dump, try "ulimit -c unlimited" before 
> starting Java again
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Stack: [0x7f3dc56b4000,0x7f3dc57b5000],  sp=0x7f3dc57b2d48,  free 
> space=1019k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C  [libc.so.6+0x15bb31]  __memmove_ssse3_back+0x1ba1
> J 2301  sun.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V 
> (0 bytes) @ 0x7f6c35a3ae21 [0x7f6c35a3ad40+0xe1]
> j  
> org.apache.hadoop.hbase.util.UnsafeAccess.unsafeCopy(Ljava/lang/Object;JLjava/lang/Object;JJ)V+36
> j  
> org.apache.hadoop.hbase.util.UnsafeAccess.copy(Ljava/nio/ByteBuffer;I[BII)V+69
> j  
> org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray([BLjava/nio/ByteBuffer;III)V+39
> j  
> org.apache.hadoop.hbase.CellUtil.copyQualifierTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+31
> j  
> org.apache.hadoop.hbase.CellUtil.cloneQualifier(Lorg/apache/hadoop/hbase/Cell;)[B+12
> j  org.apache.hadoop.hbase.ByteBufferKeyValue.getQualifierArray()[B+1
> j  
> org.apache.hadoop.hbase.CellUtil.getCellKeyAsString(Lorg/apache/hadoop/hbase/Cell;Ljava/util/function/Function;)Ljava/lang/String;+97
> j  
> org.apache.hadoop.hbase.CellUtil.getCellKeyAsString(Lorg/apache/hadoop/hbase/Cell;)Ljava/lang/String;+6
> j  
> org.apache.hadoop.hbase.CellUtil.toString(Lorg/apache/hadoop/hbase/Cell;Z)Ljava/lang/String;+16
> j  org.apache.hadoop.hbase.ByteBufferKeyValue.toString()Ljava/lang/String;+2
> j  
> org.apache.hadoop.hbase.client.Mutation.add(Lorg/apache/hadoop/hbase/Cell;)Lorg/apache/hadoop/hbase/client/Mutation;+28
> j  
> org.apache.hadoop.hbase.client.Put.add(Lorg/apache/hadoop/hbase/Cell;)Lorg/apache/hadoop/hbase/client/Put;+2
> J 19274 C2 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toPut(Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MutationProto;Lorg/apache/hadoop/hbase/CellScanner;)Lorg/apache/hadoop/hbase/client/Put;
>  (910 bytes) @ 0x7f6c386ed4e4 [0x7f6c386eb7a0+0x1d44]
> J 32557 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;Lorg/apache/hadoop/hbase/regionserver/HRegion;Lorg/apache/hadoop/hbase/quotas/OperationQuota;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;JLorg/apache/hadoop/hbase/quotas/ActivePolicyEnforcement;Z)V
>  (1046 bytes) @ 0x7f6c39d9e494 [0x7f6c39d9dcc0+0x7d4]
> J 29517 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(Lorg/apache/hadoop/hbase/regionserver/HRegion;Lorg/apache/hadoop/hbase/quotas/OperationQuota;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionAction;Lorg/apache/hadoop/hbase/CellScanner;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;Ljava/util/List;JLorg/apache/hadoop/hbase/regionserver/RSRpcServices$RegionScannersCloseCallBack;Lorg/apache/hadoop/hbase/ipc/RpcCallContext;Lorg/apache/hadoop/hbase/quotas/ActivePolicyEnforcement;)Ljava/util/List;
>  (901 bytes) @ 0x7f6c39c25898 [0x7f6c39c24da0+0xaf8]
> J 31074 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse;
>  (1119 bytes) @ 0x7f6c39e7dcd4 [0x7f6c39e7db20+0x1b4]
> J 28404 C2 
> 

[jira] [Commented] (HBASE-27455) RegionServer JVM crash when scan

2022-10-31 Thread Xiaolin Ha (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626484#comment-17626484
 ] 

Xiaolin Ha commented on HBASE-27455:


Hi [~gaofeng2022], you can check whether HBASE-26155 and HBASE-26281 can 
resolve your problem.

> RegionServer JVM crash when scan
> 
>
> Key: HBASE-27455
> URL: https://issues.apache.org/jira/browse/HBASE-27455
> Project: HBase
>  Issue Type: Bug
>  Components: scan
>Affects Versions: 2.3.5
>Reporter: gaofeng
>Priority: Critical
> Attachments: hs_err_pid790137.log
>
>
> hs_err_pid790137.log
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f0389b452a6, pid=790137, tid=139652643038976
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 
> 1.8.0_20-b26)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x5832a6]  
> G1ParScanThreadState::copy_to_survivor_space(oopDesc*)+0x226
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> #
>  
> ---  T H R E A D  ---
>  
> Current thread (0x7f038405d000):  GCTaskThread [stack: 
> 0x7f036a1e2000,0x7f036a2e3000] [id=790310]
>  
> siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 
> 0x00e0
>  
> Registers:
> RAX=0x, RBX=0x7f038a4ee240, RCX=0x0003, 
> RDX=0x0001
> RSP=0x7f036a2e1310, RBP=0x7f036a2e1370, RSI=0x0004f00ebbe8, 
> RDI=0x080007a8
> R8 =0x9e01d77d, R9 =0x7f036a2e1590, R10=0x0004e7c08208, 
> R11=0x7f0384994900
> R12=0x0004f00ebbe8, R13=0x7f038497ca80, R14=0x7f036a2e1590, 
> R15=0x7f036a2e1590
> RIP=0x7f0389b452a6, EFLAGS=0x00010246, CSGSFS=0x0033, 
> ERR=0x0004
>   TRAPNO=0x000e



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-27455) RegionServer JVM crash when scan

2022-10-31 Thread Xiaolin Ha (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha reassigned HBASE-27455:
--

Assignee: Xiaolin Ha

> RegionServer JVM crash when scan
> 
>
> Key: HBASE-27455
> URL: https://issues.apache.org/jira/browse/HBASE-27455
> Project: HBase
>  Issue Type: Bug
>  Components: scan
>Affects Versions: 2.3.5
>Reporter: gaofeng
>Assignee: Xiaolin Ha
>Priority: Critical
> Attachments: hs_err_pid790137.log
>
>
> hs_err_pid790137.log
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f0389b452a6, pid=790137, tid=139652643038976
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 
> 1.8.0_20-b26)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x5832a6]  
> G1ParScanThreadState::copy_to_survivor_space(oopDesc*)+0x226
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> #
>  
> ---  T H R E A D  ---
>  
> Current thread (0x7f038405d000):  GCTaskThread [stack: 
> 0x7f036a1e2000,0x7f036a2e3000] [id=790310]
>  
> siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 
> 0x00e0
>  
> Registers:
> RAX=0x, RBX=0x7f038a4ee240, RCX=0x0003, 
> RDX=0x0001
> RSP=0x7f036a2e1310, RBP=0x7f036a2e1370, RSI=0x0004f00ebbe8, 
> RDI=0x080007a8
> R8 =0x9e01d77d, R9 =0x7f036a2e1590, R10=0x0004e7c08208, 
> R11=0x7f0384994900
> R12=0x0004f00ebbe8, R13=0x7f038497ca80, R14=0x7f036a2e1590, 
> R15=0x7f036a2e1590
> RIP=0x7f0389b452a6, EFLAGS=0x00010246, CSGSFS=0x0033, 
> ERR=0x0004
>   TRAPNO=0x000e



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

