My supposition was correct: data is being written continually to the cluster. 
Thus, it looks like there are always RITs across the cluster.

Without a lull in activity, will balancing never begin?  Major compactions seem 
to have no effect on the overall balancing of data across nodes.
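
Presumably a manual run from the shell, e.g.

hbase(main):001:0> balancer_enabled
hbase(main):002:0> balancer

will simply keep returning false for as long as those RITs are present?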

-----Original Message-----
From: Sean Busbey <bus...@apache.org> 
Sent: Tuesday, January 26, 2021 5:46 PM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: Region server idle

You can change the logging for just the balancer system so that not so much 
goes into the logs. The logger to change is 
org.apache.hadoop.hbase.master.balancer
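
For example, in log4j.properties (or the equivalent Cloudera Manager safety 
valve) something like the following should be enough - the exact mechanism 
will depend on how your deployment manages logging:

log4j.logger.org.apache.hadoop.hbase.master.balancer=DEBUG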

On Tue, Jan 26, 2021 at 9:46 AM Marc Hoppins <marc.hopp...@eset.sk> wrote:
>
> I did change it on the master. The log filled up with a bunch of reads and 
> writes, so I discontinued it after about an hour.  Certainly, this was 
> because another IT-er had kicked off an HDFS balance. Obviously, I didn't really 
> want to try running one tool while the other was already running.
>
> -----Original Message-----
> From: Sean Busbey <bus...@apache.org>
> Sent: Tuesday, January 26, 2021 3:59 PM
> To: Hbase-User <user@hbase.apache.org>
> Subject: Re: Region server idle
>
> Hi Marc!
>
> Did turning up the balancer's logging confirm that it is actually making 
> plans? Did it confirm what weights are being used for the various balancer 
> functions?
>
> On Tue, Jan 26, 2021 at 8:47 AM Marc Hoppins <marc.hopp...@eset.sk> wrote:
> >
> > Region counts have crept up (very slowly) to
> >
> > Hbase19 - 15 regions
> > Hbase20 - 8 regions
> >
> > So, would either continual reads or (more likely) continual writes have 
> > such an effect on region splits and/or HBase balancing?
> >
> > It is confusing as 3 new servers (NN, RS) were added to the cluster back in 
> > October 2020 and they were well-integrated within a day or so.
> >
> > -----Original Message-----
> > From: Josh Elser <els...@apache.org>
> > Sent: Sunday, January 24, 2021 10:22 PM
> > To: user@hbase.apache.org
> > Subject: Re: Region server idle
> >
> > Yes, each RegionServer has its own writeahead log (which is named with that 
> > RS' hostname).
> >
> > You'd want to look at the HBase master log, specifically for reasons as to 
> > why balancing is not naturally happening. It very well could be that you 
> > have other regions in transition (which may prevent balancing from 
> > happening). This is just one reason balancing may not be happening naturally, 
> > but you should be able to see this in the active master log (potentially 
> > after enabling DEBUG on org.apache.hadoop.hbase first).
> > Don't forget about the hbase shell command to request the balancer to 
> > immediately run (so you can look for that logging at a specific point in 
> > time).
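> >
> > For reference, a shell session for that would look roughly like this (the 
> > balancer command returns false if the master declines to run it):
> >
> > hbase(main):001:0> balancer_enabled
> > hbase(main):002:0> balancer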
> >
> > On 1/18/21 7:23 AM, Marc Hoppins wrote:
> > > I have been checking for days and there are no outstanding RITs.  Region 
> > > servers do not have their own WAL files, do they?
> > >
> > > What gives me pause is that, although the affected servers (hbase19 & 
> > > hbase20) have 11 and 3 regions respectively, there must be very little 
> > > usable data, as the requests per second are negligible for hbase19 and 
> > > zero for hbase20.
> > >
> > > I would have expected SOME movement to distribute data and work onto 
> > > these 'vacant' systems after more than 2 weeks.
> > >
> > > The circumstance behind hbase19 going offline is that a memory module had 
> > > failed and dump data was constantly filling up tmp storage, so the on-call 
> > > guy made the decision to shut the system down. Given that a lot of HBase 
> > > work is done in memory, is there any possible way something still lingers 
> > > in memory somewhere that has not been flushed?
> > >
> > > As for hbase20, an IT guy decommissioned the host in the Cloudera console 
> > > and recommissioned it as a test to see if region balancing proceeded as 
> > > normal. Obviously, it hasn't. For obvious reasons, a second test has not 
> > > been performed.
> > >
> > > -----Original Message-----
> > > From: Josh Elser <els...@apache.org>
> > > Sent: Tuesday, January 12, 2021 4:56 PM
> > > To: user@hbase.apache.org
> > > Subject: Re: Region server idle
> > >
> > > Yes, in general, HDFS rebalancing will cause a decrease in the 
> > > performance of HBase as it removes the ability for HBase to short-circuit 
> > > some read logic. It should not, however, cause any kind of errors or lack 
> > > of availability.
> > >
> > > You should feel free to investigate the RITs you have now, rather than 
> > > wait for a major compaction to finish. As a reminder, you can also force 
> > > one to happen now via the `major_compact` HBase shell command, for each 
> > > table (or at least the tables which are most important). Persistent RITs 
> > > will prevent balancing from happening; that may be your smoking gun.
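> > >
> > > For example (the table name below is just a placeholder):
> > >
> > > hbase(main):001:0> major_compact 'my_table'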
> > >
> > > It may also be helpful for you to reach out to your vendor for support if 
> > > you have not done so already.
> > >
> > > On 1/12/21 6:11 AM, Marc Hoppins wrote:
> > >> I read that HDFS balancing doesn't sit well with HBASE balancing.  A 
> > >> colleague rebalanced HDFS on Friday. If I look for a rebalance option while 
> > >> wandering through HBASE in Cloudera Manager, it redirects to HDFS balance.
> > >>
> > >> I'd suggest I wait for major compaction to occur but who knows when that 
> > >> will be? Despite the default setting of 7 days in place, from what I 
> > >> read this will be dependent on no RITs being performed.  As this is not 
> > >> just a working cluster but one of the more important ones, I am not sure 
> > >> if we can finish up any RITs to make the database 'passive' enough to 
> > >> perform a major compaction.
> > >>
> > >> Once again, my inexperience in this area may be giving me misinformation.
> > >>
> > >> -----Original Message-----
> > >> From: Josh Elser <els...@apache.org>
> > >> Sent: Monday, January 11, 2021 5:34 PM
> > >> To: user@hbase.apache.org
> > >> Subject: Re: Region server idle
> > >>
> > >> The Master stacktrace you have there does read as a bug, but it 
> > >> shouldn't be affecting balancing.
> > >>
> > >> That Chore is doing work to apply space quotas, but your quota here is 
> > >> only doing RPC (throttle) quotas. Might be something already fixed since 
> > >> the version you're on. I'll see if anything jumps out at me on Jira.
> > >>
> > >> If the Master isn't giving you any good logging, you could set 
> > >> the Log4j level to DEBUG for org.apache.hadoop.hbase (either via 
> > >> CM or the HBase UI for the active master, assuming that feature 
> > >> isn't disabled for security reasons in your org -- 
> > >> master.ui.readonly something something config property in 
> > >> hbase-site.xml)
> > >>
> > >> If DEBUG doesn't help, I'd set TRACE level for 
> > >> org.apache.hadoop.hbase.master.balancer. Granted, it might not be 
> > >> obvious to the untrained eye, but if you can share that DEBUG/TRACE 
> > >> after you manually invoke the balancer again via hbase shell, it should 
> > >> be enough for those watching here.
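> > >>
> > >> In log4j.properties terms (or the CM safety valve), that would be roughly:
> > >>
> > >> log4j.logger.org.apache.hadoop.hbase=DEBUG
> > >> log4j.logger.org.apache.hadoop.hbase.master.balancer=TRACE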
> > >>
> > >> On 1/11/21 5:32 AM, Marc Hoppins wrote:
> > >>> OK. So I tried again after running kinit and got the following:
> > >>>
> > >>> Took 0.0010 seconds
> > >>> hbase(main):001:0> list_quotas
> > >>> OWNER                                            QUOTAS
> > >>>     USER => robot_urlrs                             TYPE => THROTTLE, 
> > >>> THROTTLE_TYPE => REQUEST_NUMBER, LIMIT => 100req/sec, SCOPE => MACHINE
> > >>> 1 row(s)
> > >>>
> > >>> Not sure what to make of it but it doesn't seem like it is enough to 
> > >>> prevent balancing.  There are other tables and (probably) other users.
> > >>>
> > >>> -----Original Message-----
> > >>> From: Marc Hoppins <marc.hopp...@eset.sk>
> > >>> Sent: Monday, January 11, 2021 9:52 AM
> > >>> To: user@hbase.apache.org
> > >>> Subject: RE: Region server idle
> > >>>
> > >>> I tried. Appears to have failed reading data from hbase:meta. These are 
> > >>> repeated errors for the whole run of list_quotas.
> > >>>
> > >>> A balance task was run on Friday. It took 9+ hours. The affected host 
> > >>> had 6 regions - no procedures/locks or processes were running for those 
> > >>> 6 regions. Today, that host has 8 regions.  No real work being 
> > >>> performed on them.  The other server - which went idle as a result of 
> > >>> removing the hbase19 host from HBase and re-inserting it - is still 
> > >>> doing nothing and has no regions assigned.
> > >>>
> > >>> I used 'su - hbase' and then 'hbase shell' to run it.
> > >>>
> > >>> ****************
> > >>>
> > >>> HBase Shell
> > >>> Use "help" to get list of supported commands.
> > >>> Use "exit" to quit this interactive shell.
> > >>> For Reference, please visit:
> > >>> http://hbase.apache.org/2.0/book.html#shell
> > >>> Version 2.1.0-cdh6.3.2, rUnknown, Fri Nov  8 05:44:07 PST 2019
> > >>> Took 0.0011 seconds
> > >>> hbase(main):001:0> list_quotas
> > >>> OWNER                                      QUOTAS
> > >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=8, exceptions:
> > >>> Mon Jan 11 09:16:46 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> javax.security.sasl.SaslException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided
> > >>> (Mechanism level: Failed to find any Kerberos tgt)] [Caused by javax.security.sasl.SaslException: GSS initiate
> > >>> failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]]
> > >>> Mon Jan 11 09:16:46 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> java.io.IOException: Can not send request because relogin is in progress.
> > >>> Mon Jan 11 09:16:46 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> java.io.IOException: Can not send request because relogin is in progress.
> > >>> Mon Jan 11 09:16:47 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> java.io.IOException: Can not send request because relogin is in progress.
> > >>> Mon Jan 11 09:16:47 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> java.io.IOException: Can not send request because relogin is in progress.
> > >>> Mon Jan 11 09:16:48 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> javax.security.sasl.SaslException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided
> > >>> (Mechanism level: Failed to find any Kerberos tgt)] [Caused by javax.security.sasl.SaslException: GSS initiate
> > >>> failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]]
> > >>> Mon Jan 11 09:16:50 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> java.io.IOException: Can not send request because relogin is in progress.
> > >>> Mon Jan 11 09:16:54 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8},
> > >>> javax.security.sasl.SaslException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception:
> > >>> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided
> > >>> (Mechanism level: Failed to find any Kerberos tgt)] [Caused by javax.security.sasl.SaslException: GSS initiate
> > >>> failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]]
> > >>>
> > >>> -----Original Message-----
> > >>> From: Stack <st...@duboce.net>
> > >>> Sent: Saturday, January 9, 2021 1:52 AM
> > >>> To: Hbase-User <user@hbase.apache.org>
> > >>> Subject: Re: Region server idle
> > >>>
> > >>> Looking at the code around the exception, can you check your quota settings? 
> > >>> See the refguide on how to list quotas. Look for a table or namespace that is 
> > >>> empty or non-existent and fill in the missing portion.
> > >>>
> > >>> This is master-side log? It is from a periodic task so perhaps 
> > >>> something else is in the way of the non-assign? Anything else in there 
> > >>> about balancing or why we are skipping assign to these servers? Try a 
> > >>> balance run in the shell and then check master log to see why no work 
> > >>> done?
> > >>>
> > >>> S
> > >>>
> > >>> On Fri, Jan 8, 2021 at 2:51 AM Marc Hoppins <marc.hopp...@eset.sk> 
> > >>> wrote:
> > >>>
> > >>>> Apologies again.  Here is the full error message.
> > >>>>
> > >>>> 2021-01-08 11:34:15,831 ERROR org.apache.hadoop.hbase.ScheduledChore:
> > >>>> Caught error
> > >>>> java.lang.IllegalStateException: Expected only one of namespace 
> > >>>> and tablename to be null
> > >>>>            at
> > >>>> org.apache.hadoop.hbase.quotas.SnapshotQuotaObserverChore.getSnapshotsToComputeSize(SnapshotQuotaObserverChore.java:198)
> > >>>>            at
> > >>>> org.apache.hadoop.hbase.quotas.SnapshotQuotaObserverChore._chore(SnapshotQuotaObserverChore.java:126)
> > >>>>            at
> > >>>> org.apache.hadoop.hbase.quotas.SnapshotQuotaObserverChore.chore(SnapshotQuotaObserverChore.java:113)
> > >>>>            at
> > >>>> org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
> > >>>>            at
> > >>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > >>>>            at 
> > >>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> > >>>>            at
> > >>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> > >>>>            at
> > >>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> > >>>>            at
> > >>>> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
> > >>>>            at
> > >>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >>>>            at
> > >>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >>>>            at java.lang.Thread.run(Thread.java:748)
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Marc Hoppins <marc.hopp...@eset.sk>
> > >>>> Sent: Friday, January 8, 2021 10:57 AM
> > >>>> To: user@hbase.apache.org
> > >>>> Subject: RE: Region server idle
> > >>>>
> > >>>> So, I tried decommissioning that RS and recommissioning it.  No change.
> > >>>> The server is still idle.
> > >>>>
> > >>>> I tried decommissioning another server to see if HBASE sets itself right.
> > >>>> Now I have two RSs that are idle.
> > >>>>
> > >>>> ba-hbase18.jumbo.hq.com,16020,1604413480001   Tue Nov 03 15:24:40 CET 2020   1 s   2.1.0-cdh6.3.2   13   471
> > >>>> ba-hbase19.jumbo.hq.com,16020,1610095488001   Fri Jan 08 09:44:48 CET 2021   0 s   2.1.0-cdh6.3.2    0     6
> > >>>> ba-hbase20.jumbo.hq.com,16020,1610096850259   Fri Jan 08 10:07:30 CET 2021   0 s   2.1.0-cdh6.3.2    0     0
> > >>>> ba-hbase21.jumbo.hq.com,16020,1604414101652   Tue Nov 03 15:35:01 CET 2020   1 s   2.1.0-cdh6.3.2   15   447
> > >>>>
> > >>>>    From the logs:
> > >>>> 2021-01-08 10:25:36,875 ERROR org.apache.hadoop.hbase.ScheduledChore:
> > >>>> Caught error java.lang.IllegalStateException: Expected only one 
> > >>>> of namespace and tablename to be null
> > >>>>
> > >>>> This is reappearing in hbase master log
> > >>>>
> > >>>> M
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Sean Busbey <bus...@apache.org>
> > >>>> Sent: Thursday, January 7, 2021 7:30 PM
> > >>>> To: Hbase-User <user@hbase.apache.org>
> > >>>> Subject: Re: Region server idle
> > >>>>
> > >>>> Sounds like https://issues.apache.org/jira/browse/HBASE-24139
> > >>>>
> > >>>> The description of that jira has a workaround.
> > >>>>
> > >>>> On Thu, Jan 7, 2021, 05:23 Marc Hoppins <marc.hopp...@eset.sk> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I have a setup with 67 region servers. On 29 Dec, one system had 
> > >>>>> to be shut down to have an EMM module swapped out - which took one work 
> > >>>>> day. The host was back online 30 Dec.
> > >>>>>
> > >>>>> My HBASE is very basic so I appreciate your patience.
> > >>>>>
> > >>>>> My understanding of the defaults that are set up is that a 
> > >>>>> major compaction should occur every 7 days.  Moreover, am I right to 
> > >>>>> assume that more extensive balancing may occur after this happens?
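> > >>>>>
> > >>>>> (For reference, I believe the 7-day default corresponds to
> > >>>>> hbase.hregion.majorcompaction=604800000 ms in hbase-site.xml.)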
> > >>>>>
> > >>>>> When I check (via hbase master UI) the status of HBASE, I see 
> > >>>>> the
> > >>>>> following:
> > >>>>>
> > >>>>> ServerName | Start time | Last contact | Version | Requests Per Second | Num. Regions
> > >>>>>
> > >>>>> ba-hbase16.jumbo.hq.com,16020,1604413068640 <http://ba-hbase16.jumbo.hq.eset.com:16030/rs-status>
> > >>>>>   Tue Nov 03 15:17:48 CET 2020 | 3 s | 2.1.0-cdh6.3.2 | 46 | 462
> > >>>>>
> > >>>>> ba-hbase17.jumbo.hq.com,16020,1604413274393 <http://ba-hbase17.jumbo.hq.eset.com:16030/rs-status>
> > >>>>>   Tue Nov 03 15:21:14 CET 2020 | 1 s | 2.1.0-cdh6.3.2 | 19 | 462
> > >>>>>
> > >>>>> ba-hbase18.jumbo.hq.com,16020,1604413480001 <http://ba-hbase18.jumbo.hq.eset.com:16030/rs-status>
> > >>>>>   Tue Nov 03 15:24:40 CET 2020 | 2 s | 2.1.0-cdh6.3.2 | 62 | 461
> > >>>>>
> > >>>>> ba-hbase19.jumbo.hq.com,16020,1609326754985 <http://ba-hbase19.jumbo.hq.eset.com:16030/rs-status>
> > >>>>>   Wed Dec 30 12:12:34 CET 2020 | 2 s | 2.1.0-cdh6.3.2 | 0 | 0
> > >>>>>
> > >>>>> ba-hbase20.jumbo.hq.com,16020,1604413895967 <http://ba-hbase20.jumbo.hq.eset.com:16030/rs-status>
> > >>>>>   Tue Nov 03 15:31:35 CET 2020 | 2 s | 2.1.0-cdh6.3.2 | 62 | 503
> > >>>>>
> > >>>>> ba-hbase21.jumbo.hq.com,16020,1604414101652 <http://ba-hbase21.jumbo.hq.eset.com:16030/rs-status>
> > >>>>>   Tue Nov 03 15:35:01 CET 2020 | 3 s | 2.1.0-cdh6.3.2 | 59 | 442
> > >>>>>
> > >>>>> ba-hbase22.jumbo.hq.com,16020,1604414308289 <http://ba-hbase22.jumbo.hq.eset.com:16030/rs-status>
> > >>>>>   Tue Nov 03 15:38:28 CET 2020 | 0 s | 2.1.0-cdh6.3.2 | 40 | 438
> > >>>>>
> > >>>>>
> > >>>>> Why, after more than 7 days, is this host not hosting more (any) 
> > >>>>> regions?
> > >>>>> Should I initiate some kind of rebalancing?
> > >>>>>
> > >>>>> Thanks in advance.
> > >>>>>
> > >>>>> M
> > >>>>>
> > >>>>
