[jira] [Created] (HBASE-26302) Due to many procedures reload from master:store, hmaster takes too much time to initialize

2021-09-27 Thread bolao (Jira)
bolao created HBASE-26302:
-

 Summary: Due to many procedures reload from master:store, hmaster 
takes too much time to initialize
 Key: HBASE-26302
 URL: https://issues.apache.org/jira/browse/HBASE-26302
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.3.5
Reporter: bolao
 Attachments: image-2021-09-28-11-33-23-375.png, 
image-2021-09-28-11-33-41-612.png

      When HBase restarted, we found that HMaster took a long time to initialize. 
We added some logging to the jars and found that it is stuck reloading 
procedures from master:store in ProcedureExecutor's init method.

 
{panel:title=1. The ProcedureExecutor logs only contain}
2021-09-24 11:22:13 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster] 
INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(569) -Starting 
30 core workers (bigger of cpus/4 or 16) with max (burst) worker count=300
2021-09-24 11:22:13 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster] 
INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(589) -Recovered 
RegionProcedureStore lease in 1 msec
and do not contain the log line for the load step:

[https://github.com/apache/hbase/blob/cbebf85b3cfefc443ac8592908e8a6e95b020611/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java#L602]

 
{panel}
2. We added a log line like this in 
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore#load:

 
{code:java}
loader.setMaxProcId(maxProcId);
LOG.info("there are {} procedures load from master:store", procs.size());
ProcedureTree tree = ProcedureTree.build(procs);
loader.load(tree.getValidProcs());
loader.handleCorrupted(tree.getCorruptedProcs());
{code}
Grepping the log, we found:

2021-09-24 11:23:16 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster] 
INFO 
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(294) 
-there are 3357861 procedures load from master:store

3. We added a log line in 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor#restoreLocks():
{code:java}
private void restoreLocks() {
  Set<Long> restored = new HashSet<>();
  Deque<Procedure<TEnvironment>> stack = new ArrayDeque<>();
  AtomicInteger num = new AtomicInteger();
  procedures.values().forEach(proc -> {
    for (;;) {
      LOG.info("this is num {}", num.incrementAndGet());
      if (restored.contains(proc.getProcId())) {
        restoreLocks(stack, restored);
        return;
      }
      if (!proc.hasParent()) {
        restoreLock(proc, restored);
        restoreLocks(stack, restored);
        return;
      }
      stack.push(proc);
      proc = procedures.get(proc.getParentProcId());
    }
  });
}
{code}
We found that when num reached 160,000, about 20 minutes had already been spent.
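To make the cost concrete, here is a self-contained sketch of the parent-chain walk that restoreLocks performs (simplified stand-in, not HBase code: procedures are reduced to a procId-to-parentProcId map, and the counter counts loop iterations the same way our added log does):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RestoreLocksSketch {
  // parentOf: procId -> parentProcId (null for a root procedure).
  // Returns the total number of inner-loop iterations, i.e. what the
  // added LOG.info("this is num {}") counter would report.
  static long walk(Map<Long, Long> parentOf) {
    Set<Long> restored = new HashSet<>();
    long iterations = 0;
    for (Long procId : parentOf.keySet()) {
      Long cur = procId;
      Deque<Long> stack = new ArrayDeque<>();
      for (;;) {
        iterations++;
        if (restored.contains(cur)) {
          // An ancestor was already handled; restore everything we stacked.
          while (!stack.isEmpty()) restored.add(stack.pop());
          break;
        }
        Long parent = parentOf.get(cur);
        if (parent == null) {
          // Reached a root: restore it plus the stacked children.
          restored.add(cur);
          while (!stack.isEmpty()) restored.add(stack.pop());
          break;
        }
        stack.push(cur);
        cur = parent;
      }
    }
    return iterations;
  }
}
```

Each procedure is only restored once thanks to the visited set, but with millions of retained procedures the loop body still runs millions of times during master initialization.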

4. By viewing the metadata of the HFile, the earliest timestamp is June 28th.

!image-2021-09-28-11-33-41-612.png!

5. Reviewing the source code, the master:store TTL is the default value 
(HConstants.FOREVER):

[https://github.com/apache/hbase/blob/fd3fdc08d1cd43eb3432a1a70d31c3aece6ecabe/hbase-server/src/main/java/org/apache/hadoop/hbase/master/region/MasterRegionFactory.java#L82]

and the scan of master:store does not have a filter either:

[https://github.com/apache/hbase/blob/cbebf85b3cfefc443ac8592908e8a6e95b020611/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L263]

 

So we have some questions:

1. Is it reasonable to set the master:store TTL to HConstants.FOREVER?
2. Can we keep master:store small by deleting some historical procedures?
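As an illustration of question 1 only: HBase's ColumnFamilyDescriptorBuilder does expose setTimeToLive, so a finite TTL is mechanically possible when MasterRegionFactory builds the store's column family; whether letting procedure cells expire is actually safe is exactly what this issue asks. The "proc" family name and the 30-day value below are assumptions for illustration, not a proposed patch.

```java
// Hypothetical sketch: a finite TTL instead of the current HConstants.FOREVER.
// Family name "proc" and the 30-day value are illustrative assumptions.
ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
    .newBuilder(Bytes.toBytes("proc"))
    .setTimeToLive((int) TimeUnit.DAYS.toSeconds(30)) // hypothetical, not FOREVER
    .build();
```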
We look forward to your reply. Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26301) Backport backup/restore to branch-2

2021-09-27 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-26301:
-

 Summary: Backport backup/restore to branch-2
 Key: HBASE-26301
 URL: https://issues.apache.org/jira/browse/HBASE-26301
 Project: HBase
  Issue Type: New Feature
Reporter: Bryan Beaudreault


I was discussing this great feature with [~rda3mon] on Slack. His company is 
using this on their fork of hbase 2.1. We're working on upgrading to 2.4 now, 
and have our own home grown backup/restore system which is not as sophisticated 
as the native solution. If this solution was backported to branch-2, we would 
strongly consider adopting it as we finish up our upgrade.

It looks like this was originally cut from 2.0 due to release timeline 
pressures: https://issues.apache.org/jira/browse/HBASE-19407, and now suffers 
from a lack of community support. This might make sense since it only exists in 
3.x, which is not yet released.

It would be great to backport this to branch-2 so that it reaches a wider 
audience and gains more adoption.





[jira] [Created] (HBASE-26300) Incremental backup may be broken by MasterRegion implementation in 3.x

2021-09-27 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-26300:
-

 Summary: Incremental backup may be broken by MasterRegion 
implementation in 3.x
 Key: HBASE-26300
 URL: https://issues.apache.org/jira/browse/HBASE-26300
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


I've been reading through the incremental backup implementation in master 
branch to see how it handled some scenarios our own internal incremental backup 
process has to handle. One such failure we recently encountered as part of our 
ongoing hbase2 upgrade is the new $masterlocalwal$ suffixed files in the 
oldWALs dir. Our parsing of the WAL files assumed that the last part of the 
file name would be a timestamp, which is not the case for these MasterRegion 
WALs.

I see [IncrementalBackupManager excludes 
ProcV2Wals|https://github.com/apache/hbase/blob/master/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalBackupManager.java#L104-L117],
 but I think that was replaced in 
https://issues.apache.org/jira/browse/HBASE-24408 with a MasterRegion. The new 
MasterRegion uses normal WALs, but archives them with a suffix 
"$masterlocalwal$".

I believe this would fail [around line 222 of 
IncrementalBackupManager|https://github.com/apache/hbase/blob/master/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalBackupManager.java#L222],
 because 
[BackupUtils.getCreationTime|https://github.com/apache/hbase/blob/master/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/util/BackupUtils.java#L383-L390]
 similarly expects the file names to end with a timestamp.
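As a hedged illustration of the suspected failure mode (the real parsing lives in BackupUtils.getCreationTime at the link above; this is a simplified stand-in, not the actual HBase implementation):

```java
public class WalNameSketch {
  // Simplified stand-in for BackupUtils.getCreationTime: take the token after
  // the last '.' in the WAL file name and parse it as a millisecond timestamp.
  static long getCreationTime(String walName) {
    int idx = walName.lastIndexOf('.');
    // Archived MasterRegion WALs carry the "$masterlocalwal$" suffix, so the
    // last token is not a pure number and this throws NumberFormatException
    // instead of returning a timestamp.
    return Long.parseLong(walName.substring(idx + 1));
  }
}
```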

Unfortunately I am not set up to run master branch or test the backup/restore 
functionality, but I wanted to log this because I happened to stumble upon it.

 





[DISCUSS] use of Apache Yetus Audience Annotations

2021-09-27 Thread Sean Busbey
Hi!

Heads up that a discussion has started in Apache Yetus about dropping
the Audience Annotations and associated javadoc tooling due to lack of
community support[1]. The current cutting issue AFAICT is that things
there haven't been updated for the changes in how doclets are handled
in JDK9+.

We still center all of our API scoping on this library. In general it
has been solid and required relatively little investment on our part.

Personally, I think this has worked really well for us so far and we
ought to try to keep the shared resource going. Unfortunately, I don't
currently have the cycles to personally step up in the Yetus project.

What do folks think? Anyone able to help out in Yetus? Should we start
moving to maintain this tooling internal to HBase?

[1]:
https://s.apache.org/ybdl6
"[DISCUSS] Drop JDK8; audience-annotations" from d...@yetus.apache.org


[jira] [Created] (HBASE-26299) Fix TestHTableTracing.testTableClose for nightly build of branch-2

2021-09-27 Thread Tak-Lon (Stephen) Wu (Jira)
Tak-Lon (Stephen) Wu created HBASE-26299:


 Summary: Fix TestHTableTracing.testTableClose for nightly build of 
branch-2
 Key: HBASE-26299
 URL: https://issues.apache.org/jira/browse/HBASE-26299
 Project: HBase
  Issue Type: Bug
  Components: test, tracing
Affects Versions: 2.5.0
Reporter: Tak-Lon (Stephen) Wu


Something isn't right with the last testTableClose when we close the table and 
the connection; we need to figure out why it's not working in the unit test.
{code}
[ERROR] org.apache.hadoop.hbase.client.TestHTableTracing.testTableClose  Time 
elapsed: 0.001 s  <<< ERROR!
java.lang.IllegalStateException: GlobalOpenTelemetry.set has already been 
called. GlobalOpenTelemetry.set must be called only once before any calls to 
GlobalOpenTelemetry.get. If you are using the OpenTelemetrySdk, use 
OpenTelemetrySdkBuilder.buildAndRegisterGlobal instead. Previous invocation set 
to cause of this exception.
at 
io.opentelemetry.api.GlobalOpenTelemetry.set(GlobalOpenTelemetry.java:83)
at 
io.opentelemetry.sdk.testing.junit4.OpenTelemetryRule.before(OpenTelemetryRule.java:95)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:50)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at 
org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Throwable
at 
io.opentelemetry.api.GlobalOpenTelemetry.set(GlobalOpenTelemetry.java:91)
at 
io.opentelemetry.api.GlobalOpenTelemetry.get(GlobalOpenTelemetry.java:61)
at 
io.opentelemetry.api.GlobalOpenTelemetry.getTracer(GlobalOpenTelemetry.java:110)
at 
org.apache.hadoop.hbase.trace.TraceUtil.getGlobalTracer(TraceUtil.java:71)
at org.apache.hadoop.hbase.trace.TraceUtil.createSpan(TraceUtil.java:95)
at org.apache.hadoop.hbase.trace.TraceUtil.createSpan(TraceUtil.java:78)
at 
org.apache.hadoop.hbase.trace.TraceUtil.lambda$trace$1(TraceUtil.java:176)
at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:180)
at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:176)
at 
org.apache.hadoop.hbase.client.ConnectionImplementation.close(ConnectionImplementation.java:2110)
at 
org.apache.hadoop.hbase.client.ConnectionImplementation.finalize(ConnectionImplementation.java:2149)
at java.lang.System$2.invokeFinalize(System.java:1273)
at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:102)
at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:217)
{code}


