[jira] [Created] (HBASE-26302) Due to many procedures reload from master:store, hmaster takes too much time to initialize
bolao created HBASE-26302:
Summary: Due to many procedures reload from master:store, hmaster takes too much time to initialize
Key: HBASE-26302
URL: https://issues.apache.org/jira/browse/HBASE-26302
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 2.3.5
Reporter: bolao
Attachments: image-2021-09-28-11-33-23-375.png, image-2021-09-28-11-33-41-612.png

When HBase restarts, HMaster takes a long time to initialize. We added some logging and found that it is stuck reloading procedures from master:store in ProcedureExecutor's init method.

{panel:title=1. the ProcedureExecutor logs only show}
2021-09-24 11:22:13 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster] INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(569) - Starting 30 core workers (bigger of cpus/4 or 16) with max (burst) worker count=300
2021-09-24 11:22:13 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster] INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(589) - Recovered RegionProcedureStore lease in 1 msec

and there is no log line for the load step: [https://github.com/apache/hbase/blob/cbebf85b3cfefc443ac8592908e8a6e95b020611/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java#L602]
{panel}

2. We added some logging in org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore#load:
{code:java}
loader.setMaxProcId(maxProcId);
LOG.info("there are {} procedures load from master:store", procs.size());
ProcedureTree tree = ProcedureTree.build(procs);
loader.load(tree.getValidProcs());
loader.handleCorrupted(tree.getCorruptedProcs());
{code}
and grepping the logs found:
2021-09-24 11:23:16 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster] INFO org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(294) - there are 3357861 procedures load from master:store

3.
We added some logging in org.apache.hadoop.hbase.procedure2.ProcedureExecutor#restoreLocks():
{code:java}
private void restoreLocks() {
  Set<Long> restored = new HashSet<>();
  Deque<Procedure<TEnvironment>> stack = new ArrayDeque<>();
  AtomicInteger num = new AtomicInteger();
  procedures.values().forEach(proc -> {
    for (;;) {
      LOG.info("this is num {}", num.incrementAndGet());
      if (restored.contains(proc.getProcId())) {
        restoreLocks(stack, restored);
        return;
      }
      if (!proc.hasParent()) {
        restoreLock(proc, restored);
        restoreLocks(stack, restored);
        return;
      }
      stack.push(proc);
      proc = procedures.get(proc.getParentProcId());
    }
  });
}
{code}
We found that by the time num had reached 160,000 (16W), about 20 minutes had elapsed.

4. Viewing the metadata of the HFile, the earliest timestamp is 28 June. !image-2021-09-28-11-33-41-612.png!

5. Reviewing the source code: the master:store TTL is the default value (HConstants.FOREVER) [https://github.com/apache/hbase/blob/fd3fdc08d1cd43eb3432a1a70d31c3aece6ecabe/hbase-server/src/main/java/org/apache/hadoop/hbase/master/region/MasterRegionFactory.java#L82] and the scan of master:store has no filter either. [https://github.com/apache/hbase/blob/cbebf85b3cfefc443ac8592908e8a6e95b020611/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L263]

So we have some questions:
1. Is it reasonable for the master:store TTL to be HConstants.FOREVER?
2. Can we keep master:store small by deleting some historical procedures?

Looking forward to your reply! Thanks!

-- This message was sent by Atlassian Jira (v8.3.4#803005)
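To make the cost of step 3 concrete, here is a self-contained sketch of the same memoized parent-chain walk (plain Java; the Proc class, chain shapes, and method names are hypothetical stand-ins, not HBase's Procedure or ProcedureExecutor). Counting iterations shows the walk itself stays roughly linear in the number of procedures (each procedure is restored once, plus one extra step per non-root start to hit an already-restored ancestor), which hints that the observed 20 minutes comes from per-iteration work such as logging and lock restoration rather than a quadratic blow-up.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RestoreLocksSketch {
  // Hypothetical stand-in for org.apache.hadoop.hbase.procedure2.Procedure.
  static final class Proc {
    final long procId;
    final Long parentProcId; // null for root procedures
    Proc(long procId, Long parentProcId) {
      this.procId = procId;
      this.parentProcId = parentProcId;
    }
  }

  // Walk each procedure up to its root; the "restored" set short-circuits
  // chains whose ancestors were already handled. Returns loop iterations.
  static long restoreAll(Map<Long, Proc> procedures) {
    Set<Long> restored = new HashSet<>();
    Deque<Proc> stack = new ArrayDeque<>();
    long iterations = 0;
    for (Proc start : procedures.values()) {
      Proc proc = start;
      for (;;) {
        iterations++;
        if (restored.contains(proc.procId)) {
          drain(stack, restored);
          break;
        }
        if (proc.parentProcId == null) {
          restored.add(proc.procId); // "restoreLock" on the root
          drain(stack, restored);
          break;
        }
        stack.push(proc);
        proc = procedures.get(proc.parentProcId);
      }
    }
    return iterations;
  }

  // Restore the pushed children once their root has been handled.
  static void drain(Deque<Proc> stack, Set<Long> restored) {
    while (!stack.isEmpty()) {
      restored.add(stack.pop().procId);
    }
  }

  public static void main(String[] args) {
    // 100 parent-child chains of depth 10 (1000 procedures total).
    Map<Long, Proc> procs = new HashMap<>();
    long id = 0;
    for (int c = 0; c < 100; c++) {
      Long parent = null;
      for (int d = 0; d < 10; d++) {
        procs.put(id, new Proc(id, parent));
        parent = id;
        id++;
      }
    }
    // Prints 1900: 1000 fresh visits + 900 hits on already-restored parents,
    // regardless of the order the HashMap yields the procedures in.
    System.out.println(restoreAll(procs));
  }
}
```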
[jira] [Created] (HBASE-26301) Backport backup/restore to branch-2
Bryan Beaudreault created HBASE-26301:
Summary: Backport backup/restore to branch-2
Key: HBASE-26301
URL: https://issues.apache.org/jira/browse/HBASE-26301
Project: HBase
Issue Type: New Feature
Reporter: Bryan Beaudreault

I was discussing this great feature with [~rda3mon] on Slack. His company is using it on their fork of HBase 2.1. We're working on upgrading to 2.4 now, and have our own home-grown backup/restore system which is not as sophisticated as the native solution. If this solution were backported to branch-2, we would strongly consider adopting it as we finish up our upgrade.

It looks like this was originally cut from 2.0 due to release timeline pressures (https://issues.apache.org/jira/browse/HBASE-19407), and it now suffers from a lack of community support. That might be expected, since it only exists in 3.x, which is not yet released. It would be great to backport this to branch-2 so that it can reach a wider audience and see more adoption.
[jira] [Created] (HBASE-26300) Incremental backup may be broken by MasterRegion implementation in 3.x
Bryan Beaudreault created HBASE-26300:
Summary: Incremental backup may be broken by MasterRegion implementation in 3.x
Key: HBASE-26300
URL: https://issues.apache.org/jira/browse/HBASE-26300
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault

I've been reading through the incremental backup implementation in the master branch to see how it handles some scenarios our own internal incremental backup process has to handle. One such failure we recently encountered as part of our ongoing hbase2 upgrade involves the new $masterlocalwal$-suffixed files in the oldWALs dir. Our parsing of the WAL files assumed that the last part of the file name would be a timestamp, which is not the case for these MasterRegion WALs.

I see [IncrementalBackupManager excludes ProcV2Wals|https://github.com/apache/hbase/blob/master/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalBackupManager.java#L104-L117], but I think that was replaced in https://issues.apache.org/jira/browse/HBASE-24408 with a MasterRegion. The new MasterRegion uses normal WALs, but archives them with a "$masterlocalwal$" suffix. I believe this would fail [around line 222 of IncrementalBackupManager|https://github.com/apache/hbase/blob/master/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalBackupManager.java#L222], because [BackupUtils.getCreationTime|https://github.com/apache/hbase/blob/master/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/util/BackupUtils.java#L383-L390] similarly expects the file names to end with a timestamp.

Unfortunately I am not set up to run the master branch or test the backup/restore functionality, but I wanted to log this because I happened to stumble upon it.
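To illustrate the suspected failure mode, here is a minimal sketch (plain Java; the parser and the example file names are simplified, hypothetical stand-ins for BackupUtils.getCreationTime and real archived WAL names, not the actual HBase code) of the "last dot-separated token is a timestamp" assumption and how a $masterlocalwal$-suffixed name breaks it:

```java
public class WalNameParseSketch {
  // Simplified version of the assumption that archived WAL file names
  // end in an epoch-millis timestamp after the last dot.
  static long getCreationTime(String walName) {
    int idx = walName.lastIndexOf('.');
    if (idx < 0) {
      throw new IllegalArgumentException("Cannot parse: " + walName);
    }
    // Throws NumberFormatException when the final token is not numeric.
    return Long.parseLong(walName.substring(idx + 1));
  }

  public static void main(String[] args) {
    // A normal archived WAL name (hypothetical): last token parses cleanly.
    System.out.println(getCreationTime("host%2C16020%2C1632400000000.1632400123456"));

    // A MasterRegion WAL archived with the $masterlocalwal$ suffix
    // (hypothetical name): the last token is no longer purely numeric.
    try {
      getCreationTime("host%2C16000%2C1632400000000.1632400123456$masterlocalwal$");
    } catch (NumberFormatException e) {
      System.out.println("failed to parse MasterRegion WAL name");
    }
  }
}
```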
[DISCUSS] use of Apache Yetus Audience Annotations
Hi! Heads up that a discussion has started in Apache Yetus about dropping the Audience Annotations and associated javadoc tooling due to lack of community support [1]. The most pressing issue, AFAICT, is that things there haven't been updated for the changes in how doclets are handled in JDK9+. We still center all of our API scoping on this library. In general it has been solid and has required relatively little investment on our part. Personally, I think this has worked really well for us so far, and we ought to try to keep the shared resource going. Unfortunately, I don't currently have the cycles to personally step up in the Yetus project.

What do folks think? Anyone able to help out in Yetus? Should we start moving to maintain this tooling internal to HBase?

[1]: https://s.apache.org/ybdl6 "[DISCUSS] Drop JDK8; audience-annotations" from d...@yetus.apache.org
[jira] [Created] (HBASE-26299) Fix TestHTableTracing.testTableClose for nightly build of branch-2
Tak-Lon (Stephen) Wu created HBASE-26299:
Summary: Fix TestHTableTracing.testTableClose for nightly build of branch-2
Key: HBASE-26299
URL: https://issues.apache.org/jira/browse/HBASE-26299
Project: HBase
Issue Type: Bug
Components: test, tracing
Affects Versions: 2.5.0
Reporter: Tak-Lon (Stephen) Wu

Something isn't right with the last testTableClose when we close the table and the connection; we need to figure out why it's not working in the unit test.

{code}
[ERROR] org.apache.hadoop.hbase.client.TestHTableTracing.testTableClose  Time elapsed: 0.001 s  <<< ERROR!
java.lang.IllegalStateException: GlobalOpenTelemetry.set has already been called. GlobalOpenTelemetry.set must be called only once before any calls to GlobalOpenTelemetry.get. If you are using the OpenTelemetrySdk, use OpenTelemetrySdkBuilder.buildAndRegisterGlobal instead. Previous invocation set to cause of this exception.
  at io.opentelemetry.api.GlobalOpenTelemetry.set(GlobalOpenTelemetry.java:83)
  at io.opentelemetry.sdk.testing.junit4.OpenTelemetryRule.before(OpenTelemetryRule.java:95)
  at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:50)
  at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
  at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
  at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
  at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
  at org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38)
  at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
  at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Throwable
  at io.opentelemetry.api.GlobalOpenTelemetry.set(GlobalOpenTelemetry.java:91)
  at io.opentelemetry.api.GlobalOpenTelemetry.get(GlobalOpenTelemetry.java:61)
  at io.opentelemetry.api.GlobalOpenTelemetry.getTracer(GlobalOpenTelemetry.java:110)
  at org.apache.hadoop.hbase.trace.TraceUtil.getGlobalTracer(TraceUtil.java:71)
  at org.apache.hadoop.hbase.trace.TraceUtil.createSpan(TraceUtil.java:95)
  at org.apache.hadoop.hbase.trace.TraceUtil.createSpan(TraceUtil.java:78)
  at org.apache.hadoop.hbase.trace.TraceUtil.lambda$trace$1(TraceUtil.java:176)
  at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:180)
  at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:176)
  at org.apache.hadoop.hbase.client.ConnectionImplementation.close(ConnectionImplementation.java:2110)
  at org.apache.hadoop.hbase.client.ConnectionImplementation.finalize(ConnectionImplementation.java:2149)
  at java.lang.System$2.invokeFinalize(System.java:1273)
  at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:102)
  at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
  at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:217)
{code}
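For background on the exception: GlobalOpenTelemetry is a set-once global, and the stack trace shows get() was reached first (from a finalizer closing a leaked connection), which lazily installs a default and consumes the one allowed set(); when OpenTelemetryRule.before() later calls set() for the next test, it throws. Below is a minimal plain-Java sketch of that set-once behavior (an illustration of the semantics only, not the OpenTelemetry implementation):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SetOnceGlobalSketch {
  // Mirrors the set-once semantics described in the exception message:
  // the global may be installed once; any later set() fails loudly.
  private static final AtomicReference<Object> GLOBAL = new AtomicReference<>();

  static void set(Object instance) {
    if (!GLOBAL.compareAndSet(null, instance)) {
      throw new IllegalStateException("set has already been called");
    }
  }

  static Object get() {
    Object existing = GLOBAL.get();
    if (existing != null) {
      return existing;
    }
    // Like GlobalOpenTelemetry.get(): lazily install a default, which
    // also uses up the single set() -- the pattern that bites the test
    // when a finalizer touches the tracer before the rule runs.
    Object fallback = new Object();
    set(fallback);
    return fallback;
  }

  public static void main(String[] args) {
    Object first = get();                // finalizer path: installs a default
    System.out.println(first == get());  // prints true: same instance returned
    try {
      set(new Object());                 // test rule's later set() path
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // prints "set has already been called"
    }
  }
}
```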