[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master
[ https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Elenskiy updated HBASE-25480:
------------------------------------
    Labels: JMX NullPointerException master  (was: )

> NPE when getting metrics of backup master
> -----------------------------------------
>
>                 Key: HBASE-25480
>                 URL: https://issues.apache.org/jira/browse/HBASE-25480
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.4.0, 2.4.1
>            Reporter: Andrey Elenskiy
>            Assignee: Anjan Das
>            Priority: Major
>              Labels: JMX, NullPointerException, master
>
> Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount()
> when getting metrics via JMX on a backup master. It appears to be due to the fact
> that regionNormalizerManager is null on backup masters, as it is only
> initialized by HMaster.finishActiveMasterInitialization().

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master
[ https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Elenskiy updated HBASE-25480:
------------------------------------
    Affects Version/s: 2.4.1
[jira] [Commented] (HBASE-25480) NPE when getting metrics of backup master
[ https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263043#comment-17263043 ]

Andrey Elenskiy commented on HBASE-25480:
-----------------------------------------

Yup, please go ahead, [~dasanjan1296]!
[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master
[ https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Elenskiy updated HBASE-25480:
------------------------------------
    Issue Type: Bug  (was: Brainstorming)
[jira] [Commented] (HBASE-25480) NPE when getting metrics of backup master
[ https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261478#comment-17261478 ]

Andrey Elenskiy commented on HBASE-25480:
-----------------------------------------

Updated to Major because it breaks observability of backup masters, making it look like there are none (in Prometheus, for example).
[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master
[ https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Elenskiy updated HBASE-25480:
------------------------------------
    Priority: Major  (was: Minor)
[jira] [Created] (HBASE-25480) NPE when getting metrics of backup master
Andrey Elenskiy created HBASE-25480:
---------------------------------------

             Summary: NPE when getting metrics of backup master
                 Key: HBASE-25480
                 URL: https://issues.apache.org/jira/browse/HBASE-25480
             Project: HBase
          Issue Type: Brainstorming
          Components: master
    Affects Versions: 2.4.0
            Reporter: Andrey Elenskiy

Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() when getting metrics via JMX on a backup master. It appears to be due to the fact that regionNormalizerManager is null on backup masters, as it is only initialized by HMaster.finishActiveMasterInitialization().
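The report pins the NPE to MetricsMasterWrapperImpl.getMergePlanCount() dereferencing a regionNormalizerManager that is never initialized on a backup master. As a rough illustration (not the actual upstream patch; all types below are simplified stubs that merely borrow names from the report), a null guard in the metrics wrapper avoids the crash during JMX collection:

```java
// Stub standing in for the real normalizer manager.
final class RegionNormalizerManager {
    long getMergePlanCount() { return 7; }
}

// Stub standing in for HMaster: the field stays null until
// finishActiveMasterInitialization() runs, which only happens on the
// active master, never on a backup.
final class HMaster {
    RegionNormalizerManager regionNormalizerManager;
}

public final class MetricsMasterWrapperImplSketch {
    private final HMaster master;

    MetricsMasterWrapperImplSketch(HMaster master) { this.master = master; }

    public long getMergePlanCount() {
        RegionNormalizerManager mgr = master.regionNormalizerManager;
        // Backup master: report 0 instead of throwing NPE during JMX collection.
        return mgr == null ? 0 : mgr.getMergePlanCount();
    }

    public static void main(String[] args) {
        HMaster backup = new HMaster(); // regionNormalizerManager stays null
        System.out.println(new MetricsMasterWrapperImplSketch(backup).getMergePlanCount()); // prints 0

        HMaster active = new HMaster();
        active.regionNormalizerManager = new RegionNormalizerManager();
        System.out.println(new MetricsMasterWrapperImplSketch(active).getMergePlanCount()); // prints 7
    }
}
```

The design question the fix has to answer is whether a backup master should report a sentinel value (0 here) or skip registering the metric entirely; the sketch takes the simpler route.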
[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233860#comment-17233860 ]

Andrey Elenskiy commented on HBASE-22665:
-----------------------------------------

Hitting exactly the same issue on 2.2.4. The issue was exposed when the host was under massive load, causing various pauses and OOMs across the OS. All the handlers are stuck on SyncFuture.get() while the AsyncFSWAL thread is just waiting on a condition.

Also, the regionserver didn't trigger AbstractFSWAL.shutdown() initially and was stuck without it. Only once I connected with jdb to check the state was shutdown() called, and it was then stuck exiting until a timeout killed the process (prior to that it was stuck appending).

Here is how the handlers look:
{code:java}
Thread 48 (RpcServer.default.FPBQ.Fifo.handler=7,queue=1,port=16201):
  State: TIMED_WAITING
  Blocked count: 1193588
  Waited count: 2105228
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:142)
    org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:722)
    org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:637)
    org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:8588)
    org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:7946)
    org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4130)
    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4072)
    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4003)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1042)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:974)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:937)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2755)
    org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42290)
    org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
    org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}

How the AsyncFSWAL thread looks:
{code:java}
Thread 164 (AsyncFSWAL-0):
  State: WAITING
  Blocked count: 6198
  Waited count: 32255423
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@bf1c663
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)
{code}

There were lots of logs like this right before it got stuck:
{code:java}
2020-11-17 14:37:53,902 WARN [AsyncFSWAL-0] wal.AsyncFSWAL: sync failed
java.io.IOException: Connection to 192.168.2.23/192.168.2.23:15010 closed
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$channelInactive$2(FanOutOneBlockAsyncDFSOutput.java:289)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:236)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:99)
    at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.channelInactive(FanOutOneBlockAsyncDFSOutput.java:288)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
    at org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
    at
{code}
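The handlers above block in SyncFuture.get() waiting for the WAL thread to complete the sync. As a minimal illustration of that wait-with-deadline pattern (a hypothetical SimpleSyncFuture, far simpler than HBase's real SyncFuture), a bounded wait is what turns a stuck WAL into a "WAL system stuck?" timeout rather than a forever-blocked handler:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Minimal stand-in for the SyncFuture pattern described above: handlers block
// on get(timeout) while a background WAL thread is expected to call done().
// If the WAL thread never completes the sync (as in this report), the bounded
// wait throws instead of hanging the handler forever.
final class SimpleSyncFuture {
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile long txid = -1;

    void done(long txid) {
        this.txid = txid;
        latch.countDown();
    }

    long get(long timeoutMs) throws InterruptedException, TimeoutException {
        if (!latch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException(
                "Failed to get sync result after " + timeoutMs + " ms, WAL system stuck?");
        }
        return txid;
    }
}

public final class SyncFutureDemo {
    public static void main(String[] args) throws Exception {
        SimpleSyncFuture ok = new SimpleSyncFuture();
        ok.done(42);
        System.out.println(ok.get(10)); // completed sync returns immediately

        SimpleSyncFuture stuck = new SimpleSyncFuture();
        try {
            stuck.get(10); // nobody calls done(): bounded wait times out
        } catch (TimeoutException e) {
            System.out.println("timeout");
        }
    }
}
```

Note this only bounds the handler's wait; as the report shows, the shutdown path can still hang separately on the WAL's internal lock.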
[jira] [Updated] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Elenskiy updated HBASE-22665:
------------------------------------
    Attachment: hbase.log

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> ----------------------------------------------------------
>
>                 Key: HBASE-22665
>                 URL: https://issues.apache.org/jira/browse/HBASE-22665
>             Project: HBase
>          Issue Type: Bug
>         Environment: HBase 2.1.2
>                      Hadoop 3.1.x
>                      CentOS 7.4
>            Reporter: Yechao Chen
>            Priority: Major
>         Attachments: HBASE-22665-UT.patch, hbase.log, image-2019-07-08-16-07-37-664.png, image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, rs.log_part2.zip
>
> We use HBase 2.1.2. When the RS is under heavy QPS, it aborts with an error like
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 30 ms for txid=36380334, WAL system stuck?"
>
> The RegionServer abort then fails when AbstractFSWAL.shutdown hangs.
>
> jstack always shows the regionserver hanging in AbstractFSWAL.shutdown:
>
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x7f18a49b2bb8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815)
>         at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>         at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>         at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117)
>         at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (HBASE-24920) A tool to rewrite corrupted HFiles
[ https://issues.apache.org/jira/browse/HBASE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188033#comment-17188033 ]

Andrey Elenskiy commented on HBASE-24920:
-----------------------------------------

Was away for a week and just attempted to write a prototype. The tool is fairly simple: if HFileScanner.next() throws an exception, search for the DATABLK magic offset in the FSInputStream and resume from there.

However, I realized that it wouldn't be possible to have it be part of hbase-operator-tools. I rely on some private classes in the org.apache.hadoop.hbase.io.hfile package which change quite a bit between versions (2.2.5, 2.3.0, and master are all different), making it impossible as a drop-in jar. So, there are a couple of choices here:
1. Package the tool with dependencies and pin it to some HBase version (which one?).
2. Package it along with HBase as part of the "hbase hfile" command (probably why [~stack] suggested it in the first place ;)).
3. Completely re-implement the needed pieces in hbase-operator-tools so that there are no dependencies on the HBase version (making the tool hard to maintain).

> A tool to rewrite corrupted HFiles
> ----------------------------------
>
>                 Key: HBASE-24920
>                 URL: https://issues.apache.org/jira/browse/HBASE-24920
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: hbase-operator-tools
>            Reporter: Andrey Elenskiy
>            Priority: Major
>
> Typically I have been dealing with corrupted HFiles (due to loss of hdfs
> blocks) by just removing them. However, it always seemed wasteful to throw
> away the entire HFile (which can be hundreds of gigabytes) just because one
> hdfs block (128MB) is missing.
> I think there's a possibility for a tool that can rewrite an HFile by
> skipping corrupted blocks.
> There can be multiple types of issues with hdfs blocks, but any of them can be
> treated as if the block doesn't exist:
> 1. All the replicas can be lost.
> 2. The block can be corrupted due to some bug in hdfs (I've recently run into
> HDFS-15186 by experimenting with EC).
> At the simplest, the tool can be a local mapreduce job (mapper only) with a
> custom HFile reader input that can seek to the next DATABLK to skip corrupted
> hdfs blocks.
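The resync idea sketched in the comment above — on a scanner failure, search forward in the raw bytes for the next data-block magic and resume — can be illustrated without any HBase dependencies. Assumptions: "DATABLK*" is used as the block magic string here, and a plain byte array stands in for the FSInputStream; a real tool would additionally re-parse the block header and verify checksums before trusting the resumed position:

```java
// Hypothetical sketch of the recovery idea above: when a scanner throws,
// scan forward in the raw input for the next block magic and resume there.
// The stream handling is simplified (no FSInputStream, no real HFile parsing).
public final class MagicScanner {
    static final byte[] MAGIC = "DATABLK*".getBytes();

    // Returns the offset of the next occurrence of MAGIC at or after 'from',
    // or -1 if none is found before EOF.
    static long nextBlockOffset(byte[] data, long from) {
        outer:
        for (long off = from; off + MAGIC.length <= data.length; off++) {
            for (int i = 0; i < MAGIC.length; i++) {
                if (data[(int) (off + i)] != MAGIC[i]) continue outer;
            }
            return off;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] file = "garbage...DATABLK*payload1junkDATABLK*payload2".getBytes();
        long first = nextBlockOffset(file, 0);
        long second = nextBlockOffset(file, first + 1);
        System.out.println(first + " " + second); // prints "10 30"
    }
}
```

In the mapper-only job described in the issue, each map task would run this resync loop over its input split and emit only the blocks it could parse after each recovered offset.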
[jira] [Commented] (HBASE-24920) A tool to rewrite corrupted HFiles
[ https://issues.apache.org/jira/browse/HBASE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181463#comment-17181463 ]

Andrey Elenskiy commented on HBASE-24920:
-----------------------------------------

I currently like that `hbase hfile` is read-only and allows debugging things (I use it for debugging the corrupt blocks at the moment); the class is even named HFilePrettyPrinter. On the mailing list Sean suggested putting it in the hbase-operator-tools repo, and I prefer that, as it aligns better with the goal of that repo (correct bugs/inconsistencies via operator intervention). What do you think?
[jira] [Created] (HBASE-24920) A tool to rewrite corrupted HFiles
Andrey Elenskiy created HBASE-24920:
---------------------------------------

             Summary: A tool to rewrite corrupted HFiles
                 Key: HBASE-24920
                 URL: https://issues.apache.org/jira/browse/HBASE-24920
             Project: HBase
          Issue Type: Brainstorming
          Components: hbase-operator-tools
            Reporter: Andrey Elenskiy

Typically I have been dealing with corrupted HFiles (due to loss of hdfs blocks) by just removing them. However, it always seemed wasteful to throw away the entire HFile (which can be hundreds of gigabytes) just because one hdfs block (128MB) is missing.

I think there's a possibility for a tool that can rewrite an HFile by skipping corrupted blocks.

There can be multiple types of issues with hdfs blocks, but any of them can be treated as if the block doesn't exist:
1. All the replicas can be lost.
2. The block can be corrupted due to some bug in hdfs (I've recently run into HDFS-15186 by experimenting with EC).

At the simplest, the tool can be a local mapreduce job (mapper only) with a custom HFile reader input that can seek to the next DATABLK to skip corrupted hdfs blocks.
[jira] [Created] (HBASE-24919) A tool to rewrite corrupted HFiles
Andrey Elenskiy created HBASE-24919:
---------------------------------------

             Summary: A tool to rewrite corrupted HFiles
                 Key: HBASE-24919
                 URL: https://issues.apache.org/jira/browse/HBASE-24919
             Project: HBase
          Issue Type: Brainstorming
          Components: hbase-operator-tools
            Reporter: Andrey Elenskiy

Typically I have been dealing with corrupted HFiles (due to loss of hdfs blocks) by just removing them. However, it always seemed wasteful to throw away the entire HFile (which can be hundreds of gigabytes) just because one hdfs block (128MB) is missing.

I think there's a possibility for a tool that can rewrite an HFile by skipping corrupted blocks.

There can be multiple types of issues with hdfs blocks, but any of them can be treated as if the block doesn't exist:
1. All the replicas can be lost.
2. The block can be corrupted due to some bug in hdfs (I've recently run into HDFS-15186 by experimenting with EC).

At the simplest, the tool can be a local mapreduce job (mapper only) with a custom HFile reader input that can seek to the next DATABLK to skip corrupted hdfs blocks.
[jira] [Commented] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI
[ https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119226#comment-17119226 ]

Andrey Elenskiy commented on HBASE-24438:
-----------------------------------------

The only way I was able to reproduce this is by restarting the master right after it handled an SCP for a regionserver. Looking at the code, I've found this function in ServerCrashProcedure.java that handles setting of the task:
{code:java}
void updateProgress(boolean updateState) {
  String msg = "Processing ServerCrashProcedure of " + serverName;
  if (status == null) {
    status = TaskMonitor.get().createStatus(msg);
    return;
  }
  if (currentRunningState == ServerCrashState.SERVER_CRASH_FINISH) {
    status.markComplete(msg + " done");
    return;
  }
  if (updateState) {
    currentRunningState = getCurrentState();
  }
  int childrenLatch = getChildrenLatch();
  status.setStatus(msg + " current State " + currentRunningState +
    (childrenLatch > 0 ? "; remaining num of running child procedures = " + childrenLatch : ""));
}
{code}
Given that the "status" part of the MonitoredTask is "null" (you can see that the UI just shows "since" for Status), it means that updateProgress was called only once. The places from which updateProgress can be called are:
* executeFromState(), called by ProcedureExecutor — it cannot be the one, as the SCP is not listed in the Procedures & Locks tab and never actually "Started"
* TransitRegionStateProcedure.confirmOpened(), which calls ServerCrashProcedure.updateProgress — also cannot be the one, as the SCP is not in the procedures list
* deserializeStateData(), which is called when the procedure WAL is de-serialized — likely responsible for this "stale" task

The procedure never materializes because it has actually been completed by the previous master, so it just gets de-serialized and thrown away by the new master. Does that sound like a possible case?

It looks like that side effect was indeed added to deserializeStateData by https://issues.apache.org/jira/browse/HBASE-21647, but unfortunately I don't see any comments explaining the reason. Any ideas?

> Stale ServerCrashProcedure task in HBase Master UI
> --------------------------------------------------
>
>                 Key: HBASE-24438
>                 URL: https://issues.apache.org/jira/browse/HBASE-24438
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.2.4
>         Environment: HBase 2.2.4
>                      HDFS 3.1.3 with erasure coding enabled
>                      Kubernetes
>            Reporter: Andrey Elenskiy
>            Priority: Major
>
> Tasks section (show non-RPC Tasks) in the HBase Master UI has stale entries with
> ServerCrashProcedure after master failover. The procedures have finished with
> SUCCESS on a previously active HBase master and aren't showing in "Procedures
> & Locks".
> Based on the logs, both of those regionservers were carrying hbase:meta (logs
> are sorted newest first, grepped for the specific servers that have stale
> ServerCrashProcedures):
> {noformat}
> 2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true in 2.5290sec
> 2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: removed crashed server regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after splitting done
> 2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: dead splitlog workers [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> 2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: Log dir for server regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not exist
> 2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
> 2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: pid=38, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true found RIT pid=20, ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; rit=ABNORMALLY_CLOSED, location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, table=hbase:meta, region=1588230740
> 2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting] in 0ms
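The lifecycle described in the comment above can be condensed into a toy model (all types below are stubs, not HBase's actual TaskMonitor or ServerCrashProcedure): the first updateProgress() call only creates the status and returns, so a status created while deserializing an already-completed procedure is never advanced to markComplete() and lingers as a stale task.

```java
// Stub enum mirroring the two states the quoted code branches on.
enum ServerCrashState { SERVER_CRASH_START, SERVER_CRASH_FINISH }

// Stub for a MonitoredTask status entry shown in the master UI.
final class TaskStatus {
    String state = "RUNNING";
    String msg;
    TaskStatus(String msg) { this.msg = msg; }
    void markComplete(String msg) { this.msg = msg; this.state = "COMPLETE"; }
    void setStatus(String msg) { this.msg = msg; }
}

public final class StaleTaskSketch {
    TaskStatus status;
    ServerCrashState currentRunningState = ServerCrashState.SERVER_CRASH_START;

    // Mirrors the control flow of the quoted updateProgress(): first call
    // creates the status and returns without ever advancing it.
    void updateProgress() {
        String msg = "Processing ServerCrashProcedure";
        if (status == null) {
            status = new TaskStatus(msg); // first call: create and return
            return;
        }
        if (currentRunningState == ServerCrashState.SERVER_CRASH_FINISH) {
            status.markComplete(msg + " done");
            return;
        }
        status.setStatus(msg + " current State " + currentRunningState);
    }

    public static void main(String[] args) {
        // New master deserializes a finished SCP: updateProgress() runs once
        // (from deserializeStateData) and the procedure is then thrown away,
        // so nothing ever calls updateProgress() again to complete the status.
        StaleTaskSketch scp = new StaleTaskSketch();
        scp.updateProgress();
        System.out.println(scp.status.state); // prints "RUNNING": the stale task
    }
}
```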
[jira] [Commented] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119146#comment-17119146 ]

Andrey Elenskiy commented on HBASE-22041:
-----------------------------------------

> When did K8S register in DNS the new pod?

Given the eventually consistent nature of k8s, it's possible that the mapping in DNS is updated after the new regionserver pod has already started. Unfortunately, I can't verify whether that's the case, as DNS isn't managed by us, so I can't add extra logging there. I think for the sake of argument we can assume that the DNS mapping is inconsistent. Although that could be the case on any infra, as DNS can be inconsistent due to caching in multiple places (systemd-resolved, intermediate DNS servers, the Java DNS cache, etc.) or just operators being slow to update records.

> What happens if you run w/ -Dsun.net.inetaddr.ttl=1 instead of 10?

I was able to reproduce this issue with ttl=1 as well as ttl=0 (which I guess means no caching).

> [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, hbasemaster.log, normal.zip
>
> During a fresh master boot, we crash (kill -9) the RS holding meta. We find that
> the master startup fails and prints thousands of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to hadoop14/172.16.1.131:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to
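For the TTL experiments mentioned above: the JVM's positive and negative DNS caches can be tuned either with -Dsun.net.inetaddr.ttl=<seconds> or via the networkaddress.cache.ttl security property shown below. As the comment notes, even ttl=0 (no JVM caching) does not help when the record itself is stale outside the JVM:

```java
import java.security.Security;

// Demonstrates the JVM DNS-cache knobs discussed above. 0 disables caching,
// -1 caches forever; the negative TTL governs failed lookups. None of this
// fixes a DNS record that has not yet been updated by the infrastructure.
public final class DnsTtlDemo {
    public static void main(String[] args) {
        // Equivalent of -Dsun.net.inetaddr.ttl=1: cache positive lookups for 1s.
        Security.setProperty("networkaddress.cache.ttl", "1");
        // Failed (negative) lookups get their own TTL; 0 = don't cache them.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
        System.out.println(Security.getProperty("networkaddress.cache.ttl")); // prints "1"
    }
}
```

These properties must be set before the first InetAddress lookup, which is why the report used the command-line system property instead.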
[jira] [Commented] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI
[ https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118895#comment-17118895 ]

Andrey Elenskiy commented on HBASE-24438:
-----------------------------------------

I guess the issue is the "staleness". The procedures clearly finished during the reign of the previously active master. However, the new master still shows the ServerCrashProcedures as running for multiple hours in the "Tasks" -> "show non-RPC Tasks" UI. Let me know if I should rephrase the issue or provide more info to help you identify the root cause.

> Stale ServerCrashProcedure task in HBase Master UI
> --------------------------------------------------
>
>                 Key: HBASE-24438
>                 URL: https://issues.apache.org/jira/browse/HBASE-24438
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.2.4
>         Environment: HBase 2.2.4
>                      HDFS 3.1.3 with erasure coding enabled
>                      Kubernetes
>            Reporter: Andrey Elenskiy
>            Priority: Major
>
> Tasks section (show non-RPC Tasks) in the HBase Master UI has stale entries with
> ServerCrashProcedure after master failover. The procedures have finished with
> SUCCESS on a previously active HBase master and aren't showing in "Procedures
> & Locks".
> Based on the logs, both of those regionservers were carrying hbase:meta (logs
> are sorted newest first, grepped for the specific servers that have stale
> ServerCrashProcedures):
> {noformat}
> 2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true in 2.5290sec
> 2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: removed crashed server regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after splitting done
> 2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: dead splitlog workers [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> 2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: Log dir for server regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not exist
> 2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
> 2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: pid=38, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true found RIT pid=20, ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; rit=ABNORMALLY_CLOSED, location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, table=hbase:meta, region=1588230740
> 2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting] in 0ms
> 2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager: hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting dir is empty, no logs to split.
> 2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager: dead splitlog workers [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> 2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: Start pid=38, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true
> 2020-05-21 19:04:06,544 INFO [RegionServerTracker-0] assignment.AssignmentManager: Scheduled SCP pid=38 for regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 (carryingMeta=true) regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write locks = 1, Read locks = 0], oldState=ONLINE.
> 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager: Processing expiration of regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
> 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> ...
> 2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27]
[jira] [Commented] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI
[ https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116916#comment-17116916 ] Andrey Elenskiy commented on HBASE-24438: - Looks like the showing of ServerCrashProcedure tasks in the UI was introduced in https://issues.apache.org/jira/browse/HBASE-21647
[jira] [Created] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI
Andrey Elenskiy created HBASE-24438: --- Summary: Stale ServerCrashProcedure task in HBase Master UI Key: HBASE-24438 URL: https://issues.apache.org/jira/browse/HBASE-24438 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.2.4 Environment: HBase 2.2.4 HDFS 3.1.3 with erasure coding enabled Kubernetes Reporter: Andrey Elenskiy
[jira] [Commented] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114447#comment-17114447 ] Andrey Elenskiy commented on HBASE-22041: - > Oh, I bet HDFS gets confused too... (but maybe not – IIRC, it creates a name > to use referring to the DN...) Let me check logs. So for Hadoop we route clients by hostnames (dfs.client.use.datanode.hostname) and provision a k8s service per datanode, which results in a stable IP per datanode. That's a workaround to the bug (https://issues.apache.org/jira/browse/HDFS-15250), which I don't think was properly addressed there. Otherwise, we could have just relied on hostnames of the pods without needing a service. (During the pod restart, its hostname is also removed from DNS, resulting in UnresolvedHostnameException for clients). Ideally, HBase on k8s would not cache any IPs (stateless connections) and would not rely on hostnames (kinda like Kafka brokers), but that's probably not easy to change. > [k8s] The crashed node exists in onlineServer forever, and if it holds the > meta data, master will start up hang. > > > Key: HBASE-22041 > URL: https://issues.apache.org/jira/browse/HBASE-22041 > Project: HBase > Issue Type: Bug >Reporter: lujie >Priority: Critical > Attachments: bug.zip, hbasemaster.log, normal.zip > > > While the master freshly boots, we crash (kill -9) the RS who holds meta. We find > that the master startup fails and prints thousands of logs like: > {code:java} > 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to > hadoop14/172.16.1.131:16020 failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > syscall:getsockopt(..) failed: Connection refused: > hadoop14/172.16.1.131:16020, try=0, retrying... 
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying... > 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying... > 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying... > 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying... 
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying... > 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying... > 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to >
[jira] [Commented] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113423#comment-17113423 ] Andrey Elenskiy commented on HBASE-22041: - Attached the entire hbasemaster log (hbasemaster.log) with TRACE enabled right before trying to reproduce the issue. The time I triggered the issue was "Thu May 21 17:28:42 UTC 2020". And the topology looked like so: {noformat} hbasemaster-0 10.128.25.30 hbasemaster-1 10.128.6.51 regionserver-0 10.128.53.53 regionserver-1 10.128.9.37 regionserver-2 10.128.14.39{noformat} The way I trigger the issue is by picking a regionserver with 0 regions (because it was restarted recently), triggering "balancer" and killing the regionserver during the execution of the balancer. In this case the regionserver I killed was regionserver-2. Here's how the topology looked after regionserver-2 came back up: {noformat} hbasemaster-0 10.128.25.30 hbasemaster-1 10.128.6.51 regionserver-0 10.128.53.53 regionserver-1 10.128.9.37 regionserver-2 10.128.14.40{noformat} You can see that regionserver-2 came back up with IP 10.128.14.40, but hbasemaster still tries to contact 10.128.14.39
[jira] [Updated] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-22041: Attachment: hbasemaster.log
[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112500#comment-17112500 ] Andrey Elenskiy commented on HBASE-22041: - > It starts after the container comes back w/ new IP (looks the same though > across textboxes)? Right, the pod starts with a new IP address, but for some reason the master is still trying to reach the old IP address. > What are the dns timeouts on this host? We set the -Dsun.net.inetaddr.ttl=10 option and it seems to work in other places. Is it necessary to set the "networkaddress.cache.ttl" option as well?
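For reference, the JVM's positive DNS cache duration is controlled by the java.security property networkaddress.cache.ttl (with failed lookups cached separately under networkaddress.cache.negative.ttl); -Dsun.net.inetaddr.ttl is the legacy Sun-specific spelling of the same knob, and whether it alone is honored is JVM-dependent. A minimal sketch of setting both programmatically, assuming it runs before the first name lookup:

```java
import java.net.InetAddress;
import java.security.Security;

public class DnsTtlDemo {
    public static void main(String[] args) throws Exception {
        // These are java.security properties, not system properties, and they
        // are read when the InetAddress cache policy initializes, so set them
        // before any hostname resolution happens in the process.
        Security.setProperty("networkaddress.cache.ttl", "10");          // cache successful lookups for 10s
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // cache failed lookups for 10s

        // Lookups after this point go through the freshly configured cache.
        System.out.println(InetAddress.getByName("localhost").getHostAddress());
    }
}
```

Note this only shortens the JVM-level cache; any library that stores a resolved address in its own data structures is unaffected by these TTLs.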
[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110623#comment-17110623 ] Andrey Elenskiy commented on HBASE-22041: - Just reproduced again and I'm seeing ServerCrashProcedure being stuck for the regionserver that it's trying to reconnect to with state=WAITING:SERVER_CRASH_FINISH. And, ServerCrashProcedure is waiting for TransitRegionStateProcedure procedure with state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED. And, TransitRegionStateProcedure is waiting for OpenRegionProcedure procedure with state=RUNNABLE Regions in transition are in OPENNING state for the regionserver that exists. If I'm understanding the logs correctly, it's trying to connect to old IP address for the restarted regionserver. In kubernetes when pod is restarted it gets a new IP address and preserves hostname (if it's a statefulset). So, there's some assumption somewhere in HBase that IP address doesn't change or it caches the IP address resolution. In this particular case it looks like it's trying to correctly assign regions to the online regionserver but still uses old IP address. > The crashed node exists in onlineServer forever, and if it holds the meta > data, master will start up hang. > -- > > Key: HBASE-22041 > URL: https://issues.apache.org/jira/browse/HBASE-22041 > Project: HBase > Issue Type: Bug >Reporter: lujie >Priority: Critical > Attachments: bug.zip, normal.zip > > > while master fresh boot, we crash (kill- 9) the RS who hold meta. we find > that the master startup fails and print thounds of logs like: > {code:java} > 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to > hadoop14/172.16.1.131:16020 failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > syscall:getsockopt(..) 
failed: Connection refused: > hadoop14/172.16.1.131:16020, try=0, retrying... > 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying... > 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying... > 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying... > 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying... 
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying... > 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] > procedure.RSProcedureDispatcher: request to server > hadoop14,16020,1552410583724 failed due to > org.apache.hadoop.hbase.ipc.FailedServerException: Call to > hadoop14/172.16.1.131:16020 failed on local exception: > org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the > failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying... > 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] >
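The stale-address behavior described in the comment above can be contrasted with a retry loop that re-resolves DNS on every attempt, so a pod restart (same hostname, new IP) between attempts is picked up. A minimal Python sketch, not HBase's actual dispatcher code; the helper name and `send_fn` callback are hypothetical:

```python
import socket

def send_with_fresh_resolution(hostname, port, send_fn, max_tries=3):
    """Hypothetical retry helper: re-resolve the hostname on every
    attempt instead of caching the first answer, so a restarted pod
    with a new IP is picked up on the next try."""
    last_err = None
    for _ in range(max_tries):
        ip = socket.gethostbyname(hostname)  # fresh lookup each try
        try:
            return send_fn(ip, port)
        except ConnectionError as err:
            last_err = err  # remember the failure and retry
    raise last_err
```

The key difference from a cached `InetSocketAddress` is that the resolution happens inside the loop, not once before it.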
[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110601#comment-17110601 ] Andrey Elenskiy commented on HBASE-22041: - Here's the error that shows that the address is getting readded to failed servers: {code:java} 2020-05-18 17:52:20,249 TRACE [RSProcedureDispatcher-pool3-t133] procedure.RSProcedureDispatcher: Building request with operations count=1 2020-05-18 17:52:20,249 TRACE [RSProcedureDispatcher-pool3-t133] ipc.NettyRpcConnection: Connecting to regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 2020-05-18 17:52:20,254 TRACE [RS-EventLoopGroup-1-1] ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 5ms 2020-05-18 17:52:20,254 DEBUG [RS-EventLoopGroup-1-1] ipc.FailedServers: Added failed server with address regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 to list caused by org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 2020-05-18 17:52:20,254 DEBUG [RSProcedureDispatcher-pool3-t133] procedure.RSProcedureDispatcher: request to regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed, try=1480 2020-05-18 17:52:20,255 WARN [RSProcedureDispatcher-pool3-t133] procedure.RSProcedureDispatcher: request to server regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed due to java.net.ConnectException: Call to regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020, try=1480, retrying... ... 10 more Caused by: java.net.ConnectException: finishConnect(..) failed: No route to host ... 
6 more at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:644) at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:667) at org.apache.hbase.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:269) at org.apache.hbase.thirdparty.io.netty.channel.unix.Errors.throwConnectException(Errors.java:112) Caused by: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 at java.lang.Thread.run(Thread.java:748) at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905) at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:328) at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417) at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524) at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650) at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:631) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540) at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:474) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:495) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502) at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:261) at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:267) at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71) at org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179) at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:924) at
[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110500#comment-17110500 ] Andrey Elenskiy commented on HBASE-22041: - We are seeing the same issue on 2.2.4 running in Kubernetes. It appears to be due to the fact that the address of the failed regionserver keeps getting re-added to the failed servers list when RSProcedureDispatcher.sendRequest is called. Here's the output with TRACE logging enabled: {code:java} 2020-05-18 17:52:19,643 TRACE [RSProcedureDispatcher-pool3-t127] procedure.RSProcedureDispatcher: Building request with operations count=1 2020-05-18 17:52:19,644 DEBUG [RSProcedureDispatcher-pool3-t127] ipc.AbstractRpcClient: Not trying to connect to regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 this server is in the failed servers list 2020-05-18 17:52:19,644 TRACE [RSProcedureDispatcher-pool3-t127] ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 0ms 2020-05-18 17:52:19,644 DEBUG [RSProcedureDispatcher-pool3-t127] procedure.RSProcedureDispatcher: request to regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed, try=1474 org.apache.hadoop.hbase.ipc.FailedServerException: Call to regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97) at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419) at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117) at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585) at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006) at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349) at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 at org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433) ... 
9 more 2020-05-18 17:52:19,644 WARN [RSProcedureDispatcher-pool3-t127] procedure.RSProcedureDispatcher: request to server regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020, try=1474, retrying...{code} In our case it doesn't recover automatically, and we have to restart the HBase master to get out of this state.
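The retry spam above is consistent with a failed-servers list that uses a short expiry window: an address is skipped while it is on the list, but each failed send re-adds it, so a permanently dead address cycles between "skip" and "retry and re-add" forever. A simplified Python model of that mechanism (class and parameter names, and the 2-second window, are illustrative, not HBase's actual implementation):

```python
import time

class FailedServers:
    """Simplified model of a failed-servers list with per-entry expiry.
    An address is skipped for `retry_delay_s` after a failure; once the
    entry expires a retry is allowed, and a fresh failure re-adds it."""

    def __init__(self, retry_delay_s=2.0):
        self.retry_delay_s = retry_delay_s
        self._failed = {}  # address -> expiry timestamp

    def add(self, address, now=None):
        now = time.monotonic() if now is None else now
        self._failed[address] = now + self.retry_delay_s

    def is_failed(self, address, now=None):
        now = time.monotonic() if now is None else now
        expiry = self._failed.get(address)
        if expiry is None:
            return False
        if now >= expiry:
            del self._failed[address]  # entry expired; allow a retry
            return False
        return True
```

With this design, nothing ever removes a dead server for good, which matches the `try=1474` counter in the log above: the dispatcher keeps alternating between `FailedServerException` (entry still fresh) and a new connect failure that re-adds the entry.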
[jira] [Commented] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles
[ https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100334#comment-17100334 ] Andrey Elenskiy commented on HBASE-24273: - Great, thanks for a quick fix! > HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles > > > Key: HBASE-24273 > URL: https://issues.apache.org/jira/browse/HBASE-24273 > Project: HBase > Issue Type: Bug > Components: hbck2 >Affects Versions: 2.2.4 > Environment: HBase 2.2.4 > Hadoop 3.1.3 >Reporter: Andrey Elenskiy >Priority: Critical > Fix For: 3.0.0-alpha-1, 2.3.0 > > > This issue came up after merging regions. MergeTableRegionsProcedure removes > the parent regions from hbase:meta and creates HFile references in child > region to the old parent regions. Running `hbck_chore_run` right after the > `merge_region` will show the parent regions in "Orphan Regions on FileSystem" > until major compaction is run on child region which will remove HFile > references and cause Catalog Janitor to clean up the parent regions. > There are probably other situations which can cause the same issue (maybe > region split?) > Having "Orphan Regions on FileSystem" list parent regions and suggest to > "_hbase completebulkload_" is dangerous in this case as completing bulk load > will lead to stale HFile references in child region which will cause its OPEN > to fail because referenced HFile doesn't exist. > Figuring out these things for database administrators is tedious, so I think > it would be reasonable to not consider regions with referenced HFiles to be > orphans (or maybe could give an extra hint saying that it has referenced > HFiles). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24255) GCRegionProcedure doesn't assign region from RegionServer leading to orphans
[ https://issues.apache.org/jira/browse/HBASE-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095956#comment-17095956 ] Andrey Elenskiy commented on HBASE-24255: - [~huaxiangsun] yes, you got the idea right. > but somehow the merge*** qualifers were not cleaned up from new merged child > region in meta table (maybe master crashed before > GCMultipleMergedRegionsProcedure is started) That's actually due to HBASE-24273: addMissingRegionsInMeta will read those "orphans" without checking that a merge qualifier exists. I think fixing HBASE-24273 will resolve this particular instance. But I'm still wondering if there are other situations where GCRegionProcedure should also make sure the region is unassigned from the regionserver; that would be more generic, as I've seen this happen even without region merges (I don't recall the case anymore). > GCRegionProcedure doesn't assign region from RegionServer leading to orphans > > > Key: HBASE-24255 > URL: https://issues.apache.org/jira/browse/HBASE-24255 > Project: HBase > Issue Type: Bug > Components: proc-v2, Region Assignment, regionserver >Affects Versions: 2.2.4 > Environment: hbase 2.2.4 > hadoop 3.1.3 >Reporter: Andrey Elenskiy >Assignee: niuyulin >Priority: Major > > We've found ourselves in a situation where parents of merged or split regions > needed to be opened again on a regionserver due to having to recover from a > cluster meltdown (HBCK2's fixMeta kicks off GCMultipleMergedRegionsProcedure, > which requires all regions to be merged to be open). Then, when a > GCRegionProcedure is kicked off to clean up a parent region by > GCMultipleMergedRegionsProcedure, it ends up deleting it from hbase:meta, but > doesn't unassign it from the RegionServer, leading it to show up in "Orphan > Regions on RegionServer" in the hbck tab of HBase Master. 
Also, the hbase client > doesn't detect that the region is closed either, because it's still > technically open on a regionserver (it doesn't reread hbase:meta all the > time). The only way to recover from this is to restart the regionserver, which > isn't ideal as it can lead to other issues in clusters with region > inconsistencies.
[jira] [Commented] (HBASE-24255) GCRegionProcedure doesn't assign region from RegionServer leading to orphans
[ https://issues.apache.org/jira/browse/HBASE-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094943#comment-17094943 ] Andrey Elenskiy commented on HBASE-24255: - I don't really see how that addresses the issue in the description. The problem I was trying to describe can happen if I run HBCK2's addMissingRegionsInMeta, which ends up re-adding parents of a merged region into meta and assigns them to a RegionServer. Then, when GCRegionProcedure runs, it removes the region from hbase:meta and the FS, but doesn't unassign the region from the regionserver. Hence, I'd like to see GCRegionProcedure actually make sure that the region is not assigned on any regionserver (otherwise we end up with "Orphan Regions on RegionServer").
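The fix requested above is essentially an ordering constraint: unassign the region from its regionserver before deleting its meta row and files. A hypothetical Python model of that ordering (the data structures are toy stand-ins, not the real procedure code):

```python
def gc_region(region, meta, fs, assignment):
    """Hypothetical GC step order suggested in the comment: confirm the
    region is unassigned from every regionserver *before* deleting its
    meta row and files, so no server is left serving a deleted region.
    `meta` and `fs` are sets of region names; `assignment` maps a
    region to the server currently holding it open."""
    server = assignment.get(region)
    if server is not None:
        # Unassign first; otherwise the region stays "open" on the
        # server and shows up as an orphan in the hbck report.
        del assignment[region]
    meta.discard(region)
    fs.discard(region)
```

The failure mode described in the comment corresponds to skipping the `assignment` step while still performing the two deletes.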
[jira] [Commented] (HBASE-24250) CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region
[ https://issues.apache.org/jira/browse/HBASE-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094713#comment-17094713 ] Andrey Elenskiy commented on HBASE-24250: - > the critical problem is why GCMultipleMergedRegionsProcedure not work It did work; there were just so many of them because the catalog janitor kept resubmitting them for the same regions over and over again. > and then something went wrong and caused a different procedure to stall We had an issue with running out of direct memory causing regionservers to crashloop and the master to keep resubmitting region assignments. I think that's a separate issue from what I've addressed here. > CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region > - > > Key: HBASE-24250 > URL: https://issues.apache.org/jira/browse/HBASE-24250 > Project: HBase > Issue Type: Bug > Components: master >Affects Versions: 2.2.4 > Environment: hdfs 3.1.3 with erasure coding > hbase 2.2.4 >Reporter: Andrey Elenskiy >Assignee: niuyulin >Priority: Major > > If a lot of regions were merged (due to a change of region sizes, for example), > there can be a long backlog of procedures to clean up the merged regions. If > going through this backlog is slower than the CatalogJanitor's scan interval, > it will end up resubmitting GCMultipleMergedRegionsProcedure for the same > regions over and over again.
[jira] [Commented] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles
[ https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094048#comment-17094048 ] Andrey Elenskiy commented on HBASE-24273: - Hm, interesting that you say the parent regions should be repopulated from hbase:meta; maybe that's the main root cause of the issue here. MergeTableRegionsProcedure calls updateMetaForMergedRegions in the MERGE_TABLE_REGIONS_UPDATE_META state. That function calls AssignmentManager.markRegionAsMerged, which actually deletes the region from hbase:meta ([https://github.com/apache/hbase/blob/346d087f409f9b44754d1d4426492c1ecd02ea89/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1866] — it eventually calls MetaTableAccessor.mergeRegions, which sends a delete to hbase:meta). So that's why I observe that the orphans in the FS stick around until major compaction completes and the catalog janitor GCs them. If you say it should repopulate from hbase:meta, maybe it shouldn't remove it from there?
[jira] [Commented] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles
[ https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094016#comment-17094016 ] Andrey Elenskiy commented on HBASE-24273: - But if the master is restarted, the in-memory state will be lost and we'd end up in the same situation, which is unexpected IMHO. It would probably be more reliable to actually cross-reference the state against the FS, given that the category is called "Orphan regions *on FileSystem*". HbckChore.loadRegionsFromFS looks like the place where the logic to count back references could be added.
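The cross-check proposed in this thread — only report a filesystem region as an orphan if it is missing from hbase:meta and no surviving region holds HFile references back to it — might look like this in outline. A Python sketch with hypothetical inputs, not the actual HbckChore logic:

```python
def orphan_regions(fs_regions, meta_regions, hfile_refs):
    """Sketch of the proposed check: a region found on the filesystem
    is reported as an orphan only if it is absent from hbase:meta *and*
    no other region references its HFiles. `hfile_refs` maps a region
    to the set of regions whose HFiles it references (e.g. a merged
    child referencing its parents)."""
    referenced = set()
    for refs in hfile_refs.values():
        referenced.update(refs)
    return {
        region for region in fs_regions
        if region not in meta_regions and region not in referenced
    }
```

Under this rule, a merge parent awaiting catalog-janitor cleanup is excluded from the orphan report (the child still references it), while a truly dangling directory is still flagged.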
[jira] [Commented] (HBASE-24189) Regionserver recreates region folders in HDFS after replaying WAL with removed table entries
[ https://issues.apache.org/jira/browse/HBASE-24189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093922#comment-17093922 ] Andrey Elenskiy commented on HBASE-24189: - Haven't planned on a patch; I'm not exactly certain what the right solution is here without risking data loss. > But what if the table is deleted and not recreated. I haven't actually tested this; it could be the same case, or maybe the regionserver actually checks that the table doesn't exist anymore and so doesn't create the directories. > We might have to check whether the region exists or not also as part of the > last flushed seqId look up and if the regions does not exists at all, we > might have to just ignore those entries from WAL. Would this be a safe thing to do? I'm not familiar with the edge cases, but if the WAL isn't flushed before the region is removed, couldn't it cause data loss? For example, if a region is split or merged and the WAL isn't flushed before the child region opens and the parent regions close (I don't know if it always gets flushed in those cases), then GCRegionProcedure will remove the parent regions, and any remaining WAL edits for the parent regions should be replayed into the child region instead of being discarded. 
> Regionserver recreates region folders in HDFS after replaying WAL with > removed table entries > > > Key: HBASE-24189 > URL: https://issues.apache.org/jira/browse/HBASE-24189 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Affects Versions: 2.2.4 > Environment: * HDFS 3.1.3 > * HBase 2.1.4 > * OpenJDK 8 >Reporter: Andrey Elenskiy >Assignee: Anoop Sam John >Priority: Major > > Under the following scenario region directories in HDFS can be recreated with > only recovered.edits in them: > # Create table "test" > # Put into "test" > # Delete table "test" > # Create table "test" again > # Crash the regionserver to which the put went, to force the WAL replay > # Region directory in old table is recreated in new table > # hbase hbck returns inconsistency > This appears to happen due to the fact that WALs are not cleaned up once a > table is deleted and they still contain the edits from the old table. I've tried > the wal_roll command on the regionserver before crashing it, but it doesn't seem > to help, as under some circumstances there are still WAL files around. The > only solution that works consistently is to restart the regionserver before > creating the table at step 4, because that triggers log cleanup on startup: > https://github.com/apache/hbase/blob/f3ee9b8aa37dd30d34ff54cd39fb9b4b6d22e683/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java#L508 > > Truncating a table would also be a workaround, but in our case it's a no-go as > we create and delete tables in our tests, which run back to back (create a table > at the beginning of the test and delete it at the end). 
> A nice option in our case would be to provide an hbase shell utility to force > cleanup of log files manually, as I realize that it's not really viable to > clean all of those up every time a table is removed.
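The data-loss concern raised in the comment above amounts to a filtering rule for WAL replay: it is only safe to drop an edit when its region (and table) is truly gone, while edits for a merged or split parent should be redirected to the surviving child rather than discarded. A hypothetical Python sketch of that rule, not HBase's actual replay logic; all names are made up:

```python
def filter_wal_edits(edits, live_regions, merged_into):
    """Sketch of the trade-off discussed above: drop edits for regions
    that no longer exist anywhere, but redirect edits whose region was
    merged/split into a child, so unflushed parent edits are not lost.
    `edits` is a list of (region, edit) pairs; `merged_into` maps a
    removed parent region to its surviving child."""
    kept = []
    for region, edit in edits:
        if region in live_regions:
            kept.append((region, edit))
        elif region in merged_into:
            # Parent was GC'd but its data lives on in the child:
            # replay the edit there instead of discarding it.
            kept.append((merged_into[region], edit))
        # else: region and its table are gone entirely; drop the edit
    return kept
```

The hard part, as the comment notes, is the `else` branch: simply ignoring unknown regions is only safe if a deleted table's edits can never matter again.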
[jira] [Commented] (HBASE-24250) CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region
[ https://issues.apache.org/jira/browse/HBASE-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093918#comment-17093918 ] Andrey Elenskiy commented on HBASE-24250: - Yes, GCMultipleMergedRegionsProcedure and GCRegionProcedure are idempotent. However, the pile-up can lead to a pretty annoying situation where the HBase master has to churn through all of them before proceeding to actually useful procedures. For example, in our cluster we ended up merging over 700 regions, and then something went wrong and caused a different procedure to stall (which unfortunately happens more often than I would like). As we didn't notice the stalled procedure right away, we ended up with over 20k GCMultipleMergedRegionsProcedure in the backlog. It was quite tedious to figure out why we had so many of those, figure out that we needed to disable the catalog janitor, bypass all of the GC procedures via HBCK2, and then get to actually fixing the stalled procedure. This caused pretty long downtime for the entire cluster.
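The resubmission problem described in this issue suggests de-duplicating against procedures that are already in flight: the janitor's scan should skip merge parents for which a GC procedure has been submitted but not yet completed. A minimal Python sketch of that idea (function and variable names are illustrative, not the CatalogJanitor's real code):

```python
def submit_gc_procedures(merged_parents, in_flight, submit):
    """Sketch of scan-time de-duplication: skip any group of merge
    parents that already has a GC procedure in flight, instead of
    resubmitting one on every janitor scan. `merged_parents` is a list
    of parent-region groups found in meta; `in_flight` is a mutable set
    of already-submitted groups; `submit` enqueues a procedure."""
    for parents in merged_parents:
        key = frozenset(parents)
        if key in in_flight:
            continue  # a procedure for these parents is already queued
        in_flight.add(key)
        submit(parents)
```

A completed procedure would remove its key from `in_flight`; without that de-duplication, any scan interval shorter than the backlog drain time reproduces the 20k-procedure pile-up described above.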
[jira] [Updated] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles
[ https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-24273: Description: This issue came up after merging regions. MergeTableRegionsProcedure removes the parent regions from hbase:meta and creates HFile references in child region to the old parent regions. Running `hbck_chore_run` right after the `merge_region` will show the parent regions in "Orphan Regions on FileSystem" until major compaction is run on child region which will remove HFile references and cause Catalog Janitor to clean up the parent regions. There are probably other situations which can cause the same issue (maybe region split?) Having "Orphan Regions on FileSystem" list parent regions and suggest to "_hbase completebulkload_" is dangerous in this case as completing bulk load will lead to stale HFile references in child region which will cause its OPEN to fail because referenced HFile doesn't exist. Figuring out these things for database administrators is tedious, so I think it would be reasonable to not consider regions with referenced HFiles to be orphans (or maybe could give an extra hint saying that it has referenced HFiles). was: This issue came up after merging regions. MergeTableRegionsProcedure removes the parent regions from hbase:meta and creates HFile references in child region to the old parent regions. Running `hbck_chore_run` right after the `merge_region` will show the parent regions in "Orphan Regions on FileSystem" until major compaction is run on child region which will remove HFile references and cause Catalog Janitor to clean up the parent regions. There are probably other situations which can cause the same issue (maybe region split?) 
Having "Orphan Regions on FileSystem" list parent regions and suggest to "_hbase completebulkload_" is dangerous in this case as completing the bulk load will lead to stale HFile references in the child region which will cause its OPEN to fail because the referenced HFile doesn't exist. Figuring out these things is tedious for database administrators, so I think it would be reasonable to not consider regions with referenced HFiles to be orphans (or maybe give an extra hint saying that they have referenced HFiles). > HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles > > > Key: HBASE-24273 > URL: https://issues.apache.org/jira/browse/HBASE-24273 > Project: HBase > Issue Type: Bug > Components: hbck2 >Affects Versions: 2.2.4 > Environment: HBase 2.2.4 > Hadoop 3.1.3 >Reporter: Andrey Elenskiy >Priority: Critical > Fix For: 3.0.0, 2.3.0 > > > This issue came up after merging regions. MergeTableRegionsProcedure removes > the parent regions from hbase:meta and creates HFile references in child > region to the old parent regions. Running `hbck_chore_run` right after the > `merge_region` will show the parent regions in "Orphan Regions on FileSystem" > until major compaction is run on child region which will remove HFile > references and cause Catalog Janitor to clean up the parent regions. > There are probably other situations which can cause the same issue (maybe > region split?) > Having "Orphan Regions on FileSystem" list parent regions and suggest to > "_hbase completebulkload_" is dangerous in this case as completing bulk load > will lead to stale HFile references in child region which will cause its OPEN > to fail because referenced HFile doesn't exist. > Figuring out these things for database administrators is tedious, so I think > it would be reasonable to not consider regions with referenced HFiles to be > orphans (or maybe could give an extra hint saying that it has referenced > HFiles). 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles
Andrey Elenskiy created HBASE-24273: --- Summary: HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles Key: HBASE-24273 URL: https://issues.apache.org/jira/browse/HBASE-24273 Project: HBase Issue Type: Bug Components: hbck2 Affects Versions: 2.2.4 Environment: HBase 2.2.4 Hadoop 3.1.3 Reporter: Andrey Elenskiy This issue came up after merging regions. MergeTableRegionsProcedure removes the parent regions from hbase:meta and creates HFile references in the child region to the old parent regions. Running `hbck_chore_run` right after the `merge_region` will show the parent regions in "Orphan Regions on FileSystem" until major compaction is run on the child region, which will remove the HFile references and cause Catalog Janitor to clean up the parent regions. There are probably other situations which can cause the same issue (maybe region split?) Having "Orphan Regions on FileSystem" list parent regions and suggest to "_hbase completebulkload_" is dangerous in this case as completing the bulk load will lead to stale HFile references in the child region which will cause its OPEN to fail because the referenced HFile doesn't exist. Figuring out these things is tedious for database administrators, so I think it would be reasonable to not consider regions with referenced HFiles to be orphans (or maybe give an extra hint saying that they have referenced HFiles). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24255) GCRegionProcedure doesn't unassign region from RegionServer leading to orphans
Andrey Elenskiy created HBASE-24255: --- Summary: GCRegionProcedure doesn't unassign region from RegionServer leading to orphans Key: HBASE-24255 URL: https://issues.apache.org/jira/browse/HBASE-24255 Project: HBase Issue Type: Bug Components: proc-v2, Region Assignment, regionserver Affects Versions: 2.2.4 Environment: hbase 2.2.4 hadoop 3.1.3 Reporter: Andrey Elenskiy We've found ourselves in a situation where parents of merged or split regions needed to be opened again on a regionserver due to having to recover from a cluster meltdown (HBCK2's fixMeta kicks off GCMultipleMergedRegionsProcedure, which requires all regions being merged to be open). Then, when a GCRegionProcedure is kicked off by GCMultipleMergedRegionsProcedure to clean up a parent region, it ends up deleting it from hbase:meta but doesn't unassign it from the RegionServer, causing it to show up in "Orphan Regions on RegionServer" in the hbck tab of HBase Master. Also, the hbase client doesn't detect that the region is closed because it's still technically open on a regionserver (the client doesn't reread hbase:meta all the time). The only way to recover from this is to restart the regionserver, which isn't ideal as it can lead to other issues in clusters with region inconsistencies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24250) CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region
Andrey Elenskiy created HBASE-24250: --- Summary: CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region Key: HBASE-24250 URL: https://issues.apache.org/jira/browse/HBASE-24250 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.2.4 Environment: hdfs 3.1.3 with erasure coding hbase 2.2.4 Reporter: Andrey Elenskiy If a lot of regions were merged (due to a change of region sizes, for example), there can be a long backlog of procedures to clean up the merged regions. If going through this backlog is slower than the CatalogJanitor's scan interval, it will end up resubmitting GCMultipleMergedRegionsProcedure for the same regions over and over again. -- This message was sent by Atlassian Jira (v8.3.4#803005)
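The resubmission described above is essentially a missing deduplication step. A minimal sketch of the idea (a hypothetical class, not HBase's actual CatalogJanitor code): remember which merged regions already have a pending GC procedure, so a janitor scan that is faster than the backlog cannot submit a duplicate.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the fix: track regions that already have a
// pending GC procedure so repeated CatalogJanitor scans cannot resubmit them.
public class GcSubmissionTracker {
    private final Set<String> pending = new HashSet<>();

    // Returns true only the first time a region is submitted.
    public synchronized boolean trySubmit(String encodedRegionName) {
        return pending.add(encodedRegionName);
    }

    // Called when the GC procedure for the region completes.
    public synchronized void markDone(String encodedRegionName) {
        pending.remove(encodedRegionName);
    }

    public static void main(String[] args) {
        GcSubmissionTracker tracker = new GcSubmissionTracker();
        System.out.println(tracker.trySubmit("abc123")); // true: first submission
        System.out.println(tracker.trySubmit("abc123")); // false: duplicate suppressed
        tracker.markDone("abc123");
        System.out.println(tracker.trySubmit("abc123")); // true: allowed again after completion
    }
}
```

With such a guard, a scan interval shorter than the backlog drain time would simply see `trySubmit` return false for regions already queued.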
[jira] [Created] (HBASE-24189) Regionserver recreates region folders in HDFS after replaying WAL with removed table entries
Andrey Elenskiy created HBASE-24189: --- Summary: Regionserver recreates region folders in HDFS after replaying WAL with removed table entries Key: HBASE-24189 URL: https://issues.apache.org/jira/browse/HBASE-24189 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 2.2.4 Environment: * HDFS 3.1.3 * HBase 2.1.4 * OpenJDK 8 Reporter: Andrey Elenskiy Under the following scenario region directories in HDFS can be recreated with only recovered.edits in them: # Create table "test" # Put into "test" # Delete table "test" # Create table "test" again # Crash the regionserver the put went to in order to force the WAL replay # The region directory from the old table is recreated in the new table # hbase hbck returns an inconsistency This appears to happen due to the fact that WALs are not cleaned up once a table is deleted and they still contain the edits from the old table. I've tried the wal_roll command on the regionserver before crashing it, but it doesn't seem to help as under some circumstances there are still WAL files around. The only solution that works consistently is to restart the regionserver before creating the table at step 4 because that triggers log cleanup on startup: [https://github.com/apache/hbase/blob/f3ee9b8aa37dd30d34ff54cd39fb9b4b6d22e683/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java#L508] Truncating the table would also be a workaround, but in our case it's a no-go as we create and delete tables in our tests which run back to back (create the table at the beginning of the test and delete it at the end). A nice option in our case would be an hbase shell utility to force cleanup of log files manually, as I realize that it's not really viable to clean all of those up every time some table is removed. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861636#comment-16861636 ] Andrey Elenskiy commented on HBASE-21476: - Hello, I'd like to restart trying to get this merged as we've been running this change in production for some time now, and upgrades are tough because we need to recompile every time. However, I'm not a fan of all the "if" checks that I've added around and would like to make the implementation a bit more flexible. What do you think about making java.time.Clock a field of HRegion (so each region would have its own clock, as described here: [https://docs.google.com/a/arista.com/document/d/1LL2GAodiYi0waBz5ODGL4LDT4e_bXy8P9h6kWC05Bhw/edit?disco=AQ5zZuM])? Then most of the places within HRegion would use java.time.Instant and occasionally use a public helper function in HRegion that would convert Instant to either milliseconds or nanoseconds (depending on whether the region's table has the NANOSECOND_TIMESTAMPS attribute). The nice thing about this is that nanosecond vs millisecond decisions would be contained to a single function and there's no need to pass around "isNanosecondTimestamps" everywhere. StoreScanner and various compaction routines would also use the region's clock (instead of EnvironmentEdge) to make decisions, which is also a good side effect. This approach can also be implemented in iterative steps. 
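The per-region Clock proposal above can be sketched roughly as follows (all names here are illustrative, not actual HBase classes, and the NANOSECOND_TIMESTAMPS attribute is simplified to a boolean):

```java
import java.time.Clock;
import java.time.Instant;

// Sketch of the idea: each region holds its own java.time.Clock, and one
// helper function is the only place that decides between millisecond and
// nanosecond resolution.
public class RegionClockSketch {
    private final Clock clock;
    private final boolean nanosecondTimestamps;  // stands in for the table's NANOSECOND_TIMESTAMPS attribute

    public RegionClockSketch(Clock clock, boolean nanosecondTimestamps) {
        this.clock = clock;
        this.nanosecondTimestamps = nanosecondTimestamps;
    }

    // The single place where the nanosecond-vs-millisecond decision lives,
    // so callers never need to pass an "isNanosecondTimestamps" flag around.
    public long toTimestamp(Instant instant) {
        return nanosecondTimestamps
            ? instant.getEpochSecond() * 1_000_000_000L + instant.getNano()
            : instant.toEpochMilli();
    }

    public long currentTimestamp() {
        return toTimestamp(clock.instant());
    }

    public static void main(String[] args) {
        RegionClockSketch sketch = new RegionClockSketch(Clock.systemUTC(), true);
        System.out.println(sketch.currentTimestamp()); // current time in epoch nanoseconds
    }
}
```

StoreScanner or compaction code would then consult the region's `RegionClockSketch` rather than a global EnvironmentEdge, which also makes clocks injectable in tests.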
> Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > HBASE-21476.branch-2.1.0003.patch, HBASE-21476.branch-2.1.0004.patch, > nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. > Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. 
> - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
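The per-cell TTL scaling mentioned above could look roughly like this on the client side (a hypothetical helper for illustration, not an HBase client API):

```java
import java.util.concurrent.TimeUnit;

// Illustrative helper for the point above: once a table uses nanosecond
// timestamps, clients must scale per-cell TTLs (normally in milliseconds)
// to nanoseconds themselves. This is a sketch, not an HBase API.
public class CellTtlScaler {
    public static long scaleTtl(long ttlMillis, boolean nanosecondTimestamps) {
        return nanosecondTimestamps ? TimeUnit.MILLISECONDS.toNanos(ttlMillis) : ttlMillis;
    }

    public static void main(String[] args) {
        System.out.println(scaleTtl(5000L, false)); // 5000
        System.out.println(scaleTtl(5000L, true));  // 5000000000
    }
}
```

Column family TTL and "hbase.hstore.time.to.purge.deletes", by contrast, are adjusted server-side per the description, so only per-cell TTLs would need this treatment.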
[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap
[ https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21882: Labels: NEW_VERSION_BEHAVIOR query (was: ) > NEW_VERSION_BEHAVIOR blows up the heap > -- > > Key: HBASE-21882 > URL: https://issues.apache.org/jira/browse/HBASE-21882 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 2.1.2 >Reporter: Andrey Elenskiy >Priority: Major > Labels: NEW_VERSION_BEHAVIOR, query > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap
[ https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21882: Description: We've enabled NEW_VERSION_BEHAVIOR on our cluster that has a moderate amount of tiny scan requests in parallel and noticed that the heap grows to the max (10Gi), causing GC (CMS) to kick in and significantly slow down execution of the regionserver. This can be reproduced by doing 50+ scan requests on a single row in parallel. The heap usage goes down once the requests finish. Looking at NewVersionBehaviorTracker, it allocates 2 TreeMaps and a gazillion private fields for every scan request. I haven't profiled the cause of this memory bomb, but would guess that NewVersionBehaviorTracker is not a small object to allocate so often. Let me know if I can provide additional information. was:We've enabled > NEW_VERSION_BEHAVIOR blows up the heap > -- > > Key: HBASE-21882 > URL: https://issues.apache.org/jira/browse/HBASE-21882 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 2.1.2 >Reporter: Andrey Elenskiy >Priority: Major > Labels: NEW_VERSION_BEHAVIOR, query > > We've enabled NEW_VERSION_BEHAVIOR on our cluster that has a moderate amount of > tiny scan requests in parallel and noticed that the heap grows to the max (10Gi), > causing GC (CMS) to kick in and significantly slow down execution of the > regionserver. This can be reproduced by doing 50+ scan requests on a single > row in parallel. The heap usage goes down once the requests finish. > Looking at NewVersionBehaviorTracker, it allocates 2 TreeMaps and a gazillion > private fields for every scan request. I haven't profiled the cause of > this memory bomb, but would guess that NewVersionBehaviorTracker is not a > small object to allocate so often. > Let me know if I can provide additional information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
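The allocation pattern described can be shown in miniature (illustrative only; the class and field names do not mirror the real NewVersionBehaviorTracker):

```java
import java.util.TreeMap;

// Miniature of the pattern described above: each scan request constructs a
// tracker holding two TreeMaps, so N concurrent scans allocate 2*N maps
// (plus the tracker's other fields), which pressures the heap under load.
public class TrackerAllocationSketch {
    static class Tracker {
        final TreeMap<Long, Long> deletes = new TreeMap<>();
        final TreeMap<Long, Integer> versions = new TreeMap<>();
    }

    public static int mapsAllocatedFor(int concurrentScans) {
        int maps = 0;
        for (int i = 0; i < concurrentScans; i++) {
            Tracker t = new Tracker();  // one tracker per scan request
            maps += 2;                  // two TreeMaps per tracker
        }
        return maps;
    }

    public static void main(String[] args) {
        System.out.println(mapsAllocatedFor(50)); // 100 TreeMaps for 50 concurrent scans
    }
}
```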
[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap
[ https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21882: Description: We've enabled > NEW_VERSION_BEHAVIOR blows up the heap > -- > > Key: HBASE-21882 > URL: https://issues.apache.org/jira/browse/HBASE-21882 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 2.1.2 >Reporter: Andrey Elenskiy >Priority: Major > Labels: NEW_VERSION_BEHAVIOR, query > > We've enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap
[ https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21882: Summary: NEW_VERSION_BEHAVIOR blows up the heap (was: NEW) > NEW_VERSION_BEHAVIOR blows up the heap > -- > > Key: HBASE-21882 > URL: https://issues.apache.org/jira/browse/HBASE-21882 > Project: HBase > Issue Type: Umbrella >Reporter: Andrey Elenskiy >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap
[ https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21882: Component/s: regionserver > NEW_VERSION_BEHAVIOR blows up the heap > -- > > Key: HBASE-21882 > URL: https://issues.apache.org/jira/browse/HBASE-21882 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 2.1.2 >Reporter: Andrey Elenskiy >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap
[ https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21882: Affects Version/s: 2.1.2 > NEW_VERSION_BEHAVIOR blows up the heap > -- > > Key: HBASE-21882 > URL: https://issues.apache.org/jira/browse/HBASE-21882 > Project: HBase > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Andrey Elenskiy >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap
[ https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21882: Issue Type: Bug (was: Umbrella) > NEW_VERSION_BEHAVIOR blows up the heap > -- > > Key: HBASE-21882 > URL: https://issues.apache.org/jira/browse/HBASE-21882 > Project: HBase > Issue Type: Bug >Reporter: Andrey Elenskiy >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21882) NEW
Andrey Elenskiy created HBASE-21882: --- Summary: NEW Key: HBASE-21882 URL: https://issues.apache.org/jira/browse/HBASE-21882 Project: HBase Issue Type: Umbrella Reporter: Andrey Elenskiy -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21476: Attachment: HBASE-21476.branch-2.1.0004.patch > Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > HBASE-21476.branch-2.1.0003.patch, HBASE-21476.branch-2.1.0004.patch, > nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. 
> Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733655#comment-16733655 ] Andrey Elenskiy commented on HBASE-21476: - Added "-Dhbase.tests.nanosecond.timestamps" to run the existing tests that use HBaseTestingUtility with the NANOSECOND_TIMESTAMPS table attribute. It would be great if someone could trigger the build with this flag, since some tests (TestClientClusterMetrics and TestNettyIPC) time out on my machine, preventing me from running other tests. As for bulk imports, I don't quite know what could be updated, as it's the same problem: it's up to the client to be aware of what they are importing and into what version of a table. It's the client that specifies the timestamps. > Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > HBASE-21476.branch-2.1.0003.patch, nanosecond_timestamps_v1.patch, > nanosecond_timestamps_v2.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. 
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. > Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21476: Attachment: HBASE-21476.branch-2.1.0003.patch > Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > HBASE-21476.branch-2.1.0003.patch, nanosecond_timestamps_v1.patch, > nanosecond_timestamps_v2.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. 
> Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729981#comment-16729981 ] Andrey Elenskiy commented on HBASE-21545: - Thanks for reviewing and merging! I've created a couple more followup issues: https://issues.apache.org/jira/browse/HBASE-21654 https://issues.apache.org/jira/browse/HBASE-21653 > NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns > --- > > Key: HBASE-21545 > URL: https://issues.apache.org/jira/browse/HBASE-21545 > Project: HBase > Issue Type: Bug > Components: API >Affects Versions: 2.0.0, 2.1.1 > Environment: HBase 2.1.1 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: NEW_VERSION_BEHAVIOR > Fix For: 3.0.0, 2.2.0, 2.1.2, 2.0.4 > > Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, > HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, > HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch, Screen > Shot 2018-12-24 at 10.04.57 AM.png > > > Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one > column to be returned when columns are specified in Scan or Get query. The > result is always one first column by sorted order. I've attached a code > snippet to reproduce the issue that can be converted into a test. > I've also validated with hbase shell and gohbase client, so it's gotta be a > server-side issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21654) Sort through tests that fail when NEW_VERSION_BEHAVIOR is enabled
Andrey Elenskiy created HBASE-21654: --- Summary: Sort through tests that fail when NEW_VERSION_BEHAVIOR is enabled Key: HBASE-21654 URL: https://issues.apache.org/jira/browse/HBASE-21654 Project: HBase Issue Type: Umbrella Components: integration tests, regionserver Affects Versions: 2.0.0 Reporter: Andrey Elenskiy The "-Dhbase.tests.new.version.behavior=true" flag was added in https://issues.apache.org/jira/browse/HBASE-21545, which reruns all the integration tests with NEW_VERSION_BEHAVIOR enabled. Some tests failed, either due to tests checking the old behavior or due to other new bugs with NEW_VERSION_BEHAVIOR. So far the following test suites have failed: # TestKeepDeletes # TestMinVersions # TestExportSnapshot # TestSecureExportSnapshot # TestSyncTable # TestMobSecureExportSnapshot # TestThriftHBaseServiceHandler # TestThriftServer Some more discussion at https://issues.apache.org/jira/browse/HBASE-21545?focusedCommentId=16719510=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719510 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21653) NewVersionBehaviorTracker.checkVersions() should allow cell type to be DELETE
Andrey Elenskiy created HBASE-21653: --- Summary: NewVersionBehaviorTracker.checkVersions() should allow cell type to be DELETE Key: HBASE-21653 URL: https://issues.apache.org/jira/browse/HBASE-21653 Project: HBase Issue Type: Bug Components: API Affects Versions: 2.1.1, 2.0.0 Reporter: Andrey Elenskiy `MajorCompactionScanQueryMatcher.match()` states that "7. Delete marker need to be version counted together with puts they affect" which corresponds to a code path that can happen when KEEP_DELETED_CELLS is true. However, `NewVersionBehaviorTracker.checkVersions()` asserts that type cannot be DELETE. The AssertionError can be verified by running `TestKeepDeletes.testBasicScenario` with "-Dhbase.tests.new.version.behavior=true" after fixing it up to work as expected with NEW_VERSION_BEHAVIOR: {{--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestKeepDeletes.java}} {{+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestKeepDeletes.java}} {{@@ -135,7 +135,7 @@ public class TestKeepDeletes {}} {{ g.setMaxVersions();}} {{ g.setTimeRange(0L, ts+2);}} {{ Result r = region.get(g);}} {{- checkResult(r, c0, c0, T2, T1);}} {{+ checkResult(r, c0, c0, T2);}} {{ // flush}} {{ region.flush(true);}} Some more info in the following comment: https://issues.apache.org/jira/browse/HBASE-21545?focusedCommentId=16719510=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719510 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16727181#comment-16727181 ] Andrey Elenskiy commented on HBASE-21545: - Created a new request on ReviewBoard: https://reviews.apache.org/r/69624/ > NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns > --- > > Key: HBASE-21545 > URL: https://issues.apache.org/jira/browse/HBASE-21545 > Project: HBase > Issue Type: Bug > Components: API >Affects Versions: 2.0.0, 2.1.1 > Environment: HBase 2.1.1 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, > HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, > HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch > > > Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one > column to be returned when columns are specified in Scan or Get query. The > result is always one first column by sorted order. I've attached a code > snippet to reproduce the issue that can be converted into a test. > I've also validated with hbase shell and gohbase client, so it's gotta be a > server-side issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719532#comment-16719532 ] Andrey Elenskiy commented on HBASE-21545: - The assertion in checkVersions() seems to be another bug in NEW_VERSION_BEHAVIOR. Since NEW_VERSION_BEHAVIOR changes how versioning for deleted cells is accounted, we should expect deleted cells to be checked. In fact, it states so in "match()" of MajorCompactionScanQueryMatcher: "7. Delete marker need to be version counted together with puts they affect". Should I open a new bug for this? Also, what do you want to do about the failed tests: should a new set of UTs be written for NEW_VERSION_BEHAVIOR where it differs from the default one? > NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns > --- > > Key: HBASE-21545 > URL: https://issues.apache.org/jira/browse/HBASE-21545 > Project: HBase > Issue Type: Bug > Components: API >Affects Versions: 2.0.0, 2.1.1 > Environment: HBase 2.1.1 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, > HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, > HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch > > > Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one > column to be returned when columns are specified in Scan or Get query. The > result is always one first column by sorted order. I've attached a code > snippet to reproduce the issue that can be converted into a test. > I've also validated with hbase shell and gohbase client, so it's gotta be a > server-side issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719510#comment-16719510 ] Andrey Elenskiy commented on HBASE-21545: - Started debugging with TestKeepDeletes:
- TestKeepDeletes.testBasicScenario is testing the old version behavior, as the expected results differ before and after flush().
- TestKeepDeletes.testWithMinVersions also seems tailored to the old version behavior: the expected result differs after flush(), but we get the same one as described by NEW_VERSION_BEHAVIOR.
- TestKeepDeletes.testWithTTL again expects a different result after flush(), but we get the same one as described by NEW_VERSION_BEHAVIOR.
After fixing these tests to have the same expected result after flush(), the following tests still fail because of the assertion "!PrivateCellUtil.isDelete(type)" in NewVersionBehaviorTracker.checkVersions() when region.compact(true) is called:
TestKeepDeletes.testBasicScenario:148
TestKeepDeletes.testDeleteMarkerExpiration:506
TestKeepDeletes.testDeleteMarkerVersioning:711
TestKeepDeletes.testWithMinVersions:888
TestKeepDeletes.testWithOldRow:569
with
java.lang.AssertionError
at org.apache.hadoop.hbase.regionserver.querymatcher.NewVersionBehaviorTracker.checkVersions(NewVersionBehaviorTracker.java:305)
at org.apache.hadoop.hbase.regionserver.querymatcher.MajorCompactionScanQueryMatcher.match(MajorCompactionScanQueryMatcher.java:80)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:586)
at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:387)
at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1407)
I don't know whether this is a new bug, intended behavior, or a wrongly structured test; I will try to understand this assertion. Verifying all the tests does not look easy, though; this is something that should have been done by the original NEW_VERSION_BEHAVIOR contributors before the code was merged. I think the docs in the HBase book should be updated with a warning.
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719330#comment-16719330 ] Andrey Elenskiy commented on HBASE-21545: - I also have those failing on my machine. So the tests could be failing either due to other bugs in NEW_VERSION_BEHAVIOR or due to the tests actually being tailored to the old behavior. I'll start sorting through them.
[jira] [Work started] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-21545 started by Andrey Elenskiy.
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717870#comment-16717870 ] Andrey Elenskiy commented on HBASE-21545: - [~jatsakthi] looks like I uploaded the wrong patch for 4; I've fixed the compilation error in patch 5.
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Attachment: HBASE-21545.branch-2.1.0005.patch
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710686#comment-16710686 ] Andrey Elenskiy commented on HBASE-21545: - I've uploaded a patch where I modified HBaseTestingUtility to set the NEW_VERSION_BEHAVIOR attribute in integration tests when the "-Dhbase.tests.new.version.behavior=true" option is passed. This way we validate that all tests pass with this attribute. It would be great if you could trigger a build.
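A minimal sketch of how such a switch could be honored, assuming it is a plain JVM system property read by the test utility. The class and method names below are illustrative; this is not the actual HBaseTestingUtility code from the patch.

```java
// Illustrative only: read the proposed test switch as a JVM system property.
// The property name comes from the comment above; the class is hypothetical.
public class NewVersionBehaviorToggle {
    static boolean newVersionBehaviorEnabled() {
        // Boolean.getBoolean returns true only if the named system
        // property exists and equals "true" (case-insensitive).
        return Boolean.getBoolean("hbase.tests.new.version.behavior");
    }

    public static void main(String[] args) {
        System.setProperty("hbase.tests.new.version.behavior", "true");
        System.out.println(newVersionBehaviorEnabled()); // true
    }
}
```

A test utility could then apply NEW_VERSION_BEHAVIOR to every column family it creates whenever this returns true, so the whole suite runs under the new semantics without touching individual tests.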
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Attachment: HBASE-21545.branch-2.1.0004.patch
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709497#comment-16709497 ] Andrey Elenskiy commented on HBASE-21545: - Ok, I've attached a patch with the fix and a unit test for NewVersionBehaviorTracker with columns. I took the opportunity to refactor the checkColumn function to be a bit easier to follow (pretty much the same as ExplicitColumnTracker).
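The explicit-column-tracking idea referenced above can be sketched as a small standalone class. This is in the spirit of ExplicitColumnTracker, not the actual HBase code: it assumes one version per column and uses strings in place of the real ScanQueryMatcher.MatchCode enum.

```java
// Hypothetical simplification: walk the requested qualifiers in sorted
// order, advancing an index as the scanner produces cells, and include
// a cell only on an exact match.
import java.util.Arrays;

public class ExplicitColumnSketch {
    private final byte[][] columns;   // requested qualifiers, sorted
    private int columnIndex = 0;

    ExplicitColumnSketch(byte[][] columns) { this.columns = columns; }

    /** @return "INCLUDE", "SKIP", or "DONE" for the next qualifier seen. */
    String checkColumn(byte[] qualifier) {
        while (columnIndex < columns.length) {
            int cmp = Arrays.compare(qualifier, columns[columnIndex]);
            if (cmp == 0) { columnIndex++; return "INCLUDE"; }
            if (cmp < 0) return "SKIP"; // qualifier sorts before the next wanted column
            columnIndex++;              // wanted column has passed; try the next one
        }
        return "DONE";                  // every requested column was handled
    }

    public static void main(String[] args) {
        ExplicitColumnSketch t = new ExplicitColumnSketch(new byte[][]{{'A'}, {'C'}});
        System.out.println(t.checkColumn(new byte[]{'A'})); // INCLUDE
        System.out.println(t.checkColumn(new byte[]{'B'})); // SKIP
        System.out.println(t.checkColumn(new byte[]{'C'})); // INCLUDE
        System.out.println(t.checkColumn(new byte[]{'D'})); // DONE
    }
}
```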
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Attachment: HBASE-21545.branch-2.1.0003.patch
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Affects Version/s: 2.0.0
[jira] [Assigned] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy reassigned HBASE-21545: --- Assignee: Andrey Elenskiy (was: Sakthi)
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709411#comment-16709411 ] Andrey Elenskiy commented on HBASE-21545: - Got curious to learn and dig through the code. I believe I've found the cause of the bug. In hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/querymatcher/NewVersionBehaviorTracker.java it should be:

  public boolean done() {
-   // lastCq* have been updated to this cell.
-   return !(columns == null || lastCqArray == null) && Bytes
-       .compareTo(lastCqArray, lastCqOffset, lastCqLength, columns[columnIndex], 0,
-           columns[columnIndex].length) > 0;
+   return columnIndex >= columns.length;
  }

The reason it fails is that lastCq gets updated to the current cell while columnIndex hasn't been advanced past the already included column. Here's an example. Columns A, B, and C are in the row; a Get request asks for columns A and C:
1. lastCq* gets updated to A
2. checkColumn gets called on A, A gets included, columnIndex is on A
3. lastCq* gets updated to B
4. checkColumn calls done(), which checks whether B > A (columnIndex is still on A), and that returns true
5. the request is considered done, so C is never returned
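The faulty comparison can be shown with a small standalone simulation. This is a sketch of the logic described above, not the actual NewVersionBehaviorTracker class; the names and signatures are simplified assumptions.

```java
// Hypothetical simulation of the buggy vs. fixed done() logic.
import java.util.Arrays;

public class DoneLogicSketch {
    // Buggy variant: compares the last-seen qualifier against the column
    // the index still points at, so any non-requested qualifier sorting
    // after the current one ends the scan early.
    static boolean doneBuggy(byte[][] columns, byte[] lastCq, int columnIndex) {
        return columns != null && lastCq != null
            && Arrays.compare(lastCq, columns[columnIndex]) > 0;
    }

    // Fixed variant: done only once every requested column was visited.
    static boolean doneFixed(byte[][] columns, int columnIndex) {
        return columnIndex >= columns.length;
    }

    public static void main(String[] args) {
        byte[][] requested = { {'A'}, {'C'} };
        // After including 'A' (columnIndex still 0) the next cell seen is 'B':
        byte[] lastCq = { 'B' };
        System.out.println(doneBuggy(requested, lastCq, 0)); // true -> 'C' is never returned
        System.out.println(doneFixed(requested, 0));         // false -> scan continues to 'C'
    }
}
```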
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709353#comment-16709353 ] Andrey Elenskiy commented on HBASE-21545: - Oh, I'm not planning on working on a fix as I don't have enough knowledge about the moving pieces. [~busbey] asked for a reproduction test so I converted it :) On a side note, this seems like a bug that should have been caught by existing tests if they had been converted to use the new version behavior attribute. Do you think it would be possible to run all the tests again with this option enabled by default and see what other bugs come out?
[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709331#comment-16709331 ] Andrey Elenskiy commented on HBASE-21476:
> Is there any way we could make this a hard enforcement? Maybe a new optional way to flag that a client write has been done using nanoseconds?
Yea, it would be useful, as most developers probably look up code snippets online to talk to HBase, and I can see someone ending up providing milliseconds even if they created the table with the NANOSECOND_TIMESTAMPS attribute. However, I don't think it's possible to do this 100% reliably. One way is to assume that a timestamp cannot be smaller than 2,000,000,000 (the second second since the epoch in nanos, or the year 2033 in seconds), but that seems like a limiting assumption. We could add an option to disable this check, but then a client could hit a legitimately small timestamp at runtime and break their application with this exception.
> we could proactively reject operations from old clients
Old clients still work with this attribute; users would have to update code everywhere to use java.time.Instant, so I don't really see how checking the client version would help here. Another idea would be to disallow altering a table to set the NANOSECOND_TIMESTAMPS attribute: the table would have to be created with nanos from scratch. I assume that when a table is first created with this attribute, the users are starting a new application from scratch in a dev environment. This would hopefully rule out the case of clients adding this attribute by accident to an existing table and corrupting their data.
> Support for nanosecond timestamps
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
> Issue Type: New Feature
> Affects Versions: 2.1.1
> Reporter: Andrey Elenskiy
> Assignee: Andrey Elenskiy
> Priority: Major
> Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to handle timestamps with nanosecond precision. This is useful for applications that timestamp updates at the source with nanoseconds and still want features like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables which have timestamps only with nanosecond precision. There's no migration from milliseconds to nanoseconds for already existing tables. We could add this migration as part of compaction if you think that would be useful, but that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses java.time.Instant (https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html) to get time in nanoseconds, which means it will only work with Java 8. The idea is to gradually replace all places where "EnvironmentEdge.currentTime()" is used to have HBase working purely with nanoseconds (which is a prerequisite for HBASE-14070). Also, I've refactored ScanInfo and PartitionedMobCompactor to expect TableDescriptor as an argument, which makes the code a little cleaner and easier to extend.
> Couple more points:
> - Column family TTL (specified in seconds) and "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) don't need to be changed; those are adjusted automatically.
> - Per-cell TTL needs to be scaled by clients accordingly after the "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if this is a worthwhile direction. Will add more comprehensive tests in a later patch.
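The plausibility floor floated in the comment above can be sketched as follows. This is illustrative only: the class name and threshold are assumptions, not actual HBase API, and such a floor only catches obviously-too-small values (e.g. seconds-precision timestamps), not every misuse.

```java
// Hypothetical sanity check for a table that expects nanosecond timestamps.
public class NanoTimestampCheck {
    // Values below this floor are treated as "probably not nanoseconds"
    // (2,000,000,000 ns is only two seconds after the epoch).
    static final long MIN_PLAUSIBLE_NANOS = 2_000_000_000L;

    static boolean looksLikeNanos(long timestamp) {
        return timestamp >= MIN_PLAUSIBLE_NANOS;
    }

    public static void main(String[] args) {
        // A current wall-clock instant expressed in nanoseconds (~1.7e18).
        long nanos = java.time.Instant.now().toEpochMilli() * 1_000_000L;
        System.out.println(looksLikeNanos(nanos));          // true
        System.out.println(looksLikeNanos(1_700_000_000L)); // false: a seconds-precision value
    }
}
```

A server-side check like this would throw on the rejected value instead of returning false; the boolean form just makes the boundary easy to see.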
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Attachment: HBASE-21545.branch-2.1.0002.patch
[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709338#comment-16709338 ] Andrey Elenskiy commented on HBASE-21545: - Ignore the first patch; the second patch has a test that reproduces the issue.
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Attachment: HBASE-21545.branch-2.1.0001.patch
[jira] [Created] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
Andrey Elenskiy created HBASE-21545: --- Summary: NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns Key: HBASE-21545 URL: https://issues.apache.org/jira/browse/HBASE-21545 Project: HBase Issue Type: New Feature Components: API Affects Versions: 2.1.1 Environment: HBase 2.1.1, Hadoop 2.8.4, Java 8 Reporter: Andrey Elenskiy
Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one column to be returned when columns are specified in a Scan or Get query. The result is always the first column in sorted order. I've attached a code snippet that reproduces the issue and can be converted into a test.
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Tags: get, scan, (was: get, scan)
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Tags: get, scan, NEW_VERSION_BEHAVIOR (was: get, scan, )
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Issue Type: Bug (was: New Feature)
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Description: Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one column to be returned when columns are specified in a Scan or Get query. The result is always the first column in sorted order. I've attached a code snippet to reproduce the issue that can be converted into a test. I've also validated with the hbase shell and the gohbase client, so it must be a server-side issue. was:Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one column to be returned when columns are specified in a Scan or Get query. The result is always the first column in sorted order. I've attached a code snippet to reproduce the issue that can be converted into a test. > NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns > --- > > Key: HBASE-21545 > URL: https://issues.apache.org/jira/browse/HBASE-21545 > Project: HBase > Issue Type: New Feature > Components: API >Affects Versions: 2.1.1 > Environment: HBase 2.1.1 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Priority: Major > Attachments: App.java > > > Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one > column to be returned when columns are specified in a Scan or Get query. The > result is always the first column in sorted order. I've attached a code > snippet to reproduce the issue that can be converted into a test. > I've also validated with the hbase shell and the gohbase client, so it must be a > server-side issue.
[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
[ https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21545: Attachment: App.java > NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns > --- > > Key: HBASE-21545 > URL: https://issues.apache.org/jira/browse/HBASE-21545 > Project: HBase > Issue Type: New Feature > Components: API >Affects Versions: 2.1.1 > Environment: HBase 2.1.1 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Priority: Major > Attachments: App.java > > > Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one > column to be returned when columns are specified in a Scan or Get query. The > result is always the first column in sorted order. I've attached a code > snippet to reproduce the issue that can be converted into a test.
[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708100#comment-16708100 ] Andrey Elenskiy commented on HBASE-21476: - > Please use {{git format-patch}} to create future patches. (y) > What happens if a client that doesn't support nanoseconds attempts to write > to a table that is configured for nanoseconds? There's no error of any sort unless "hbase.hregion.keyvalue.timestamp.slop.millisecs" is specified. The value will be stored with the millisecond timestamp. It's up to the client to be careful here. > Do we need to account for this table attribute when bulk loading? Yes, that's a good point. As bulk import writes HFiles directly, we should also account for this table attribute when adding keyvalues/puts. Will update the patch. > What about snapshots? do they retain information on whether their contents use > nanoseconds? Do tables cloned from a snapshot have to have the same > nanosecond config as the snapshot? Yes; since snapshots include table attributes within `data.manifest`, a table cloned from a snapshot will also include the NANOSECOND_TIMESTAMPS attribute. > I see the WIP patches are starting to address MOB handling, but I don't see > it mentioned in the scope document at all. It's mentioned at the end of the Technical Approach section; it's part of the effort to pass the TableDescriptor to compactions. Will update the doc to clarify these points and also look into the test failures.
> Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. > Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. 
> Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
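The proposal above centers on a nanosecond-precision replacement for EnvironmentEdge.currentTime() built on java.time.Instant, with column family TTL (seconds) and "hbase.hstore.time.to.purge.deletes" (milliseconds) adjusted automatically. A minimal stdlib sketch of that idea follows; this is a hypothetical illustration, not the attached patch: the class and method names are assumptions, and the conversions show only the presumed unit arithmetic.

```java
import java.time.Instant;

public class NanoTimeSketch {
    // Sketch of a currentTimeNano()-style method: java.time.Instant (Java 8+)
    // exposes epoch seconds plus a nanosecond adjustment, which combine into
    // a single epoch-nanoseconds long.
    public static long currentTimeNano() {
        Instant now = Instant.now();
        return now.getEpochSecond() * 1_000_000_000L + now.getNano();
    }

    // Presumed automatic adjustment for column family TTL (seconds).
    public static long ttlSecondsToNanos(long ttlSeconds) {
        return ttlSeconds * 1_000_000_000L;
    }

    // Presumed automatic adjustment for millisecond-based options such as
    // hbase.hstore.time.to.purge.deletes.
    public static long millisToNanos(long millis) {
        return millis * 1_000_000L;
    }

    public static void main(String[] args) {
        // Epoch nanoseconds and epoch milliseconds should agree to within a
        // few seconds of each other.
        long nanos = currentTimeNano();
        System.out.println(Math.abs(nanos / 1_000_000L - System.currentTimeMillis()) < 5_000L);
        System.out.println(ttlSecondsToNanos(86_400L)); // one day expressed in nanoseconds
    }
}
```

One design constraint worth noting: epoch nanoseconds in a signed 64-bit long overflow around the year 2262, which any real implementation of this scheme would need to account for.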
[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21476: Attachment: nanosecond_timestamps_v2.patch > Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. 
> Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687312#comment-16687312 ] Andrey Elenskiy commented on HBASE-21476: - [~busbey] thanks for providing the examples of scope documents, those were helpful. I've attached one to this issue as requested. > Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > nanosecond_timestamps_v1.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. 
> Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21476: Attachment: Apache HBase - Nanosecond Timestamps v1.pdf > Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > nanosecond_timestamps_v1.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. 
> Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21476) Support for nanosecond timestamps
Andrey Elenskiy created HBASE-21476: --- Summary: Support for nanosecond timestamps Key: HBASE-21476 URL: https://issues.apache.org/jira/browse/HBASE-21476 Project: HBase Issue Type: New Feature Affects Versions: 2.1.1 Reporter: Andrey Elenskiy Assignee: Andrey Elenskiy Attachments: nanosecond_timestamps_v1.patch Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to handle timestamps with nanosecond precision. This is useful for applications that timestamp updates at the source with nanoseconds and still want features like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. The attribute should be specified either on new tables or on existing tables which have timestamps only with nanosecond precision. There's no migration from milliseconds to nanoseconds for already existing tables. We could add this migration as part of compaction if you think that would be useful, but that would obviously make the change more complex. I've added a new EnvironmentEdge method "currentTimeNano()" that uses [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] to get time in nanoseconds which means it will only work with Java 8. The idea is to gradually replace all places where "EnvironmentEdge.currentTime()" is used to have HBase working purely with nanoseconds (which is a prerequisite for HBASE-14070). Also, I've refactored ScanInfo and PartitionedMobCompactor to expect TableDescriptor as an argument which makes code a little cleaner and easier to extend. Couple more points: - column family TTL (specified in seconds) and "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options don't need to be changed, those are adjusted automatically. - Per cell TTL needs to be scaled by clients accordingly after "NANOSECOND_TIMESTAMPS" table attribute is specified. Looking for everyone's feedback to know if that's a worthwhile direction. Will add more comprehensive tests in a later patch. 
[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each
[ https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587958#comment-16587958 ] Andrey Elenskiy commented on HBASE-21032: - Even though we don't use read replicas, it seems like it's the same issue. The last row loaded from hbase:meta was: 18/08/21 19:46:21 INFO assignment.RegionStateStore: Load hbase:meta entry region=2ad4d95f7b9d9ba0d746b8da50a7f9a7, regionState=OPEN, lastHost=regionserver-0,16020,1534538897512, regionLocation=regionserver-3,16020,1533765879856, openSeqNum=172 Which, based on region encoded name, is hbase:namespace. However, the regionserver is different. I think I'm going to halt deploying the patched version for now. Instead, I've run App.java that I've attached and it looks like it behaves as expected. > ScanResponses contain only one cell each > > > Key: HBASE-21032 > URL: https://issues.apache.org/jira/browse/HBASE-21032 > Project: HBase > Issue Type: Bug > Components: Performance, Scanners >Affects Versions: 2.1.0 > Environment: HBase 2.1.0 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.1.1 > > Attachments: App.java, HBASE-21032-v1.patch, HBASE-21032-v1.patch, > HBASE-21032.patch > > > I have a long row with a bunch of columns that I'm scanning with > setAllowPartialResults(true). In the response I'm getting the first partial > ScanResponse being around 2MB with multiple cells while all of the consequent > ones being 1 cell per ScanResponse. After digging more, I found that each of > those single cell ScanResponse partials are preceded by a heartbeat (zero > cells). This results in two requests per cell to a regionserver. > I've attached code to reproduce it on hbase version 2.1.0 (it works as > expected on 2.0.0 and 2.0.1). > [^App.java] > I'm fairly certain it's a serverside issue as > [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. 
I > have not tried to reproduce this with multi-row scan. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
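The cost of the regression described above can be made concrete with some rough arithmetic. This is a hypothetical illustration (the cell counts and sizes below are made up, not taken from the report): a server that fills each partial up to maxResultSize needs on the order of totalBytes / maxResultSize responses, whereas the one-cell-per-response behavior, with a heartbeat preceding each cell, costs roughly two RPCs per cell.

```java
public class PartialScanMath {
    // Expected number of partial ScanResponses when the server packs each
    // partial up to maxResultSize, using ceiling division. All inputs are
    // hypothetical; real cells vary in size.
    static long expectedPartials(long totalCells, long cellSizeBytes, long maxResultSizeBytes) {
        long cellsPerPartial = Math.max(1L, maxResultSizeBytes / cellSizeBytes);
        return (totalCells + cellsPerPartial - 1) / cellsPerPartial;
    }

    public static void main(String[] args) {
        long totalCells = 10_000L, cellSize = 4_096L, maxResultSize = 2L * 1024 * 1024;
        // Well-behaved server: ~20 partials of ~2MB each for these numbers.
        System.out.println(expectedPartials(totalCells, cellSize, maxResultSize));
        // Degenerate behavior from the report: one cell per response plus a
        // heartbeat before each, i.e. roughly 2 * totalCells round trips.
        System.out.println(2 * totalCells);
    }
}
```

With these made-up numbers the well-behaved server answers in about 20 responses while the degenerate behavior costs about 20,000 round trips, which matches the order-of-magnitude blowup (~2-3 responses vs. ~260 plus heartbeats) the reporter observed.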
[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each
[ https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587865#comment-16587865 ] Andrey Elenskiy commented on HBASE-21032: - I'm hitting another issue I've had before with 2.1 when upgrading dev cluster. HBase master is stuck initializing because it can't get to hbase:namespace region: {{18/08/21 18:52:28 INFO client.RpcRetryingCallerImpl: Call exception, tries=32, retries=46, started=471534 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1508805323559.2ad4d95f7b9d9ba0d746b8da50a7f9a7. is not online on regionserver-0,16020,1534877035908}} {{ at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3287)}} {{ at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3264)}} {{ at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1428)}} {{ at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2443)}} {{ at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)}} {{ at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)}} {{ at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)}} {{ at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)}} {{ at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)}} {{, details=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1508805323559.2ad4d95f7b9d9ba0d746b8da50a7f9a7., hostname=regionserver-0,16020,1534538897512, seqNum=172}} And there's nothing about hbase:namespace in regionserver-0's logs. Don't really know how to work around this. 
> ScanResponses contain only one cell each > > > Key: HBASE-21032 > URL: https://issues.apache.org/jira/browse/HBASE-21032 > Project: HBase > Issue Type: Bug > Components: Performance, Scanners >Affects Versions: 2.1.0 > Environment: HBase 2.1.0 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.1.1 > > Attachments: App.java, HBASE-21032-v1.patch, HBASE-21032-v1.patch, > HBASE-21032.patch > > > I have a long row with a bunch of columns that I'm scanning with > setAllowPartialResults(true). In the response I'm getting the first partial > ScanResponse being around 2MB with multiple cells while all of the consequent > ones being 1 cell per ScanResponse. After digging more, I found that each of > those single cell ScanResponse partials are preceded by a heartbeat (zero > cells). This results in two requests per cell to a regionserver. > I've attached code to reproduce it on hbase version 2.1.0 (it works as > expected on 2.0.0 and 2.0.1). > [^App.java] > I'm fairly certain it's a serverside issue as > [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I > have not tried to reproduce this with multi-row scan. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each
[ https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587732#comment-16587732 ] Andrey Elenskiy commented on HBASE-21032: - Sure, giving it a try by applying to 5a40eae63e290c8a12b1e7d4dd01fc98ba09573d of branch-2.1 and deploying onto our dev. > ScanResponses contain only one cell each > > > Key: HBASE-21032 > URL: https://issues.apache.org/jira/browse/HBASE-21032 > Project: HBase > Issue Type: Bug > Components: Performance, Scanners >Affects Versions: 2.1.0 > Environment: HBase 2.1.0 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.1.1 > > Attachments: App.java, HBASE-21032-v1.patch, HBASE-21032-v1.patch, > HBASE-21032.patch > > > I have a long row with a bunch of columns that I'm scanning with > setAllowPartialResults(true). In the response I'm getting the first partial > ScanResponse being around 2MB with multiple cells while all of the consequent > ones being 1 cell per ScanResponse. After digging more, I found that each of > those single cell ScanResponse partials are preceded by a heartbeat (zero > cells). This results in two requests per cell to a regionserver. > I've attached code to reproduce it on hbase version 2.1.0 (it works as > expected on 2.0.0 and 2.0.1). > [^App.java] > I'm fairly certain it's a serverside issue as > [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I > have not tried to reproduce this with multi-row scan. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each
[ https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576525#comment-16576525 ] Andrey Elenskiy commented on HBASE-21032: - Yes, it should always try to fit MaxResultSize worth of Cells into a partial ScanResponse. You can see the difference by running the code I provided: hbase 2.0.0 returns only ~2-3 ScanResponses, while hbase 2.1.0 returns ~260 (2X that if you account for heartbeats). > ScanResponses contain only one cell each > > > Key: HBASE-21032 > URL: https://issues.apache.org/jira/browse/HBASE-21032 > Project: HBase > Issue Type: Bug > Components: Scanners >Affects Versions: 2.1.0 > Environment: HBase 2.1.0 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Priority: Major > Attachments: App.java > > > I have a long row with a bunch of columns that I'm scanning with > setAllowPartialResults(true). In the response I'm getting the first partial > ScanResponse being around 2MB with multiple cells while all of the consequent > ones being 1 cell per ScanResponse. After digging more, I found that each of > those single cell ScanResponse partials are preceded by a heartbeat (zero > cells). This results in two requests per cell to a regionserver. > I've attached code to reproduce it on hbase version 2.1.0 (it works as > expected on 2.0.0 and 2.0.1). > [^App.java] > I'm fairly certain it's a serverside issue as > [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I > have not tried to reproduce this with multi-row scan.
[jira] [Updated] (HBASE-21032) ScanResponses contain only one cell each
[ https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21032: Description: I have a long row with a bunch of columns that I'm scanning with setAllowPartialResults(true). In the response I'm getting the first partial ScanResponse being around 2MB with multiple cells while all of the consequent ones being 1 cell per ScanResponse. After digging more, I found that each of those single cell ScanResponse partials are preceded by a heartbeat (zero cells). This results in two requests per cell to a regionserver. I've attached code to reproduce it on hbase version 2.1.0 (it works as expected on 2.0.0 and 2.0.1). [^App.java] I'm fairly certain it's a serverside issue as [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I have not tried to reproduce this with multi-row scan. was: I have a long row with a bunch of columns that I'm scanning with setAllowPartialResults(true). In the response I'm getting the first partial being around 2MB while all of the consequent ones being 1 column per partial. After digging more, I found that each of those single column partials are preceded by a heartbeat response (zero cells). This results in two request per column to a regionserver. I've attached code to reproduce it on hbase version 2.1.0 (it works as expected on 2.0.0 and 2.0.1). [^App.java] I'm fairly certain it's a serverside issue as [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I have not tried to reproduce this with multi-row scan. 
> ScanResponses contain only one cell each > > > Key: HBASE-21032 > URL: https://issues.apache.org/jira/browse/HBASE-21032 > Project: HBase > Issue Type: Bug > Components: Scanners >Affects Versions: 2.1.0 > Environment: HBase 2.1.0 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Priority: Major > Attachments: App.java > > > I have a long row with a bunch of columns that I'm scanning with > setAllowPartialResults(true). In the response I'm getting the first partial > ScanResponse being around 2MB with multiple cells while all of the consequent > ones being 1 cell per ScanResponse. After digging more, I found that each of > those single cell ScanResponse partials are preceded by a heartbeat (zero > cells). This results in two requests per cell to a regionserver. > I've attached code to reproduce it on hbase version 2.1.0 (it works as > expected on 2.0.0 and 2.0.1). > [^App.java] > I'm fairly certain it's a serverside issue as > [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I > have not tried to reproduce this with multi-row scan. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21032) ScanResponses contain only one cell each
[ https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-21032: Summary: ScanResponses contain only one cell each (was: ScanResponse returns a partial result per cell) > ScanResponses contain only one cell each > > > Key: HBASE-21032 > URL: https://issues.apache.org/jira/browse/HBASE-21032 > Project: HBase > Issue Type: Bug > Components: Scanners >Affects Versions: 2.1.0 > Environment: HBase 2.1.0 > Hadoop 2.8.4 > Java 8 >Reporter: Andrey Elenskiy >Priority: Major > Attachments: App.java > > > I have a long row with a bunch of columns that I'm scanning with > setAllowPartialResults(true). In the response I'm getting the first partial > being around 2MB while all of the consequent ones being 1 column per partial. > After digging more, I found that each of those single column partials are > preceded by a heartbeat response (zero cells). This results in two requests > per column to a regionserver. > I've attached code to reproduce it on hbase version 2.1.0 (it works as > expected on 2.0.0 and 2.0.1). > [^App.java] > I'm fairly certain it's a serverside issue as > [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I > have not tried to reproduce this with multi-row scan.
[jira] [Created] (HBASE-21032) ScanResponse returns a partial result per cell
Andrey Elenskiy created HBASE-21032: --- Summary: ScanResponse returns a partial result per cell Key: HBASE-21032 URL: https://issues.apache.org/jira/browse/HBASE-21032 Project: HBase Issue Type: Bug Components: Scanners Affects Versions: 2.1.0 Environment: HBase 2.1.0 Hadoop 2.8.4 Java 8 Reporter: Andrey Elenskiy Attachments: App.java I have a long row with a bunch of columns that I'm scanning with setAllowPartialResults(true). In the response I'm getting the first partial being around 2MB while all of the consequent ones being 1 column per partial. After digging more, I found that each of those single column partials are preceded by a heartbeat response (zero cells). This results in two requests per column to a regionserver. I've attached code to reproduce it on hbase version 2.1.0 (it works as expected on 2.0.0 and 2.0.1). [^App.java] I'm fairly certain it's a serverside issue as [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I have not tried to reproduce this with multi-row scan.
[jira] [Comment Edited] (HBASE-16110) AsyncFS WAL doesn't work with Hadoop 2.8+
[ https://issues.apache.org/jira/browse/HBASE-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530323#comment-16530323 ] Andrey Elenskiy edited comment on HBASE-16110 at 7/2/18 7:14 PM: - Hello, we are running HBase 2.0.1 with the official Hadoop 2.8.4 jars and the Hadoop 2.8.4 client ([http://central.maven.org/maven2/org/apache/hadoop/hadoop-client/2.8.4/]). Got the following exception on a regionserver, which brings it down:
{noformat}
18/07/02 18:51:06 WARN concurrent.DefaultPromise: An exception was thrown by org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete()
java.lang.Error: Couldn't properly initialize access to HDFS internals. Please update your WAL Provider to not make use of the 'asyncfs' provider. See HBASE-16110 for more information.
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:268)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.initialize(FanOutOneBlockAsyncDFSOutputHelper.java:661)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.access$300(FanOutOneBlockAsyncDFSOutputHelper.java:118)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:720)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:715)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
	at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
	at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:638)
	at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:676)
	at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:552)
	at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)
	at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(org.apache.hadoop.fs.FileEncryptionInfo)
	at java.lang.Class.getDeclaredMethod(Class.java:2130)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.createTransparentCryptoHelper(FanOutOneBlockAsyncDFSOutputSaslHelper.java:232)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:262)
	... 18 more
{noformat}
FYI, we don't have encryption enabled. Let me know if you need more info about our setup.
> AsyncFS WAL doesn't work with Hadoop 2.8+ > - > > Key: HBASE-16110 > URL: https://issues.apache.org/jira/browse/HBASE-16110 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0 >Reporter: Sean Busbey >Assignee: Duo Zhang >Priority: Blocker > Fix For: 2.0.0 > > Attachments: HBASE-16110-v1.patch, HBASE-16110.patch > > > The async wal implementation doesn't work with Hadoop 2.8+. Fails compilation > and will fail running. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
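The trace above comes from HBase's asyncfs WAL reaching into HDFS internals via reflection: FanOutOneBlockAsyncDFSOutputSaslHelper looks up DFSClient.decryptEncryptedDataEncryptionKey by signature in its static initializer, and when Hadoop 2.8 changed that private method the lookup throws NoSuchMethodException, which is rethrown as the fatal java.lang.Error. A minimal sketch of the failure mode, with a local stub class standing in for DFSClient (all names here are illustrative, not the actual HBase code):

```java
import java.lang.reflect.Method;

public class ReflectiveLookupSketch {

    // Stub standing in for Hadoop's DFSClient after the 2.8 API change:
    // the internal method gained a different shape, so a lookup using the
    // old signature no longer matches.
    static class DfsClientStub {
        void decryptEncryptedDataEncryptionKey(String info, String extra) { }
    }

    /** Try the pre-2.8 signature; return null if the internals moved. */
    static Method findDecryptMethod() {
        try {
            // FanOutOneBlockAsyncDFSOutputSaslHelper performs this kind of
            // getDeclaredMethod lookup against the real DFSClient in its
            // static initializer.
            return DfsClientStub.class.getDeclaredMethod(
                "decryptEncryptedDataEncryptionKey", String.class);
        } catch (NoSuchMethodException e) {
            // In HBase this surfaces as java.lang.Error ("Couldn't properly
            // initialize access to HDFS internals") and aborts the server.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(findDecryptMethod() == null
            ? "lookup failed: switch to a non-asyncfs WAL provider"
            : "lookup succeeded");
    }
}
```

Because the lookup runs at class-initialization time, the failure is fatal rather than a per-write error; the patches attached to this issue update the reflection to cope with the newer Hadoop signatures, and switching the WAL provider away from 'asyncfs' avoids the code path entirely.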
[jira] [Commented] (HBASE-16244) LocalHBaseCluster start timeout should be configurable
[ https://issues.apache.org/jira/browse/HBASE-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138860#comment-16138860 ] Andrey Elenskiy commented on HBASE-16244: - Any reason this fix didn't make it to 1.3.x? We are hitting this in our integration tests as well on 1.3.1. > LocalHBaseCluster start timeout should be configurable > -- > > Key: HBASE-16244 > URL: https://issues.apache.org/jira/browse/HBASE-16244 > Project: HBase > Issue Type: Bug > Components: hbase >Affects Versions: 1.0.1.1 >Reporter: Siddharth Wagle > Fix For: 2.0.0, 1.4.0, 0.98.21 > > Attachments: HBASE-16244.patch > > > *Scenario*: > - Ambari metrics service uses HBase in standalone mode > - On restart of AMS HBase, the Master gives up in 30 seconds due to a > hardcoded timeout in JVMClusterUtil > {noformat} > 2016-07-18 19:24:44,199 ERROR [main] master.HMasterCommandLine: Master exiting > java.lang.RuntimeException: Master not active after 30 seconds > at > org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:194) > at > org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:445) > at > org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:227) > at > org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:139) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126) > at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2526) > {noformat} > - On restart, the current Master waits to become active, which leads to the > timeout being triggered; waiting slightly longer evades this issue. > - The timeout, it seems, was meant for unit tests. > The attached patch allows the timeout to be configured via hbase-site and also > sets it to 5 minutes for clusters started through HMasterCommandLine. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
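The fix described above replaces JVMClusterUtil's hardcoded 30-second wait with a poll loop whose deadline comes from configuration. The sketch below only illustrates that approach, not the actual HBase code: the config key name mirrors the patch's intent but may differ in a given release, and the Map/BooleanSupplier stubs stand in for HBase's Configuration object and master handle.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

public class StartupWaitSketch {

    /**
     * Poll until masterActive reports true, giving up after a timeout read
     * from configuration instead of the 30 s hardcoded in JVMClusterUtil.
     * Returns true if the master became active within the deadline.
     */
    static boolean waitForActiveMaster(Map<String, String> conf,
                                       BooleanSupplier masterActive,
                                       long pollMs) {
        // Hypothetical stand-in for reading hbase-site.xml; the key name
        // follows the patch's intent and may differ in your release.
        long timeoutMs = Long.parseLong(conf.getOrDefault(
            "hbase.master.start.timeout.localHBaseCluster", "30000"));
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (masterActive.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // caller raises "Master not active after N seconds"
    }

    public static void main(String[] args) {
        // A slow-starting master then needs only a larger configured
        // timeout, not a code change, to avoid the RuntimeException above.
        boolean up = waitForActiveMaster(
            Map.of("hbase.master.start.timeout.localHBaseCluster", "300000"),
            () -> true, 100);
        System.out.println(up ? "master active" : "timed out");
    }
}
```

This shape explains both symptoms in the report: a master that simply needs more than 30 seconds to become active fails spuriously, and raising the configured value (the patch defaults HMasterCommandLine-started clusters to 5 minutes) makes the restart succeed.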
[jira] [Updated] (HBASE-18066) Get with closest_row_before on "hbase:meta" can return empty Cell during region merge/split
[ https://issues.apache.org/jira/browse/HBASE-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Elenskiy updated HBASE-18066: Description: During region split/merge there's a brief period of time where doing a "Get" with "closest_row_before=true" on "hbase:meta" may return an empty "GetResponse.result.cell" field even though the parent, splitA, and splitB regions are all in "hbase:meta". Both gohbase (https://github.com/tsuna/gohbase) and AsyncHBase (https://github.com/OpenTSDB/asynchbase) interpret this as "TableDoesNotExist", which is returned to the client. Here's a gist that reproduces this problem: https://gist.github.com/Timoha/c7a236b768be9220e85e53e1ca53bf96. Note that you have to use an older HTable client (I used 1.2.4), as current versions ignore the `Get.setClosestRowBefore(bool)` option. > Get with closest_row_before on "hbase:meta" can return empty Cell during > region merge/split > --- > > Key: HBASE-18066 > URL: https://issues.apache.org/jira/browse/HBASE-18066 > Project: HBase > Issue Type: Bug > Components: hbase, regionserver >Affects Versions: 1.3.1 > Environment: Linux (16.04.2), MacOS 10.11.6. > Standalone and distributed HBase setup. 
>Reporter: Andrey Elenskiy > > During region split/merge there's a brief period of time where doing a "Get" > with "closest_row_before=true" on "hbase:meta" may return an empty > "GetResponse.result.cell" field even though the parent, splitA, and splitB > regions are all in "hbase:meta". Both gohbase (https://github.com/tsuna/gohbase) and > AsyncHBase (https://github.com/OpenTSDB/asynchbase) interpret this as > "TableDoesNotExist", which is returned to the client. > Here's a gist that reproduces this problem: > https://gist.github.com/Timoha/c7a236b768be9220e85e53e1ca53bf96. Note that > you have to use an older HTable client (I used 1.2.4), as current versions > ignore the `Get.setClosestRowBefore(bool)` option. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HBASE-18066) Get with closest_row_before on "hbase:meta" can return empty Cell during region merge/split
Andrey Elenskiy created HBASE-18066: --- Summary: Get with closest_row_before on "hbase:meta" can return empty Cell during region merge/split Key: HBASE-18066 URL: https://issues.apache.org/jira/browse/HBASE-18066 Project: HBase Issue Type: Bug Components: hbase, regionserver Affects Versions: 1.3.1 Environment: Linux (16.04.2), MacOS 10.11.6. Standalone and distributed HBase setup. Reporter: Andrey Elenskiy During region split/merge there's a brief period of time where doing a "Get" with "closest_row_before=true" on "hbase:meta" may return an empty "GetResponse.result.cell" field even though the parent, splitA, and splitB regions are all in "hbase:meta". Both gohbase (https://github.com/tsuna/gohbase) and AsyncHBase (https://github.com/OpenTSDB/asynchbase) interpret this as "TableDoesNotExist", which is returned to the client. Here's a gist that reproduces this problem: https://gist.github.com/Timoha/c7a236b768be9220e85e53e1ca53bf96. Note that you have to use an older HTable client (I used 1.2.4), as current versions ignore the `Get.setClosestRowBefore(bool)` option. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
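Since the empty Cell is transient (it lasts only while the split/merge transaction rewrites hbase:meta), a client library can retry the meta lookup briefly before surfacing TableDoesNotExist, rather than failing on the first empty response the way gohbase and AsyncHBase do here. A sketch of that defensive retry, with a Supplier standing in for the actual meta Get (none of these names belong to any real HBase client API):

```java
import java.util.Optional;
import java.util.function.Supplier;

public class MetaLookupRetrySketch {

    /**
     * A closest-row-before Get on hbase:meta can transiently return no Cell
     * while a split/merge is being written; retry a few times before
     * concluding that the table really does not exist.
     */
    static Optional<String> locateRegion(Supplier<Optional<String>> metaLookup,
                                         int maxAttempts, long backoffMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<String> cell = metaLookup.get();
            if (cell.isPresent()) {
                return cell; // normal case: region info found in meta
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs); // transient empty result: back off
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // Only after exhausting retries should a client surface
        // TableDoesNotExist to its caller.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulate a lookup that is empty once (mid-split), then resolves.
        int[] calls = {0};
        Optional<String> region = locateRegion(
            () -> ++calls[0] < 2 ? Optional.<String>empty()
                                 : Optional.of("region-after-split"),
            3, 1);
        System.out.println(region.orElse("TableDoesNotExist"));
    }
}
```

This is purely a client-side mitigation; the underlying issue is the window on the server side where the meta Get matches neither the parent nor the daughter region rows.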