[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master

2021-03-01 Thread Andrey Elenskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-25480:

Labels: JMX NullPointerException master  (was: )

> NPE when getting metrics of backup master
> -
>
> Key: HBASE-25480
> URL: https://issues.apache.org/jira/browse/HBASE-25480
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Andrey Elenskiy
>Assignee: Anjan Das
>Priority: Major
>  Labels: JMX, NullPointerException, master
>
> Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
> when getting metrics via JMX on backup master. It appears due to the fact 
> that regionNormalizerManager is null in backup masters as it's only 
> initialized by HMaster.finishActiveMasterInitialization().
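
A minimal sketch of the kind of null guard that would avoid this NPE. Note this is an illustration only: MetricsMasterWrapperSketch and the toy RegionNormalizerManager below are stand-ins for the real HBase classes, not the actual patch.

```java
// Toy stand-ins: the real classes live in org.apache.hadoop.hbase.master.
class RegionNormalizerManager {
    long getMergePlanCount() { return 5; }
}

class MetricsMasterWrapperSketch {
    // On a backup master this stays null, since it is only set by
    // finishActiveMasterInitialization() on the active master.
    private final RegionNormalizerManager regionNormalizerManager;

    MetricsMasterWrapperSketch(RegionNormalizerManager mgr) {
        this.regionNormalizerManager = mgr;
    }

    // Guarding the field avoids the NPE when JMX polls a backup master.
    long getMergePlanCount() {
        return regionNormalizerManager == null
            ? 0
            : regionNormalizerManager.getMergePlanCount();
    }

    public static void main(String[] args) {
        // Backup master: no manager yet, metric falls back to 0 instead of throwing.
        System.out.println(new MetricsMasterWrapperSketch(null).getMergePlanCount());
        // Active master: delegates to the manager.
        System.out.println(
            new MetricsMasterWrapperSketch(new RegionNormalizerManager()).getMergePlanCount());
    }
}
```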



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master

2021-03-01 Thread Andrey Elenskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-25480:

Affects Version/s: 2.4.1

> NPE when getting metrics of backup master
> -
>
> Key: HBASE-25480
> URL: https://issues.apache.org/jira/browse/HBASE-25480
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Andrey Elenskiy
>Assignee: Anjan Das
>Priority: Major
>
> Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
> when getting metrics via JMX on backup master. It appears due to the fact 
> that regionNormalizerManager is null in backup masters as it's only 
> initialized by HMaster.finishActiveMasterInitialization().





[jira] [Commented] (HBASE-25480) NPE when getting metrics of backup master

2021-01-11 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263043#comment-17263043
 ] 

Andrey Elenskiy commented on HBASE-25480:
-

Yup, please go ahead [~dasanjan1296] !

> NPE when getting metrics of backup master
> -
>
> Key: HBASE-25480
> URL: https://issues.apache.org/jira/browse/HBASE-25480
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.4.0
>Reporter: Andrey Elenskiy
>Priority: Major
>
> Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
> when getting metrics via JMX on backup master. It appears due to the fact 
> that regionNormalizerManager is null in backup masters as it's only 
> initialized by HMaster.finishActiveMasterInitialization().





[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master

2021-01-11 Thread Andrey Elenskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-25480:

Issue Type: Bug  (was: Brainstorming)

> NPE when getting metrics of backup master
> -
>
> Key: HBASE-25480
> URL: https://issues.apache.org/jira/browse/HBASE-25480
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.4.0
>Reporter: Andrey Elenskiy
>Priority: Major
>
> Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
> when getting metrics via JMX on backup master. It appears due to the fact 
> that regionNormalizerManager is null in backup masters as it's only 
> initialized by HMaster.finishActiveMasterInitialization().





[jira] [Commented] (HBASE-25480) NPE when getting metrics of backup master

2021-01-08 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261478#comment-17261478
 ] 

Andrey Elenskiy commented on HBASE-25480:
-

Updated to Major because it breaks observability of backup masters, making it 
look like there are none, for example in Prometheus.

> NPE when getting metrics of backup master
> -
>
> Key: HBASE-25480
> URL: https://issues.apache.org/jira/browse/HBASE-25480
> Project: HBase
>  Issue Type: Brainstorming
>  Components: master
>Affects Versions: 2.4.0
>Reporter: Andrey Elenskiy
>Priority: Major
>
> Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
> when getting metrics via JMX on backup master. It appears due to the fact 
> that regionNormalizerManager is null in backup masters as it's only 
> initialized by HMaster.finishActiveMasterInitialization().





[jira] [Updated] (HBASE-25480) NPE when getting metrics of backup master

2021-01-08 Thread Andrey Elenskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-25480:

Priority: Major  (was: Minor)

> NPE when getting metrics of backup master
> -
>
> Key: HBASE-25480
> URL: https://issues.apache.org/jira/browse/HBASE-25480
> Project: HBase
>  Issue Type: Brainstorming
>  Components: master
>Affects Versions: 2.4.0
>Reporter: Andrey Elenskiy
>Priority: Major
>
> Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
> when getting metrics via JMX on backup master. It appears due to the fact 
> that regionNormalizerManager is null in backup masters as it's only 
> initialized by HMaster.finishActiveMasterInitialization().





[jira] [Created] (HBASE-25480) NPE when getting metrics of backup master

2021-01-07 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-25480:
---

 Summary: NPE when getting metrics of backup master
 Key: HBASE-25480
 URL: https://issues.apache.org/jira/browse/HBASE-25480
 Project: HBase
  Issue Type: Brainstorming
  Components: master
Affects Versions: 2.4.0
Reporter: Andrey Elenskiy


Getting NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
when getting metrics via JMX on backup master. It appears due to the fact that 
regionNormalizerManager is null in backup masters as it's only initialized by 
HMaster.finishActiveMasterInitialization().





[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2020-11-17 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233860#comment-17233860
 ] 

Andrey Elenskiy commented on HBASE-22665:
-

Hitting exactly the same issue on 2.2.4. The issue was exposed when the host 
was under massive load, causing various pauses and OOMs across the OS.
 All the handlers are stuck on SyncFuture.get() while the AsyncFSWAL thread is just 
waiting on a condition. The regionserver also didn't trigger 
AbstractFSWAL.shutdown() initially and was stuck without it. Only once I 
connected with jdb to check the state was shutdown() called, and it then hung on 
exit until a timeout killed the process (prior to that it was stuck 
appending). Here is what the handlers look like:
{code:java}
Thread 48 (RpcServer.default.FPBQ.Fifo.handler=7,queue=1,port=16201):
 State: TIMED_WAITING
 Blocked count: 1193588
 Waited count: 2105228
 Stack:
 java.lang.Object.wait(Native Method)
 org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:142)
 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:722)
 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:637)
 org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:8588)
 org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:7946)
 
org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4130)
 org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4072)
 org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4003)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1042)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:974)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:937)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2755)
 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42290)
 org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
 org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
 org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
 org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318){code}
What the AsyncFSWAL thread looks like:
{code:java}
Thread 164 (AsyncFSWAL-0):
 State: WAITING
 Blocked count: 6198
 Waited count: 32255423
 Waiting on 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@bf1c663
 Stack:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748){code}
There were lots of logs like this right before it got stuck:
{code:java}
2020-11-17 14:37:53,902 WARN [AsyncFSWAL-0] wal.AsyncFSWAL: sync failed
java.io.IOException: Connection to 192.168.2.23/192.168.2.23:15010 closed
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$channelInactive$2(FanOutOneBlockAsyncDFSOutput.java:289)
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:236)
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:99)
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.channelInactive(FanOutOneBlockAsyncDFSOutput.java:288)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
 at 
org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
 at 

[jira] [Updated] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2020-11-17 Thread Andrey Elenskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-22665:

Attachment: hbase.log

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: HBASE-22665-UT.patch, hbase.log, 
> image-2019-07-08-16-07-37-664.png, image-2019-07-08-16-08-26-777.png, 
> image-2019-07-08-16-14-43-455.png, jstack_20190625, jstack_20190704_1, 
> jstack_20190704_2, rs.log.part1, rs.log_part2.zip
>
>
> We use HBase 2.1.2. When the RS is under heavy QPS, it aborts with an error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> The RegionServer abort fails when AbstractFSWAL.shutdown hangs.
>  
> jstack info always shows the regionserver hanging in "AbstractFSWAL.shutdown":
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>  at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815)
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117)
>  at java.lang.Thread.run(Thread.java:745)
>  
>  
>  
>  





[jira] [Commented] (HBASE-24920) A tool to rewrite corrupted HFiles

2020-08-31 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188033#comment-17188033
 ] 

Andrey Elenskiy commented on HBASE-24920:
-

Was away for a week and just attempted to write a prototype. The tool is fairly 
simple: if HFileScanner.next() throws an exception, search for the next DATABLK magic 
offset in the FSInputStream and resume from there.

However, I realized that it wouldn't be possible for it to be part of 
hbase-operator-tools. I rely on some private classes in the 
org.apache.hadoop.hbase.io.hfile package, which change quite a bit between 
versions (2.2.5, 2.3.0, and master are all different), making a drop-in jar 
impossible. So, there are a couple of choices here:
1. Package the tool with its dependencies and pin it to some HBase version (which one?)
2. Package it with HBase as part of the "hbase hfile" command (probably why 
[~stack] suggested it in the first place ;))
3. Completely re-implement the needed pieces in hbase-operator-tools so that there 
are no dependencies on the HBase version (making the tool hard to maintain).
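
The "search for DATABLK magic and resume" step can be sketched as a plain byte scan, independent of any HBase version. This is a hypothetical helper (MagicScanner and nextBlockOffset are invented names), assuming the 8-byte on-disk data-block magic "DATABLK*":

```java
// Hypothetical sketch of the resume strategy: after a read failure, scan the raw
// bytes for the next data-block magic and resume reading from that offset.
public class MagicScanner {
    // Assumed 8-byte magic that prefixes each HFile data block on disk.
    static final byte[] DATABLK_MAGIC = "DATABLK*".getBytes();

    // Returns the offset of the next magic marker at or after 'from', or -1 if none.
    static long nextBlockOffset(byte[] data, int from) {
        outer:
        for (int i = from; i + DATABLK_MAGIC.length <= data.length; i++) {
            for (int j = 0; j < DATABLK_MAGIC.length; j++) {
                if (data[i + j] != DATABLK_MAGIC[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

A real tool would wrap this around the FSInputStream and re-open the HFileScanner at the returned offset, which is exactly where the version-specific private classes come in.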

> A tool to rewrite corrupted HFiles
> --
>
> Key: HBASE-24920
> URL: https://issues.apache.org/jira/browse/HBASE-24920
> Project: HBase
>  Issue Type: Brainstorming
>  Components: hbase-operator-tools
>Reporter: Andrey Elenskiy
>Priority: Major
>
> Typically I have been dealing with corrupted HFiles (due to loss of hdfs 
> blocks) by just removing them. However, it always seemed wasteful to throw 
> away the entire HFile (which can be hundreds of gigabytes), just because one 
> hdfs block is missing (128MB).
> I think there's a possibility for a tool that can rewrite an HFile by 
> skipping corrupted blocks. 
> There can be multiple types of issues with hdfs blocks but any of them can be 
> treated as if the block doesn't exist:
> 1. All the replicas can be lost
> 2. The block can be corrupted due to some bug in hdfs (I've recently run into 
> HDFS-15186 by experimenting with EC).
> At the simplest the tool can be a local mapreduce job (mapper only) with a 
> custom HFile reader input that can seek to next DATABLK to skip corrupted 
> hdfs blocks.





[jira] [Commented] (HBASE-24920) A tool to rewrite corrupted HFiles

2020-08-20 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181463#comment-17181463
 ] 

Andrey Elenskiy commented on HBASE-24920:
-

I currently like that `hbase hfile` is read-only and lets you debug things (I'm 
using it to debug the corrupt blocks at the moment); the class is even named 
HFilePrettyPrinter. On the mailing list Sean suggested putting it in the 
`hbase-operator-tools` repo, and I prefer that as it aligns better with the 
goal of that repo (correcting bugs/inconsistencies via operator intervention). 
What do you think?

> A tool to rewrite corrupted HFiles
> --
>
> Key: HBASE-24920
> URL: https://issues.apache.org/jira/browse/HBASE-24920
> Project: HBase
>  Issue Type: Brainstorming
>  Components: hbase-operator-tools
>Reporter: Andrey Elenskiy
>Priority: Major
>
> Typically I have been dealing with corrupted HFiles (due to loss of hdfs 
> blocks) by just removing them. However, it always seemed wasteful to throw 
> away the entire HFile (which can be hundreds of gigabytes), just because one 
> hdfs block is missing (128MB).
> I think there's a possibility for a tool that can rewrite an HFile by 
> skipping corrupted blocks. 
> There can be multiple types of issues with hdfs blocks but any of them can be 
> treated as if the block doesn't exist:
> 1. All the replicas can be lost
> 2. The block can be corrupted due to some bug in hdfs (I've recently run into 
> HDFS-15186 by experimenting with EC).
> At the simplest the tool can be a local mapreduce job (mapper only) with a 
> custom HFile reader input that can seek to next DATABLK to skip corrupted 
> hdfs blocks.





[jira] [Created] (HBASE-24920) A tool to rewrite corrupted HFiles

2020-08-20 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-24920:
---

 Summary: A tool to rewrite corrupted HFiles
 Key: HBASE-24920
 URL: https://issues.apache.org/jira/browse/HBASE-24920
 Project: HBase
  Issue Type: Brainstorming
  Components: hbase-operator-tools
Reporter: Andrey Elenskiy


Typically I have been dealing with corrupted HFiles (due to loss of hdfs 
blocks) by just removing them. However, it always seemed wasteful to throw away 
the entire HFile (which can be hundreds of gigabytes), just because one hdfs 
block is missing (128MB).

I think there's a possibility for a tool that can rewrite an HFile by skipping 
corrupted blocks. 

There can be multiple types of issues with hdfs blocks but any of them can be 
treated as if the block doesn't exist:
1. All the replicas can be lost
2. The block can be corrupted due to some bug in hdfs (I've recently run into 
HDFS-15186 by experimenting with EC).

At the simplest the tool can be a local mapreduce job (mapper only) with a 
custom HFile reader input that can seek to next DATABLK to skip corrupted hdfs 
blocks.





[jira] [Created] (HBASE-24919) A tool to rewrite corrupted HFiles

2020-08-20 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-24919:
---

 Summary: A tool to rewrite corrupted HFiles
 Key: HBASE-24919
 URL: https://issues.apache.org/jira/browse/HBASE-24919
 Project: HBase
  Issue Type: Brainstorming
  Components: hbase-operator-tools
Reporter: Andrey Elenskiy


Typically I have been dealing with corrupted HFiles (due to loss of hdfs 
blocks) by just removing them. However, it always seemed wasteful to throw away 
the entire HFile (which can be hundreds of gigabytes), just because one hdfs 
block is missing (128MB).

I think there's a possibility for a tool that can rewrite an HFile by skipping 
corrupted blocks. 

There can be multiple types of issues with hdfs blocks but any of them can be 
treated as if the block doesn't exist:
1. All the replicas can be lost
2. The block can be corrupted due to some bug in hdfs (I've recently run into 
HDFS-15186 by experimenting with EC).

At the simplest the tool can be a local mapreduce job (mapper only) with a 
custom HFile reader input that can seek to next DATABLK to skip corrupted hdfs 
blocks.





[jira] [Commented] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI

2020-05-28 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119226#comment-17119226
 ] 

Andrey Elenskiy commented on HBASE-24438:
-

The only way I was able to reproduce this is by restarting the master right after 
it handled an SCP for a regionserver. Looking at the code, I found this function 
in ServerCrashProcedure.java that handles updating the task:

{code:java}
  void updateProgress(boolean updateState) {
String msg = "Processing ServerCrashProcedure of " + serverName;
if (status == null) {
  status = TaskMonitor.get().createStatus(msg);
  return;
}
if (currentRunningState == ServerCrashState.SERVER_CRASH_FINISH) {
  status.markComplete(msg + " done");
  return;
}
if (updateState) {
  currentRunningState = getCurrentState();
}
int childrenLatch = getChildrenLatch();
status.setStatus(msg + " current State " + currentRunningState
+ (childrenLatch > 0 ? "; remaining num of running child procedures = " 
+ childrenLatch
: ""));
  }
{code}

Given that the "status" part of the MonitoredTask says "null" (you can see that 
the UI just shows "since" for Status), it means that updateProgress was called only 
once. Looking at the places where updateProgress can be called:
* executeFromState(), called by ProcedureExecutor — it cannot be the one, as the 
SCP is not listed in the Procedures & Locks tab and never actually "Started"
* TransitRegionStateProcedure.confirmOpened calls 
ServerCrashProcedure.updateProgress and also cannot be the one, as the SCP is not 
in the procedures list
* deserializeStateData() is called when the procedure WAL is de-serialized and 
is likely responsible for this "stale" task

The procedure never materializes because it had actually been completed by the 
previous master, so it just gets de-serialized and thrown away by the new 
master. Does that sound like a possible case?

It looks like that side effect was indeed added to deserializeStateData by 
https://issues.apache.org/jira/browse/HBASE-21647 but unfortunately I don't see 
any comments explaining the reason. Any ideas?
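
As a toy model of the suspected mechanism (the classes below are simplified stand-ins, not the real TaskMonitor/MonitoredTask): a status created as a side effect of deserialization, for a procedure the previous master already finished, never has markComplete() called on it and so lingers in the monitor:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a MonitoredTask entry.
class ToyStatus {
    final String msg;
    boolean complete;
    ToyStatus(String msg) { this.msg = msg; }
    void markComplete() { complete = true; }
}

// Simplified stand-in for TaskMonitor: tasks stay listed until marked complete.
class ToyTaskMonitor {
    final List<ToyStatus> tasks = new ArrayList<>();

    ToyStatus createStatus(String msg) {
        ToyStatus s = new ToyStatus(msg);
        tasks.add(s);
        return s;
    }

    long staleCount() {
        return tasks.stream().filter(s -> !s.complete).count();
    }

    public static void main(String[] args) {
        ToyTaskMonitor monitor = new ToyTaskMonitor();
        // The deserialization side effect creates the status...
        monitor.createStatus("Processing ServerCrashProcedure of rs1");
        // ...but the procedure was already done on the old master, so
        // markComplete() is never reached and the task shows as stale.
        System.out.println(monitor.staleCount());
    }
}
```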

> Stale ServerCrashProcedure task in HBase Master UI
> --
>
> Key: HBASE-24438
> URL: https://issues.apache.org/jira/browse/HBASE-24438
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.2.4
> Environment: HBase 2.2.4
> HDFS 3.1.3 with erasure coding enabled
> Kubernetes
>Reporter: Andrey Elenskiy
>Priority: Major
>
> Tasks section (show non-RPC Tasks) in HBase Master UI has stale entries with 
> ServerCrashProcedure after master failover. The procedures have finished with 
> SUCCESS on a previously active HBase master and aren't showing in "Procedures 
> & Locks".
> Based on the logs, both of those regionserver were carrying hbase:meta (logs 
> are sorted newest first grepped for those specific servers that have stale 
> ServerCrashProcedures):
> {noformat}
> 2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] 
> procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; 
> ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true in 2.5290sec
>  2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: removed crashed server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after 
> splitting done
>  2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
>  2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: 
> Log dir for server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not 
> exist
>  2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
>  2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: pid=38, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true found RIT pid=20, ppid=18, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> rit=ABNORMALLY_CLOSED, 
> location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> table=hbase:meta, region=1588230740
>  2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
> [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
>  in 0ms

[jira] [Commented] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-28 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119146#comment-17119146
 ] 

Andrey Elenskiy commented on HBASE-22041:
-

> When did K8S register in DNS the new pod?

Given the eventually consistent nature of k8s, it's possible that the DNS mapping 
is updated after the new regionserver pod has already started. 
Unfortunately, I can't verify whether that's the case, as DNS isn't managed by us, 
so I can't add extra logging there. I think for the sake of argument we can assume 
that the DNS mapping is inconsistent. That said, this could happen on any 
infra, as DNS can be inconsistent due to caching in multiple places 
(systemd-resolved, intermediate DNS servers, the Java DNS cache, etc.) or 
operators simply being slow to update records.

> What happens if you run w/ -Dsun.net.inetaddr.ttl=1 instead of 10?

I was able to reproduce this issue with ttl=1 as well as ttl=0 (which I guess 
means no caching).
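
For reference, the -Dsun.net.inetaddr.ttl system property mentioned above has a documented counterpart, the networkaddress.cache.ttl security property, which can also be set programmatically; a minimal check (the value here is just the one from this discussion):

```java
import java.security.Security;

public class DnsTtl {
    public static void main(String[] args) {
        // Equivalent of -Dsun.net.inetaddr.ttl=0: a TTL of 0 disables the JVM's
        // positive DNS cache entirely, so every lookup hits the resolver.
        Security.setProperty("networkaddress.cache.ttl", "0");
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```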

> [k8s] The crashed node exists in onlineServer forever, and if it holds the 
> meta data, master will start up hang.
> 
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
>  Issue Type: Bug
>Reporter: lujie
>Priority: Critical
> Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> During a fresh master boot, we crash (kill -9) the RS that holds meta. We find 
> that the master startup fails and prints thousands of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> 

[jira] [Commented] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI

2020-05-28 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118895#comment-17118895
 ] 

Andrey Elenskiy commented on HBASE-24438:
-

I guess the issue is "staleness". The procedures clearly finished during the 
reign of the previously active master. However, the new master still shows the 
ServerCrashProcedures as running for multiple hours in the "Tasks" -> "show non-RPC 
Tasks" UI. Let me know if I should rephrase the issue or provide more info to help 
you identify the root cause.

> Stale ServerCrashProcedure task in HBase Master UI
> --
>
> Key: HBASE-24438
> URL: https://issues.apache.org/jira/browse/HBASE-24438
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.2.4
> Environment: HBase 2.2.4
> HDFS 3.1.3 with erasure coding enabled
> Kubernetes
>Reporter: Andrey Elenskiy
>Priority: Major
>
> The Tasks section (show non-RPC Tasks) in the HBase Master UI has stale 
> ServerCrashProcedure entries after a master failover. The procedures finished 
> with SUCCESS on the previously active HBase master and aren't showing in 
> "Procedures & Locks".
> Based on the logs, both of those regionservers were carrying hbase:meta (logs 
> are sorted newest first, grepped for the specific servers that have stale 
> ServerCrashProcedures):
> {noformat}
> 2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] 
> procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; 
> ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true in 2.5290sec
>  2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: removed crashed server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after 
> splitting done
>  2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
>  2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: 
> Log dir for server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not 
> exist
>  2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
>  2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: pid=38, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true found RIT pid=20, ppid=18, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> rit=ABNORMALLY_CLOSED, 
> location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> table=hbase:meta, region=1588230740
>  2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
> [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
>  in 0ms
>  2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting
>  dir is empty, no logs to split.
>  2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
>  2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: Start pid=38, 
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true
>  2020-05-21 19:04:06,544 INFO [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=38 for 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 
> (carryingMeta=true) 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
>  2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager: 
> Processing expiration of 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on 
> hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
>  2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> ...
> 2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27] 
> 

[jira] [Commented] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI

2020-05-26 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116916#comment-17116916
 ] 

Andrey Elenskiy commented on HBASE-24438:
-

Looks like showing ServerCrashProcedure tasks in the UI was introduced by 
https://issues.apache.org/jira/browse/HBASE-21647

> Stale ServerCrashProcedure task in HBase Master UI
> --
>
> Key: HBASE-24438
> URL: https://issues.apache.org/jira/browse/HBASE-24438
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.2.4
> Environment: HBase 2.2.4
> HDFS 3.1.3 with erasure coding enabled
> Kubernetes
>Reporter: Andrey Elenskiy
>Priority: Major
>
> The Tasks section (show non-RPC Tasks) in the HBase Master UI has stale 
> ServerCrashProcedure entries after a master failover. The procedures finished 
> with SUCCESS on the previously active HBase master and aren't showing in 
> "Procedures & Locks".
> Based on the logs, both of those regionservers were carrying hbase:meta (logs 
> are sorted newest first, grepped for the specific servers that have stale 
> ServerCrashProcedures):
> {noformat}
> 2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] 
> procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; 
> ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true in 2.5290sec
>  2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: removed crashed server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after 
> splitting done
>  2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
>  2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: 
> Log dir for server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not 
> exist
>  2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
>  2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: pid=38, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true found RIT pid=20, ppid=18, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> rit=ABNORMALLY_CLOSED, 
> location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> table=hbase:meta, region=1588230740
>  2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
> [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
>  in 0ms
>  2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting
>  dir is empty, no logs to split.
>  2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
>  2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: Start pid=38, 
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true
>  2020-05-21 19:04:06,544 INFO [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=38 for 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 
> (carryingMeta=true) 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
>  2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager: 
> Processing expiration of 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on 
> hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
>  2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> ...
> 2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27] 
> procedure2.ProcedureExecutor: Finished pid=37, state=SUCCESS; 
> ServerCrashProcedure 
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
> splitWal=true, meta=true in 997msec
>  2020-05-21 

[jira] [Created] (HBASE-24438) Stale ServerCrashProcedure task in HBase Master UI

2020-05-26 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-24438:
---

 Summary: Stale ServerCrashProcedure task in HBase Master UI
 Key: HBASE-24438
 URL: https://issues.apache.org/jira/browse/HBASE-24438
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.2.4
 Environment: HBase 2.2.4
HDFS 3.1.3 with erasure coding enabled
Kubernetes
Reporter: Andrey Elenskiy


The Tasks section (show non-RPC Tasks) in the HBase Master UI has stale 
ServerCrashProcedure entries after a master failover. The procedures finished 
with SUCCESS on the previously active HBase master and aren't showing in 
"Procedures & Locks".

Based on the logs, both of those regionservers were carrying hbase:meta (logs 
are sorted newest first, grepped for the specific servers that have stale 
ServerCrashProcedures):
{noformat}
2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] 
procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; 
ServerCrashProcedure 
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
splitWal=true, meta=true in 2.5290sec
 2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: removed crashed server 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after 
splitting done
 2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
dead splitlog workers 
[regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
 2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: 
Log dir for server 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not exist
 2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
 2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: pid=38, 
state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
splitWal=true, meta=true found RIT pid=20, ppid=18, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
rit=ABNORMALLY_CLOSED, 
location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
table=hbase:meta, region=1588230740
 2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
[hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
 in 0ms
 2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting
 dir is empty, no logs to split.
 2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
dead splitlog workers 
[regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
 2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: Start pid=38, 
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
splitWal=true, meta=true
 2020-05-21 19:04:06,544 INFO [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=38 for 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 
(carryingMeta=true) 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write
 locks = 1, Read locks = 0], oldState=ONLINE.
 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager: 
Processing expiration of 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on 
hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
...
2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27] 
procedure2.ProcedureExecutor: Finished pid=37, state=SUCCESS; 
ServerCrashProcedure 
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
splitWal=true, meta=true in 997msec
 2020-05-21 19:04:04,497 INFO [KeepAlivePEWorker-27] 
procedure.ServerCrashProcedure: removed crashed server 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 after 
splitting done
 2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
dead splitlog workers 
[regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
 2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.MasterWalManager: 
Log dir for server 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 does 

[jira] [Commented] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-22 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114447#comment-17114447
 ] 

Andrey Elenskiy commented on HBASE-22041:
-

> Oh, I bet HDFS gets confused too... (but maybe not – IIRC, it creates a name 
> to use referring to the DN...) Let me check logs.

So for Hadoop we route clients by hostname (dfs.client.use.datanode.hostname) 
and provision a k8s service per datanode, which results in a stable IP per 
datanode. That's a workaround for the bug 
(https://issues.apache.org/jira/browse/HDFS-15250), which I don't think was 
properly addressed there. Otherwise, we could have just relied on the hostnames 
of the pods without needing a service. 
(During a pod restart, its hostname is also removed from DNS, resulting in 
UnresolvedHostnameException for clients.)
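For reference, the routing-by-hostname part of the workaround above comes down to one client-side HDFS setting (the property name is the one cited above; treat this as a sketch to pair with a stable per-datanode Service):

```xml
<!-- hdfs-site.xml (client side): resolve datanodes by hostname instead of
     the IP reported by the namenode, so pod IP churn doesn't break clients. -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```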

The ideal for HBase on k8s would be to not cache any IPs (stateless 
connections) and not rely on hostnames (kind of like Kafka brokers), but that's 
probably not easy to change.

> [k8s] The crashed node exists in onlineServer forever, and if it holds the 
> meta data, master will start up hang.
> 
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
>  Issue Type: Bug
>Reporter: lujie
>Priority: Critical
> Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> While the master freshly boots, we crash (kill -9) the RS that holds meta. We find 
> that the master startup fails and prints thousands of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> 

[jira] [Commented] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-21 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113423#comment-17113423
 ] 

Andrey Elenskiy commented on HBASE-22041:
-

Attached the entire hbasemaster log (hbasemaster.log) with TRACE enabled, 
captured right before reproducing the issue.

I triggered the issue at "Thu May 21 17:28:42 UTC 2020", and the 
topology looked like this:
{noformat}
hbasemaster-0 10.128.25.30
hbasemaster-1 10.128.6.51 
regionserver-0 10.128.53.53
regionserver-1 10.128.9.37 
regionserver-2 10.128.14.39{noformat}
 

The way I trigger the issue is by picking a regionserver with 0 regions 
(because it was restarted recently), triggering "balancer", and killing the 
regionserver during the execution of the balancer. In this case the regionserver I 
killed was regionserver-2. Here's how the topology looked after regionserver-2 
came back up:

 
{noformat}
hbasemaster-0 10.128.25.30
hbasemaster-1 10.128.6.51 
regionserver-0 10.128.53.53
regionserver-1 10.128.9.37 
regionserver-2 10.128.14.40{noformat}
You can see that regionserver-2 came back up with IP 10.128.14.40, but the 
hbasemaster still tries to contact 10.128.14.39.

 

> [k8s] The crashed node exists in onlineServer forever, and if it holds the 
> meta data, master will start up hang.
> 
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
>  Issue Type: Bug
>Reporter: lujie
>Priority: Critical
> Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> While the master freshly boots, we crash (kill -9) the RS that holds meta. We find 
> that the master startup fails and prints thousands of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN 

[jira] [Updated] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-21 Thread Andrey Elenskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-22041:

Attachment: hbasemaster.log

> [k8s] The crashed node exists in onlineServer forever, and if it holds the 
> meta data, master will start up hang.
> 
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
>  Issue Type: Bug
>Reporter: lujie
>Priority: Critical
> Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> While the master freshly boots, we crash (kill -9) the RS that holds meta. We find 
> that the master startup fails and prints thousands of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-20 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112500#comment-17112500
 ] 

Andrey Elenskiy commented on HBASE-22041:
-

> It starts after the container comes back w/ new IP (looks the same though 
> across textboxes)? 

Right, the pod starts with a new IP address, but for some reason the master is 
still trying to reach the old IP address.

 

> What are the dns timeouts on this host?

We set the -Dsun.net.inetaddr.ttl=10 option and it seems to work in other places. 
Is it necessary to set the "networkaddress.cache.ttl" property as well?
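For what it's worth, a minimal sketch of the two JVM DNS cache knobs mentioned above, set programmatically via the java.security properties (the networkaddress.cache.* names are the documented counterparts of the legacy sun.net.inetaddr.ttl flag; whether HBase picks them up depends on them being set before the first lookup):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // Documented equivalent of -Dsun.net.inetaddr.ttl=10: cache successful
        // lookups for at most 10 seconds. Must be set before the first lookup.
        Security.setProperty("networkaddress.cache.ttl", "10");
        // Failed (negative) lookups are cached under a separate property;
        // 0 disables negative caching entirely.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");

        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
        System.out.println(Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```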

> The crashed node exists in onlineServer forever, and if it holds the meta 
> data, master will start up hang.
> --
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
>  Issue Type: Bug
>Reporter: lujie
>Priority: Critical
> Attachments: bug.zip, normal.zip
>
>
> While the master freshly boots, we crash (kill -9) the RS that holds meta. We find 
> that the master startup fails and prints thousands of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 

[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-18 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110623#comment-17110623
 ] 

Andrey Elenskiy commented on HBASE-22041:
-

Just reproduced this again, and I'm seeing the ServerCrashProcedure stuck in 
state=WAITING:SERVER_CRASH_FINISH for the regionserver it keeps trying to 
reconnect to.

The ServerCrashProcedure is waiting on a TransitRegionStateProcedure with 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, which in turn is waiting 
on an OpenRegionProcedure with state=RUNNABLE.

Regions in transition are stuck in OPENING state for a regionserver that is 
online.

If I'm understanding the logs correctly, the master is trying to connect to the 
old IP address of the restarted regionserver. In Kubernetes, a restarted pod 
gets a new IP address but keeps its hostname (if it's part of a StatefulSet). 
So there's an assumption somewhere in HBase that the IP address doesn't change, 
or the IP address resolution is cached. In this particular case the master 
appears to correctly assign regions to the online regionserver but still uses 
the old IP address.
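If the root cause really is a cached address resolution, the general fix 
direction is to re-resolve the hostname on every (re)connect attempt instead 
of reusing a previously resolved InetSocketAddress. A minimal sketch of that 
idea in plain Java (illustrative only; this is not HBase's connection code, 
and the class name is hypothetical):

```java
import java.net.InetSocketAddress;

// Hypothetical sketch: keep only host/port and resolve at connect time, so a
// pod that comes back with a new IP is picked up instead of a stale cached
// resolution being reused until restart.
public final class ReResolvingAddress {
    private final String host;
    private final int port;

    public ReResolvingAddress(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Performs a fresh DNS lookup on every call instead of caching the first one. */
    public InetSocketAddress resolveNow() {
        return new InetSocketAddress(host, port);
    }

    /** Defers resolution entirely; useful when the transport resolves lazily. */
    public InetSocketAddress unresolved() {
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

The trade-off is an extra DNS lookup per reconnect, which is usually cheap 
next to a failed RPC retry loop like the one above.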

> The crashed node exists in onlineServer forever, and if it holds the meta 
> data, master will start up hang.
> --
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
>  Issue Type: Bug
>Reporter: lujie
>Priority: Critical
> Attachments: bug.zip, normal.zip
>
>
> While the master is freshly booting, we crash (kill -9) the RS that holds 
> meta. We find that the master startup fails and prints thousands of log lines 
> like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] 
> 

[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-18 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110601#comment-17110601
 ] 

Andrey Elenskiy commented on HBASE-22041:
-

Here's the error showing that the address keeps getting re-added to the failed 
servers list:
{code:java}
2020-05-18 17:52:20,249 TRACE [RSProcedureDispatcher-pool3-t133] 
procedure.RSProcedureDispatcher: Building request with operations count=1
2020-05-18 17:52:20,249 TRACE [RSProcedureDispatcher-pool3-t133] 
ipc.NettyRpcConnection: Connecting to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
2020-05-18 17:52:20,254 TRACE [RS-EventLoopGroup-1-1] ipc.AbstractRpcClient: 
Call: ExecuteProcedures, callTime: 5ms
2020-05-18 17:52:20,254 DEBUG [RS-EventLoopGroup-1-1] ipc.FailedServers: Added 
failed server with address 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 to list caused 
by 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: No route to host: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
2020-05-18 17:52:20,254 DEBUG [RSProcedureDispatcher-pool3-t133] 
procedure.RSProcedureDispatcher: request to 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed, 
try=1480
2020-05-18 17:52:20,255 WARN [RSProcedureDispatcher-pool3-t133] 
procedure.RSProcedureDispatcher: request to server 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed due to 
java.net.ConnectException: Call to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on 
connection exception: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: No route to host: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020, try=1480, 
retrying...
 at 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:924)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:267)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:261)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:495)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:474)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:631)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:328)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.lang.Thread.run(Thread.java:748)
Caused by: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: No route to host: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
 at 
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors.throwConnectException(Errors.java:112)
 at 
org.apache.hbase.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:269)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:667)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:644)
 ... 6 more
Caused by: java.net.ConnectException: finishConnect(..) failed: No route to host
 ... 10 more
{code}

[jira] [Commented] (HBASE-22041) The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

2020-05-18 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110500#comment-17110500
 ] 

Andrey Elenskiy commented on HBASE-22041:
-

We are seeing the same issue on 2.2.4 running in Kubernetes. It appears to be 
due to the address of the failed regionserver repeatedly getting re-added to 
the failed servers list whenever RSProcedureDispatcher.sendRequest is called.

Here's with TRACE logging enabled:
{code:java}
2020-05-18 17:52:19,643 TRACE [RSProcedureDispatcher-pool3-t127] 
procedure.RSProcedureDispatcher: Building request with operations count=1
2020-05-18 17:52:19,644 DEBUG [RSProcedureDispatcher-pool3-t127] 
ipc.AbstractRpcClient: Not trying to connect to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 this server is 
in the failed servers list
2020-05-18 17:52:19,644 TRACE [RSProcedureDispatcher-pool3-t127] 
ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 0ms
2020-05-18 17:52:19,644 DEBUG [RSProcedureDispatcher-pool3-t127] 
procedure.RSProcedureDispatcher: request to 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed, 
try=1474
org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on local 
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in 
the failed servers list: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
 at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
 at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
 at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585)
 at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006)
 at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349)
 at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in 
the failed servers list: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433)
 ... 9 more
2020-05-18 17:52:19,644 WARN [RSProcedureDispatcher-pool3-t127] 
procedure.RSProcedureDispatcher: request to server 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed due to 
org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on local 
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in 
the failed servers list: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020, try=1474, 
retrying...{code}
In our case it doesn't recover automatically, and we have to restart the HBase 
master to get out of this state.
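The short-circuit behavior in the log above comes from a time-expiring failed 
servers list: every failed connect re-adds the address, so while retries keep 
failing, the server never leaves the list. A minimal sketch of such a list 
(names and the expiry handling are illustrative, in the spirit of HBase's 
ipc.FailedServers rather than its actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a time-expiring failed-servers list: an address re-added on every
// failed attempt keeps sliding its expiry forward, so callers short-circuit
// with a FailedServerException until a full window passes with no new failure.
public final class FailedServerList {
    private final Map<String, Long> failedUntil = new HashMap<>();
    private final long expiryMillis;

    public FailedServerList(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Record a failure: the address is considered failed until now + expiry. */
    public void addFailure(String address, long nowMillis) {
        failedUntil.put(address, nowMillis + expiryMillis);
    }

    /** True while the most recent failure is still within the expiry window. */
    public boolean isFailed(String address, long nowMillis) {
        Long until = failedUntil.get(address);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            failedUntil.remove(address); // window elapsed; allow a real attempt
            return false;
        }
        return true;
    }
}
```

HBase expires entries after a short configurable window, which is why the 
retry counter (try=1480 above) keeps climbing: each real attempt fails and 
re-arms the window.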


[jira] [Commented] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles

2020-05-05 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100334#comment-17100334
 ] 

Andrey Elenskiy commented on HBASE-24273:
-

Great, thanks for a quick fix!

> HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles
> 
>
> Key: HBASE-24273
> URL: https://issues.apache.org/jira/browse/HBASE-24273
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>Affects Versions: 2.2.4
> Environment: HBase 2.2.4
> Hadoop 3.1.3
>Reporter: Andrey Elenskiy
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> This issue came up after merging regions. MergeTableRegionsProcedure removes 
> the parent regions from hbase:meta and creates HFile references in child 
> region to the old parent regions. Running `hbck_chore_run` right after the 
> `merge_region` will show the parent regions in "Orphan Regions on FileSystem" 
> until major compaction is run on child region which will remove HFile 
> references and cause Catalog Janitor to clean up the parent regions.
> There are probably other situations that can cause the same issue (maybe a 
> region split?).
> Having "Orphan Regions on FileSystem" list parent regions and suggest 
> "_hbase completebulkload_" is dangerous in this case, as completing the bulk 
> load will lead to stale HFile references in the child region, which will 
> cause its OPEN to fail because the referenced HFile doesn't exist.
> Figuring these things out is tedious for database administrators, so I think 
> it would be reasonable to not consider regions with referenced HFiles to be 
> orphans (or at least to give an extra hint saying that the region has 
> referenced HFiles).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24255) GCRegionProcedure doesn't assign region from RegionServer leading to orphans

2020-04-29 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095956#comment-17095956
 ] 

Andrey Elenskiy commented on HBASE-24255:
-

[~huaxiangsun] yes, you got the idea right.

> but somehow the merge*** qualifers were not cleaned up from new merged child 
> region in meta table (maybe master crashed before 
> GCMultipleMergedRegionsProcedure is started)

That's actually due to HBASE-24273: addMissingRegionsInMeta will re-add those 
"orphans" without checking whether a merge qualifier exists. I think fixing 
HBASE-24273 will resolve this particular instance.

But I'm still wondering if there are other situations where GCRegionProcedure 
should also make sure the region is unassigned from the regionserver; that 
would be more generic, as I've seen this happen even without region merges (I 
don't recall the exact case anymore).
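The merge-qualifier check mentioned above could look roughly like this: before 
treating a region found only on the filesystem as missing from meta, check 
whether any live region still records it as a merge parent. A simplified, 
hypothetical sketch (the data model below is a stand-in, not the 
MetaTableAccessor API):

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: mergeQualifiers maps each live (merged child) region to
// the parent regions it was merged from, mirroring the merge0000/merge0001...
// qualifiers kept in hbase:meta. A candidate that still appears as someone's
// merge parent is awaiting GC after major compaction, not an orphan.
public final class MergeParentCheck {
    public static boolean isReferencedAsMergeParent(
            String candidate, Map<String, List<String>> mergeQualifiers) {
        for (List<String> parents : mergeQualifiers.values()) {
            if (parents.contains(candidate)) {
                return true; // still referenced; do not re-add or report it
            }
        }
        return false;
    }
}
```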

> GCRegionProcedure doesn't assign region from RegionServer leading to orphans
> 
>
> Key: HBASE-24255
> URL: https://issues.apache.org/jira/browse/HBASE-24255
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2, Region Assignment, regionserver
>Affects Versions: 2.2.4
> Environment: hbase 2.2.4
> hadoop 3.1.3
>Reporter: Andrey Elenskiy
>Assignee: niuyulin
>Priority: Major
>
> We've found ourselves in a situation where parents of merged or split regions 
> needed to be opened again on a regionserver due to having to recover from a 
> cluster meltdown (HBCK2's fixMeta kicks off GCMultipleMergedRegionsProcedure, 
> which requires all regions to be merged to be open). Then, when a 
> GCRegionProcedure is kicked off to clean a parent region up by 
> GCMultipleMergedRegionsProcedure, it ends up deleting it from hbase:meta but 
> doesn't unassign it from the RegionServer, leading it to show up in "Orphan 
> Regions on RegionServer" in the hbck tab of HBase Master. Also, the hbase 
> client doesn't detect that the region is closed, because it's still 
> technically open on a regionserver (the client doesn't reread hbase:meta all 
> the time). The only way to recover from this is to restart the regionserver, 
> which isn't ideal as it can lead to other issues in clusters with region 
> inconsistencies.





[jira] [Commented] (HBASE-24255) GCRegionProcedure doesn't assign region from RegionServer leading to orphans

2020-04-28 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094943#comment-17094943
 ] 

Andrey Elenskiy commented on HBASE-24255:
-

I don't really see how that addresses the issue in the description. The problem 
I was trying to describe can happen if I run HBCK2's addMissingRegionsInMeta, 
which ends up re-adding parents of a merged region into meta and assigns them 
to a RegionServer. Then, when GCRegionProcedure runs, it removes the region 
from hbase:meta and the FS but doesn't unassign the region from the 
regionserver. Hence, I'd like GCRegionProcedure to actually make sure the 
region is not assigned on any regionserver (otherwise we end up with "Orphan 
Regions on RegionServer").






[jira] [Commented] (HBASE-24250) CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region

2020-04-28 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094713#comment-17094713
 ] 

Andrey Elenskiy commented on HBASE-24250:
-

> the critical problem is why GCMultipleMergedRegionsProcedure not work

It did work; there were just so many of them because the Catalog Janitor kept 
resubmitting them for the same regions over and over again.

> and then something went wrong and caused a different procedure to stall

We had an issue where running out of direct memory caused regionservers to 
crashloop and the master to keep resubmitting region assignments. I think 
that's a separate issue from what I've addressed here.
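The resubmission problem this issue describes can be avoided by tracking which 
regions already have an in-flight GC procedure, so a janitor scan that fires 
before the backlog drains doesn't enqueue duplicates. A hypothetical sketch of 
that fix direction (class and method names are illustrative, not the actual 
CatalogJanitor code):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: before each janitor scan submits a GC procedure for a merged region,
// consult a set of regions with an outstanding procedure. Only the first
// submission per region goes through; later scans skip it until it completes.
public final class GcSubmissionTracker {
    private final Set<String> inFlight = new HashSet<>();

    /** Returns true only for the first submission while the region is in flight. */
    public boolean trySubmit(String regionName) {
        return inFlight.add(regionName);
    }

    /** Called when the GC procedure for the region finishes (or fails). */
    public void markDone(String regionName) {
        inFlight.remove(regionName);
    }
}
```

Since the procedures themselves are idempotent, this tracker only has to be 
best-effort: a missed markDone would at worst delay one resubmission, not 
corrupt anything.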

> CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region
> -
>
> Key: HBASE-24250
> URL: https://issues.apache.org/jira/browse/HBASE-24250
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.2.4
> Environment: hdfs 3.1.3 with erasure coding
> hbase 2.2.4
>Reporter: Andrey Elenskiy
>Assignee: niuyulin
>Priority: Major
>
> If a lot of regions were merged (due to change of region sizes, for example), 
> there can be a long backlog of procedures to clean up the merged regions. If 
> going through this backlog is slower than the CatalogJanitor's scan interval, 
> it will end resubmitting GCMultipleMergedRegionsProcedure for the same 
> regions over and over again.





[jira] [Commented] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles

2020-04-27 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094048#comment-17094048
 ] 

Andrey Elenskiy commented on HBASE-24273:
-

Hm, interesting that you say the parent regions should be repopulated from 
hbase:meta; maybe that's the main root cause of the issue here. 
MergeTableRegionsProcedure calls updateMetaForMergedRegions in the 
MERGE_TABLE_REGIONS_UPDATE_META state. That function calls 
AssignmentManager.markRegionAsMerged, which actually deletes the region from 
hbase:meta 
([https://github.com/apache/hbase/blob/346d087f409f9b44754d1d4426492c1ecd02ea89/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1866]
 eventually calls MetaTableAccessor.mergeRegions, which sends a delete to 
hbase:meta). So that's why I observe that the orphans on the FS stay around 
until major compaction completes and the Catalog Janitor GCs them.

If you say it should repopulate from hbase:meta, maybe it shouldn't remove it 
from there?






[jira] [Commented] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles

2020-04-27 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094016#comment-17094016
 ] 

Andrey Elenskiy commented on HBASE-24273:
-

But if the master is restarted, the in-memory state will be lost and we'd end 
up in the same situation, which is unexpected IMHO. It will probably be more 
reliable to actually cross-reference the state against the FS, given that the 
category is called "Orphan regions *on FileSystem*". HbckChore.loadRegionsFromFS 
looks like the place where the logic to count back references could be added.
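The cross-reference proposed here amounts to subtracting, from the set of 
FS-only region directories, every region that some other region still points 
at via an HFile reference. A simplified sketch under that assumption (the 
inputs are stand-ins for what a chore would actually read from the 
filesystem; this is not HbckChore code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: a region directory absent from hbase:meta is reported as an orphan
// only if no other region holds a reference file pointing back at it. Regions
// that are still referenced are merge/split parents awaiting major compaction
// plus Catalog Janitor GC, not orphans.
public final class OrphanRegionFilter {
    public static Set<String> findTrueOrphans(
            Set<String> fsOnlyRegions, List<String> referencedParentRegions) {
        Set<String> orphans = new HashSet<>(fsOnlyRegions);
        orphans.removeAll(referencedParentRegions);
        return orphans;
    }
}
```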






[jira] [Commented] (HBASE-24189) Regionserver recreates region folders in HDFS after replaying WAL with removed table entries

2020-04-27 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093922#comment-17093922
 ] 

Andrey Elenskiy commented on HBASE-24189:
-

Haven't planned on a patch; I'm not exactly certain what the right solution 
would be here without risking data loss.

> But what if the table is deleted and not recreated. 
I haven't actually tested this. It could be the same case, or maybe the 
regionserver actually checks that the table doesn't exist anymore and so 
doesn't create the directories.

> We might have to check whether the region exists or not also as part of the 
> last flushed seqId look up and if the regions does not exists at all, we 
> might have to just ignore those entries from WAL.
Would this be a safe thing to do? I'm not familiar with the edge cases, but 
what happens if the WAL isn't flushed before the region is removed? Could that 
cause data loss? For example, if a region is split or merged and the WAL isn't 
flushed before the child region opens and the parent regions close (I don't 
know if it always gets flushed in those cases), then GCRegionProcedure will 
remove the parent regions; any remaining WAL edits for the parent regions 
should be replayed into the child region instead of being discarded.
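The trade-off discussed above can be made concrete: a WAL edit for a vanished 
region is only safe to drop if there is no live child region it should be 
replayed into. A hypothetical sketch of that routing decision (not actual 
WAL-replay code; the lookup function is an assumption standing in for 
whatever would resolve a GC'd parent to its child):

```java
import java.util.Set;
import java.util.function.Function;

// Sketch: during replay, route each edit. An edit for a live region replays
// normally; an edit for a GC'd merge/split parent replays into its child; only
// an edit for a region that is gone with its table (no child) may be skipped.
public final class WalEditFilter {
    /**
     * @param existingRegions regions currently known to hbase:meta
     * @param childOf resolves a removed parent to its live child, or null if
     *                none exists (e.g. the table itself was deleted)
     * @return the region to replay the edit into, or null meaning "skip edit"
     */
    public static String targetRegion(String editRegion,
                                      Set<String> existingRegions,
                                      Function<String, String> childOf) {
        if (existingRegions.contains(editRegion)) {
            return editRegion; // normal replay path
        }
        return childOf.apply(editRegion); // child region, or null => skip
    }
}
```

Whether skipping is ever safe is exactly the open question in this thread; the 
sketch just separates the "replay into child" case from the "truly gone" case.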

> Regionserver recreates region folders in HDFS after replaying WAL with 
> removed table entries
> 
>
> Key: HBASE-24189
> URL: https://issues.apache.org/jira/browse/HBASE-24189
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 2.2.4
> Environment: * HDFS 3.1.3
>  * HBase 2.1.4
>  * OpenJDK 8
>Reporter: Andrey Elenskiy
>Assignee: Anoop Sam John
>Priority: Major
>
> Under the following scenario region directories in HDFS can be recreated with 
> only recovered.edits in them:
>  # Create table "test"
>  # Put into "test"
>  # Delete table "test"
>  # Create table "test" again
>  # Crash the regionserver to which the put has went to force the WAL replay
>  # Region directory in old table is recreated in new table
>  # hbase hbck returns inconsistency
> This appears to happen due to the fact that WALs are not cleaned up once a 
> table is deleted, so they still contain the edits from the old table. I've tried 
> the wal_roll command on the regionserver before crashing it, but it doesn't seem 
> to help, as under some circumstances there are still WAL files around. The 
> only solution that works consistently is to restart the regionserver before 
> creating the table at step 4, because that triggers log cleanup on startup: 
> https://github.com/apache/hbase/blob/f3ee9b8aa37dd30d34ff54cd39fb9b4b6d22e683/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java#L508
>  
> Truncating the table would also be a workaround, but in our case it's a no-go as 
> we create and delete tables in our tests, which run back to back (create the table 
> at the beginning of the test and delete it at the end).
> A nice option in our case would be an hbase shell utility to force 
> clean-up of log files manually, as I realize that it's not really viable to 
> clean all of those up every time a table is removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24250) CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region

2020-04-27 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093918#comment-17093918
 ] 

Andrey Elenskiy commented on HBASE-24250:
-

Yes, GCMultipleMergedRegionsProcedure and GCRegionProcedure are idempotent. 
However, when they pile up, the HBase master has to churn through all of them 
before it can get to actually useful procedures, which is a pretty annoying 
situation.

For example, in our cluster we ended up merging over 700 regions, and then 
something went wrong and caused a different procedure to stall (which 
unfortunately happens more often than I would like). As we didn't notice the 
stalled procedure right away, we ended up with over 20k 
GCMultipleMergedRegionsProcedure in the backlog. It was quite tedious to figure 
out why we had so many of those, realize that we needed to disable the catalog 
janitor, bypass all of the GC procedures via HBCK2, and only then get to 
actually fixing the stalled procedure. This caused a pretty long downtime for 
the entire cluster.
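The resubmission problem described here boils down to missing dedup state: the CatalogJanitor has no memory of which merged regions already have a GC procedure in flight. A minimal sketch, with hypothetical names, of the kind of tracking it would need:

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch (hypothetical API, not the actual CatalogJanitor code): remember
 *  which merged regions already have a GC procedure scheduled and skip
 *  resubmitting them on the next meta scan. */
public class GcSubmissionTracker {
    private final Set<String> inFlight = new HashSet<>();

    /** @return true if a new GC procedure should be scheduled for the region. */
    boolean trySubmit(String mergedRegion) {
        return inFlight.add(mergedRegion); // false if already pending
    }

    /** Called when the GC procedure for the region completes (or is bypassed). */
    void markDone(String mergedRegion) {
        inFlight.remove(mergedRegion);
    }

    public static void main(String[] args) {
        GcSubmissionTracker t = new GcSubmissionTracker();
        System.out.println(t.trySubmit("region-1")); // true: schedule GC
        System.out.println(t.trySubmit("region-1")); // false: already in flight
        t.markDone("region-1");
        System.out.println(t.trySubmit("region-1")); // true: safe to resubmit
    }
}
```

With 20k procedures in the backlog, each janitor scan that lacks this check re-enqueues every still-unprocessed region, which is exactly the pile-up seen here.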

> CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region
> -
>
> Key: HBASE-24250
> URL: https://issues.apache.org/jira/browse/HBASE-24250
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.2.4
> Environment: hdfs 3.1.3 with erasure coding
> hbase 2.2.4
>Reporter: Andrey Elenskiy
>Assignee: niuyulin
>Priority: Major
>
> If a lot of regions were merged (due to a change of region sizes, for example), 
> there can be a long backlog of procedures to clean up the merged regions. If 
> going through this backlog is slower than the CatalogJanitor's scan interval, 
> the janitor will end up resubmitting GCMultipleMergedRegionsProcedure for the 
> same regions over and over again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles

2020-04-27 Thread Andrey Elenskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-24273:

Description: 
This issue came up after merging regions. MergeTableRegionsProcedure removes 
the parent regions from hbase:meta and creates HFile references in child region 
to the old parent regions. Running `hbck_chore_run` right after the 
`merge_region` will show the parent regions in "Orphan Regions on FileSystem" 
until major compaction is run on child region which will remove HFile 
references and cause Catalog Janitor to clean up the parent regions.

There are probably other situations which can cause the same issue (maybe 
region split?)

Having "Orphan Regions on FileSystem" list parent regions and suggest to 
"_hbase completebulkload_" is dangerous in this case as completing bulk load 
will lead to stale HFile references in child region which will cause its OPEN 
to fail because referenced HFile doesn't exist.

Figuring out these things for database administrators is tedious, so I think it 
would be reasonable to not consider regions with referenced HFiles to be 
orphans (or maybe could give an extra hint saying that it has referenced 
HFiles).

  was:
This issue came up after merging regions. MergeTableRegionsProcedure removes 
the parent regions from hbase:meta and creates HFile references in child region 
to the old parent regions. Running `hbck_chore_run` right after the 
`merge_region` will show the parent regions in "Orphan Regions on FileSystem" 
until major compaction is run on child region which will remove HFile 
references and cause Catalog Janitor to clean up the parent regions.

There are probably other situations which can cause the same issue (maybe 
region split?)

Having "Orphan Regions on FileSystem" list parent regions and suggest to 
"_hbase completebulkload_" is dangerous in this case as completing bulk load in 
this case will lead to stale HFile references in child region which will cause 
it's OPEN to fail because referenced HFile doesn't exist.

Figuring out these things for database administrators is tedious, so I think it 
would be reasonable to not consider regions with referenced  HFiles to be 
orphans (or maybe could give an extra hint saying that it has referenced 
HFiles).


> HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles
> 
>
> Key: HBASE-24273
> URL: https://issues.apache.org/jira/browse/HBASE-24273
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>Affects Versions: 2.2.4
> Environment: HBase 2.2.4
> Hadoop 3.1.3
>Reporter: Andrey Elenskiy
>Priority: Critical
> Fix For: 3.0.0, 2.3.0
>
>
> This issue came up after merging regions. MergeTableRegionsProcedure removes 
> the parent regions from hbase:meta and creates HFile references in child 
> region to the old parent regions. Running `hbck_chore_run` right after the 
> `merge_region` will show the parent regions in "Orphan Regions on FileSystem" 
> until major compaction is run on child region which will remove HFile 
> references and cause Catalog Janitor to clean up the parent regions.
> There are probably other situations which can cause the same issue (maybe 
> region split?)
> Having "Orphan Regions on FileSystem" list parent regions and suggest to 
> "_hbase completebulkload_" is dangerous in this case as completing bulk load 
> will lead to stale HFile references in child region which will cause its OPEN 
> to fail because referenced HFile doesn't exist.
> Figuring out these things for database administrators is tedious, so I think 
> it would be reasonable to not consider regions with referenced HFiles to be 
> orphans (or maybe could give an extra hint saying that it has referenced 
> HFiles).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24273) HBCK's "Orphan Regions on FileSystem" reports regions with referenced HFiles

2020-04-27 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-24273:
---

 Summary: HBCK's "Orphan Regions on FileSystem" reports regions 
with referenced HFiles
 Key: HBASE-24273
 URL: https://issues.apache.org/jira/browse/HBASE-24273
 Project: HBase
  Issue Type: Bug
  Components: hbck2
Affects Versions: 2.2.4
 Environment: HBase 2.2.4

Hadoop 3.1.3
Reporter: Andrey Elenskiy


This issue came up after merging regions. MergeTableRegionsProcedure removes 
the parent regions from hbase:meta and creates HFile references in child region 
to the old parent regions. Running `hbck_chore_run` right after the 
`merge_region` will show the parent regions in "Orphan Regions on FileSystem" 
until major compaction is run on child region which will remove HFile 
references and cause Catalog Janitor to clean up the parent regions.

There are probably other situations which can cause the same issue (maybe 
region split?)

Having "Orphan Regions on FileSystem" list parent regions and suggest 
"_hbase completebulkload_" is dangerous in this case, as completing the bulk load 
will leave stale HFile references in the child region, which will cause its OPEN 
to fail because the referenced HFile doesn't exist.

Figuring out these things for database administrators is tedious, so I think it 
would be reasonable to not consider regions with referenced HFiles to be 
orphans (or to at least give an extra hint saying that the region has referenced 
HFiles).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24255) GCRegionProcedure doesn't assign region from RegionServer leading to orphans

2020-04-24 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-24255:
---

 Summary: GCRegionProcedure doesn't assign region from RegionServer 
leading to orphans
 Key: HBASE-24255
 URL: https://issues.apache.org/jira/browse/HBASE-24255
 Project: HBase
  Issue Type: Bug
  Components: proc-v2, Region Assignment, regionserver
Affects Versions: 2.2.4
 Environment: hbase 2.2.4

hadoop 3.1.3
Reporter: Andrey Elenskiy


We found ourselves in a situation where parents of merged or split regions 
needed to be opened again on a regionserver while recovering from a cluster 
meltdown (HBCK2's fixMeta kicks off GCMultipleMergedRegionsProcedure, which 
requires all regions being merged to be open). Then, when 
GCMultipleMergedRegionsProcedure kicks off a GCRegionProcedure to clean up a 
parent region, the procedure deletes the region from hbase:meta but doesn't 
unassign it from the RegionServer, so it shows up in "Orphan Regions on 
RegionServer" in the hbck tab of the HBase Master. The HBase client doesn't 
detect that the region is closed either, because it's still technically open on 
a regionserver (the client doesn't reread hbase:meta all the time). The only way 
to recover from this is to restart the regionserver, which isn't ideal as it can 
lead to other issues in clusters with region inconsistencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24250) CatalogJanitor resubmits GCMultipleMergedRegionsProcedure for the same region

2020-04-23 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-24250:
---

 Summary: CatalogJanitor resubmits GCMultipleMergedRegionsProcedure 
for the same region
 Key: HBASE-24250
 URL: https://issues.apache.org/jira/browse/HBASE-24250
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.2.4
 Environment: hdfs 3.1.3 with erasure coding

hbase 2.2.4
Reporter: Andrey Elenskiy


If a lot of regions were merged (due to a change of region sizes, for example), 
there can be a long backlog of procedures to clean up the merged regions. If 
going through this backlog is slower than the CatalogJanitor's scan interval, 
the janitor will end up resubmitting GCMultipleMergedRegionsProcedure for the 
same regions over and over again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24189) Regionserver recreates region folders in HDFS after replaying WAL with removed table entries

2020-04-14 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HBASE-24189:
---

 Summary: Regionserver recreates region folders in HDFS after 
replaying WAL with removed table entries
 Key: HBASE-24189
 URL: https://issues.apache.org/jira/browse/HBASE-24189
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 2.2.4
 Environment: * HDFS 3.1.3
 * HBase 2.1.4
 * OpenJDK 8
Reporter: Andrey Elenskiy


Under the following scenario region directories in HDFS can be recreated with 
only recovered.edits in them:
 # Create table "test"
 # Put into "test"
 # Delete table "test"
 # Create table "test" again
 # Crash the regionserver to which the put went, to force a WAL replay
 # Region directory in old table is recreated in new table
 # hbase hbck returns inconsistency

This appears to happen due to the fact that WALs are not cleaned up once a 
table is deleted, so they still contain the edits from the old table. I've tried 
the wal_roll command on the regionserver before crashing it, but it doesn't seem 
to help, as under some circumstances there are still WAL files around. The only 
solution that works consistently is to restart the regionserver before creating 
the table at step 4, because that triggers log cleanup on startup: 
https://github.com/apache/hbase/blob/f3ee9b8aa37dd30d34ff54cd39fb9b4b6d22e683/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java#L508

 

Truncating the table would also be a workaround, but in our case it's a no-go as 
we create and delete tables in our tests, which run back to back (create the 
table at the beginning of the test and delete it at the end).

A nice option in our case would be an hbase shell utility to force clean-up of 
log files manually, as I realize that it's not really viable to clean all of 
those up every time a table is removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps

2019-06-11 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861636#comment-16861636
 ] 

Andrey Elenskiy commented on HBASE-21476:
-

Hello, I'd like to restart trying to get this merged, as we've been running this 
change in production for some time now and upgrades are tough because we need 
to recompile every time. However, I'm not a fan of all the "if" checks that 
I've added all over the place and would like to make the implementation a bit 
more flexible.

What do you think about making java.time.Clock a field of HRegion (so each 
region would have its own clock, as described here: 
https://docs.google.com/a/arista.com/document/d/1LL2GAodiYi0waBz5ODGL4LDT4e_bXy8P9h6kWC05Bhw/edit?disco=AQ5zZuM)?
Then most of the places within HRegion would use java.time.Instant and 
occasionally call a public helper function in HRegion that converts an Instant 
to either milliseconds or nanoseconds (depending on whether the region's table 
has the NANOSECOND_TIMESTAMPS attribute). The nice thing about this is that the 
nanosecond vs. millisecond decision would be contained to a single function, and 
there's no need to pass "isNanosecondTimestamps" around everywhere. StoreScanner 
and various compaction routines would also use the region's clock (instead of 
EnvironmentEdge) to make decisions, which is a good side effect as well. This 
approach can also be implemented in iterative steps.
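A minimal sketch of the proposed helper, with hypothetical names (this is not actual HRegion code); the point is that the nanosecond-vs-millisecond choice is confined to one method:

```java
import java.time.Clock;
import java.time.Instant;

/** Sketch of a per-region clock: one helper converts an Instant to the
 *  table's timestamp resolution, so call sites never branch on the
 *  NANOSECOND_TIMESTAMPS attribute themselves. */
public class RegionClock {
    private final Clock clock;
    private final boolean nanosecondTimestamps; // table's NANOSECOND_TIMESTAMPS attr

    RegionClock(Clock clock, boolean nanosecondTimestamps) {
        this.clock = clock;
        this.nanosecondTimestamps = nanosecondTimestamps;
    }

    /** Convert an Instant to the table's timestamp resolution. */
    long toTimestamp(Instant i) {
        return nanosecondTimestamps
                ? i.getEpochSecond() * 1_000_000_000L + i.getNano()
                : i.toEpochMilli();
    }

    /** What call sites would use instead of EnvironmentEdge.currentTime(). */
    long currentTimestamp() {
        return toTimestamp(clock.instant());
    }

    public static void main(String[] args) {
        Instant t = Instant.ofEpochSecond(1, 500_000_000); // 1.5 s after epoch
        System.out.println(new RegionClock(Clock.systemUTC(), false).toTimestamp(t)); // 1500
        System.out.println(new RegionClock(Clock.systemUTC(), true).toTimestamp(t));  // 1500000000
    }
}
```

Injecting a fixed Clock (java.time.Clock.fixed) would also make the compaction and TTL decision paths testable without an EnvironmentEdge override.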

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> HBASE-21476.branch-2.1.0003.patch, HBASE-21476.branch-2.1.0004.patch, 
> nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.
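The automatic unit adjustment mentioned in the description (column family TTL in seconds, "hbase.hstore.time.to.purge.deletes" in milliseconds) could be sketched like this; the method names are illustrative, not the actual HBase API:

```java
/** Sketch: server-side options keep their documented units and are scaled
 *  to the table's internal timestamp resolution in one place. */
public class TtlScaling {
    /** Column family TTL is configured in seconds. */
    static long ttlToInternal(long ttlSeconds, boolean nanosecondTimestamps) {
        return nanosecondTimestamps
                ? ttlSeconds * 1_000_000_000L   // seconds -> nanoseconds
                : ttlSeconds * 1_000L;          // seconds -> milliseconds
    }

    /** hbase.hstore.time.to.purge.deletes is configured in milliseconds. */
    static long purgeDeletesToInternal(long millis, boolean nanosecondTimestamps) {
        return nanosecondTimestamps ? millis * 1_000_000L : millis;
    }

    public static void main(String[] args) {
        // A 60-second CF TTL on a nanosecond table vs. a millisecond table:
        System.out.println(ttlToInternal(60, true));  // 60000000000
        System.out.println(ttlToInternal(60, false)); // 60000
        System.out.println(purgeDeletesToInternal(5000, true)); // 5000000000
    }
}
```

Per-cell TTLs, by contrast, are set by clients directly, which is why the description says clients must scale those themselves once the attribute is enabled.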



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap

2019-02-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21882:

Labels: NEW_VERSION_BEHAVIOR query  (was: )

> NEW_VERSION_BEHAVIOR blows up the heap
> --
>
> Key: HBASE-21882
> URL: https://issues.apache.org/jira/browse/HBASE-21882
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.1.2
>Reporter: Andrey Elenskiy
>Priority: Major
>  Labels: NEW_VERSION_BEHAVIOR, query
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap

2019-02-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21882:

Description: 
We've enabled NEW_VERSION_BEHAVIOR on our cluster, which serves a moderate 
amount of tiny scan requests in parallel, and noticed that the heap grows to its 
max (10Gi), causing GC (CMS) to kick in and significantly slowing down the 
regionserver. This can be reproduced by doing 50+ scan requests on a single row 
in parallel. The heap usage goes down once the requests finish.

Looking at NewVersionBehaviorTracker, it allocates two TreeMaps and a gazillion 
private fields for every scan request. I haven't profiled the cause of this 
memory bomb, but would guess that NewVersionBehaviorTracker is not a small 
object to allocate so often.

Let me know if I can provide additional information.

  was:We've enabled


> NEW_VERSION_BEHAVIOR blows up the heap
> --
>
> Key: HBASE-21882
> URL: https://issues.apache.org/jira/browse/HBASE-21882
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.1.2
>Reporter: Andrey Elenskiy
>Priority: Major
>  Labels: NEW_VERSION_BEHAVIOR, query
>
> We've enabled NEW_VERSION_BEHAVIOR on our cluster, which serves a moderate 
> amount of tiny scan requests in parallel, and noticed that the heap grows to 
> its max (10Gi), causing GC (CMS) to kick in and significantly slowing down the 
> regionserver. This can be reproduced by doing 50+ scan requests on a single 
> row in parallel. The heap usage goes down once the requests finish.
> Looking at NewVersionBehaviorTracker, it allocates two TreeMaps and a 
> gazillion private fields for every scan request. I haven't profiled the cause 
> of this memory bomb, but would guess that NewVersionBehaviorTracker is not a 
> small object to allocate so often.
> Let me know if I can provide additional information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap

2019-02-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21882:

Description: We've enabled

> NEW_VERSION_BEHAVIOR blows up the heap
> --
>
> Key: HBASE-21882
> URL: https://issues.apache.org/jira/browse/HBASE-21882
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.1.2
>Reporter: Andrey Elenskiy
>Priority: Major
>  Labels: NEW_VERSION_BEHAVIOR, query
>
> We've enabled



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap

2019-02-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21882:

Summary: NEW_VERSION_BEHAVIOR blows up the heap  (was: NEW)

> NEW_VERSION_BEHAVIOR blows up the heap
> --
>
> Key: HBASE-21882
> URL: https://issues.apache.org/jira/browse/HBASE-21882
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Andrey Elenskiy
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap

2019-02-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21882:

Component/s: regionserver

> NEW_VERSION_BEHAVIOR blows up the heap
> --
>
> Key: HBASE-21882
> URL: https://issues.apache.org/jira/browse/HBASE-21882
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.1.2
>Reporter: Andrey Elenskiy
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap

2019-02-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21882:

Affects Version/s: 2.1.2

> NEW_VERSION_BEHAVIOR blows up the heap
> --
>
> Key: HBASE-21882
> URL: https://issues.apache.org/jira/browse/HBASE-21882
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2
>Reporter: Andrey Elenskiy
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21882) NEW_VERSION_BEHAVIOR blows up the heap

2019-02-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21882:

Issue Type: Bug  (was: Umbrella)

> NEW_VERSION_BEHAVIOR blows up the heap
> --
>
> Key: HBASE-21882
> URL: https://issues.apache.org/jira/browse/HBASE-21882
> Project: HBase
>  Issue Type: Bug
>Reporter: Andrey Elenskiy
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21882) NEW

2019-02-12 Thread Andrey Elenskiy (JIRA)
Andrey Elenskiy created HBASE-21882:
---

 Summary: NEW
 Key: HBASE-21882
 URL: https://issues.apache.org/jira/browse/HBASE-21882
 Project: HBase
  Issue Type: Umbrella
Reporter: Andrey Elenskiy






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps

2019-01-04 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21476:

Attachment: HBASE-21476.branch-2.1.0004.patch

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> HBASE-21476.branch-2.1.0003.patch, HBASE-21476.branch-2.1.0004.patch, 
> nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps

2019-01-03 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733655#comment-16733655
 ] 

Andrey Elenskiy commented on HBASE-21476:
-

Added "-Dhbase.tests.nanosecond.timestamps" to run the existing tests that use 
HBaseTestingUtility with the NANOSECOND_TIMESTAMPS table attribute. It would be 
great if someone could trigger the build with this flag, since some tests 
(TestClientClusterMetrics and TestNettyIPC) time out on my machine, preventing 
other tests from running.

As for bulk imports, I don't quite know what could be updated, as it's the same 
problem: it's up to the client to be aware of what they are importing and into 
what version of a table. It's the client that specifies the timestamps.

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> HBASE-21476.branch-2.1.0003.patch, nanosecond_timestamps_v1.patch, 
> nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps

2019-01-03 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21476:

Attachment: HBASE-21476.branch-2.1.0003.patch

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> HBASE-21476.branch-2.1.0003.patch, nanosecond_timestamps_v1.patch, 
> nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-27 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729981#comment-16729981
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

Thanks for reviewing and merging!

I've created a couple more follow-up issues:

https://issues.apache.org/jira/browse/HBASE-21654
https://issues.apache.org/jira/browse/HBASE-21653

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: NEW_VERSION_BEHAVIOR
> Fix For: 3.0.0, 2.2.0, 2.1.2, 2.0.4
>
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch, Screen 
> Shot 2018-12-24 at 10.04.57 AM.png
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Created] (HBASE-21654) Sort through tests that fail when NEW_VERSION_BEHAVIOR is enabled

2018-12-27 Thread Andrey Elenskiy (JIRA)
Andrey Elenskiy created HBASE-21654:
---

 Summary: Sort through tests that fail when NEW_VERSION_BEHAVIOR is 
enabled
 Key: HBASE-21654
 URL: https://issues.apache.org/jira/browse/HBASE-21654
 Project: HBase
  Issue Type: Umbrella
  Components: integration tests, regionserver
Affects Versions: 2.0.0
Reporter: Andrey Elenskiy


The "-Dhbase.tests.new.version.behavior=true" flag was added in 
https://issues.apache.org/jira/browse/HBASE-21545, which reruns all the 
integration tests with NEW_VERSION_BEHAVIOR enabled. Some tests failed, either 
because the tests check the old behavior or because of other new bugs in 
NEW_VERSION_BEHAVIOR.

So far the following test suites have failed:
 # TestKeepDeletes
 # TestMinVersions
 # TestExportSnapshot
 # TestSecureExportSnapshot
 # TestSyncTable
 # TestMobSecureExportSnapshot
 # TestThriftHBaseServiceHandler
 # TestThriftServer

Some more discussion at 
https://issues.apache.org/jira/browse/HBASE-21545?focusedCommentId=16719510&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719510





[jira] [Created] (HBASE-21653) NewVersionBehaviorTracker.checkVersions() should allow cell type to be DELETE

2018-12-27 Thread Andrey Elenskiy (JIRA)
Andrey Elenskiy created HBASE-21653:
---

 Summary: NewVersionBehaviorTracker.checkVersions() should allow 
cell type to be DELETE
 Key: HBASE-21653
 URL: https://issues.apache.org/jira/browse/HBASE-21653
 Project: HBase
  Issue Type: Bug
  Components: API
Affects Versions: 2.1.1, 2.0.0
Reporter: Andrey Elenskiy


`MajorCompactionScanQueryMatcher.match()` states that "7. Delete marker need 
to be version counted together with puts they affect", which corresponds to a 
code path that can happen when KEEP_DELETED_CELLS is true. However, 
`NewVersionBehaviorTracker.checkVersions()` asserts that the type cannot be DELETE.

The AssertionError can be verified by running 
`TestKeepDeletes.testBasicScenario` with 
"-Dhbase.tests.new.version.behavior=true" after fixing it up to work as 
expected with NEW_VERSION_BEHAVIOR:
{code}
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestKeepDeletes.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestKeepDeletes.java
@@ -135,7 +135,7 @@ public class TestKeepDeletes {
     g.setMaxVersions();
     g.setTimeRange(0L, ts+2);
     Result r = region.get(g);
-    checkResult(r, c0, c0, T2, T1);
+    checkResult(r, c0, c0, T2);

     // flush
     region.flush(true);
{code}

Some more info in the following comment: 
https://issues.apache.org/jira/browse/HBASE-21545?focusedCommentId=16719510&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719510





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-21 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16727181#comment-16727181
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

Created a new request on Review Board: https://reviews.apache.org/r/69624/

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-12 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719532#comment-16719532
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

The assertion in checkVersions() seems to be another bug in 
NEW_VERSION_BEHAVIOR. Since NEW_VERSION_BEHAVIOR changes how versioning for 
deleted cells is accounted, we should expect deleted cells to be checked. In 
fact, "match()" of MajorCompactionScanQueryMatcher states so: "7. Delete 
marker need to be version counted together with puts they affect". Should I 
open a new bug for this? Also, what do you want to do about the failed tests: 
should a new set of UTs be written for NEW_VERSION_BEHAVIOR where it differs 
from the default behavior?

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-12 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719510#comment-16719510
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

Started debugging with TestKeepDeletes:
TestKeepDeletes.testBasicScenario is testing the old version behavior, as the 
expected results are different before and after flush().
TestKeepDeletes.testWithMinVersions also seems tailored to the old version 
behavior, as the expected result is different after flush(), but we get the 
same one as described by NEW_VERSION_BEHAVIOR.
TestKeepDeletes.testWithTTL again expects a different result after flush(), 
but we get the same one as described by NEW_VERSION_BEHAVIOR.

After fixing these tests to have the same expected result after flush(), the 
following tests are still failing because of an assertion 
"!PrivateCellUtil.isDelete(type)" in NewVersionBehaviorTracker.checkVersions() 
when region.compact(true) is called:
TestKeepDeletes.testBasicScenario:148
TestKeepDeletes.testDeleteMarkerExpiration:506
TestKeepDeletes.testDeleteMarkerVersioning:711
TestKeepDeletes.testWithMinVersions:888
TestKeepDeletes.testWithOldRow:569

with
java.lang.AssertionError
    at 
org.apache.hadoop.hbase.regionserver.querymatcher.NewVersionBehaviorTracker.checkVersions(NewVersionBehaviorTracker.java:305)
    at 
org.apache.hadoop.hbase.regionserver.querymatcher.MajorCompactionScanQueryMatcher.match(MajorCompactionScanQueryMatcher.java:80)
    at 
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:586)
    at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:387)
    at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
    at 
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
    at 
org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
    at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1407)

I don't know if it's a new bug, if it's supposed to behave this way, or if the 
test is wrongly structured. I will try to understand this assertion.
It's not going to be easy to verify all the tests, though; this is something 
that should have been done by the original NEW_VERSION_BEHAVIOR contributors 
before the code was merged. I think the docs in the hbase book should be 
updated with a warning.

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-12 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719330#comment-16719330
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

I also have those failing on my machine.
So the tests could be failing either due to other bugs in NEW_VERSION_BEHAVIOR 
or due to the tests being tailored to the old behavior. I'll start sorting 
through them.

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Work started] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-12 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-21545 started by Andrey Elenskiy.
---
> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-11 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717870#comment-16717870
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

[~jatsakthi] looks like I uploaded the wrong patch for 4; I've fixed the 
compilation error in patch 5.

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-11 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Attachment: HBASE-21545.branch-2.1.0005.patch

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch, HBASE-21545.branch-2.1.0005.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-05 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710686#comment-16710686
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

I've uploaded a patch where I modified HBaseTestingUtility to set the 
NEW_VERSION_BEHAVIOR attribute in integration tests when the 
"-Dhbase.tests.new.version.behavior=true" option is passed. This way we 
validate that all tests pass with this attribute. It would be great if you 
could trigger a build.

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-05 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Attachment: HBASE-21545.branch-2.1.0004.patch

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch, 
> HBASE-21545.branch-2.1.0004.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709497#comment-16709497
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

Ok, I've attached a patch with the fix and a unit test for 
NewVersionBehaviorTracker with columns.
I took the opportunity to refactor the checkColumn function to be a bit easier 
to follow (pretty much the same as ExplicitColumnTracker).

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Attachment: HBASE-21545.branch-2.1.0003.patch

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch, HBASE-21545.branch-2.1.0003.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Affects Version/s: 2.0.0

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.0.0, 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Assigned] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy reassigned HBASE-21545:
---

Assignee: Andrey Elenskiy  (was: Sakthi)

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709411#comment-16709411
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

Got curious to learn and dig through the code. I believe I've found the cause 
of the bug.

In 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/querymatcher/NewVersionBehaviorTracker.java
 it should be:

{code:java}
   public boolean done() {
-    // lastCq* have been updated to this cell.
-    return !(columns == null || lastCqArray == null) && Bytes
-        .compareTo(lastCqArray, lastCqOffset, lastCqLength, columns[columnIndex], 0,
-            columns[columnIndex].length) > 0;
+    return columnIndex >= columns.length;
   }
{code}

The reason it fails is that lastCq* gets updated to the current cell while 
columnIndex hasn't been advanced past the already-included column. Here's an 
example:
Columns A, B and C are in the row.
A Get request specifies columns A and C.

1. {{lastCq*}} gets updated to A
2. checkColumn gets called on A, A gets included, columnIndex is on A
3. {{lastCq*}} gets updated to B
4. checkColumn calls done(), which checks if B > A (columnIndex is on A) and 
returns true
5. we are done with the request
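The five steps above can be replayed with a tiny standalone model; this is a simplified sketch of the tracker's logic (class, field, and method names are made up for illustration, and compare() is a bare-bones stand-in for Bytes.compareTo):

```java
public class DoneSketch {
    static byte[][] columns = { {'A'}, {'C'} }; // requested columns
    static int columnIndex = 0;                 // still pointing at 'A'
    static byte[] lastCq = {'B'};               // scanner has moved on to 'B'

    // Old logic: declares the scan done as soon as the current cell's
    // qualifier sorts after the column at columnIndex -- true here (B > A),
    // so 'C' is never returned.
    static boolean buggyDone() {
        return lastCq != null && compare(lastCq, columns[columnIndex]) > 0;
    }

    // Fixed logic: done only when all requested columns have been consumed.
    static boolean fixedDone() {
        return columnIndex >= columns.length;
    }

    // Minimal lexicographic byte[] comparison.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            if (a[i] != b[i]) return (a[i] & 0xff) - (b[i] & 0xff);
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(buggyDone()); // true  -> scan stops before 'C'
        System.out.println(fixedDone()); // false -> scan continues to 'C'
    }
}
```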

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Sakthi
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709353#comment-16709353
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

Oh, I'm not planning on working on a fix, as I don't have enough knowledge 
about the moving pieces. [~busbey] asked for a reproduction test, so I 
converted it :)

 

On a side note, this seems like a bug that should have been caught by the 
existing tests if they had been converted to use the new version behavior 
attribute. Do you think it would be possible to run all the tests again with 
this option enabled by default and see what other bugs come out?

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Sakthi
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it has 
> to be a server-side issue.





[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps

2018-12-04 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709331#comment-16709331
 ] 

Andrey Elenskiy commented on HBASE-21476:
-

> Is there any way we could make this a hard enforcement? Maybe a new optional 
> way to flag that a client write has been done using nanoseconds?

Yeah, it would be useful, as most developers probably look up code snippets 
online to talk to HBase, and I can see someone ending up providing 
milliseconds even after creating a table with the NANOSECOND_TIMESTAMPS 
attribute. However, I don't think it's possible to do this 100% reliably. One 
way is to assume that a timestamp cannot be smaller than 2,000,000,000 (the 
second second since the epoch in nanos, or roughly the year 2033 in seconds), 
but that seems like a limiting assumption. We could add an option to disable 
this check, but then a client could hit a legitimately small timestamp at 
runtime and break their application with this exception.

> we could proactively reject operations from old clients

Old clients still work with this attribute; users would have to update code 
everywhere to use java.time.Instant anyway, so I don't really see how checking 
the client version would help here.

Another idea would be to disallow altering an existing table to set the 
NANOSECOND_TIMESTAMPS attribute; the table would have to be created with nanos 
from scratch. I assume that when a table is created with this attribute for 
the first time, the users are starting a new application from scratch in a dev 
environment. This would hopefully rule out clients adding the attribute by 
accident to an existing table and corrupting their data.
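As a sketch of the kind of guard discussed here: the method name and the cutoff constant below are assumptions for illustration only, not anything HBase defines; the cutoff merely has to sit between millisecond-scale epoch values (around 10^12 today) and nanosecond-scale ones (around 10^18 today):

```java
public class NanoCheck {
    // Illustrative cutoff between millisecond-scale and nanosecond-scale
    // epoch timestamps; the exact value is a design choice, not HBase API.
    static final long NANO_CUTOFF = 1_000_000_000_000_000L; // 1e15

    // Accept a write timestamp unless the table is flagged for nanosecond
    // timestamps and the value looks like milliseconds.
    static boolean acceptTimestamp(long ts, boolean nanosecondTable) {
        return !nanosecondTable || ts >= NANO_CUTOFF;
    }

    public static void main(String[] args) {
        System.out.println(acceptTimestamp(1_544_000_000_000_000_000L, true)); // nanos: true
        System.out.println(acceptTimestamp(1_544_000_000_000L, true));         // millis: false
        System.out.println(acceptTimestamp(1_544_000_000_000L, false));        // legacy table: true
    }
}
```

The false negative discussed in the comment remains: a client deliberately writing very small nanosecond timestamps would trip this check, which is why it would need to be optional.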

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Attachment: HBASE-21545.branch-2.1.0002.patch

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Sakthi
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it's 
> likely a server-side issue.





[jira] [Commented] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709338#comment-16709338
 ] 

Andrey Elenskiy commented on HBASE-21545:
-

Ignore the first patch; the second patch has a test that reproduces the issue.

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Sakthi
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch, 
> HBASE-21545.branch-2.1.0002.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it's 
> likely a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-04 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Attachment: HBASE-21545.branch-2.1.0001.patch

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java, HBASE-21545.branch-2.1.0001.patch
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it's 
> likely a server-side issue.





[jira] [Created] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-03 Thread Andrey Elenskiy (JIRA)
Andrey Elenskiy created HBASE-21545:
---

 Summary: NEW_VERSION_BEHAVIOR breaks Get/Scan with specified 
columns
 Key: HBASE-21545
 URL: https://issues.apache.org/jira/browse/HBASE-21545
 Project: HBase
  Issue Type: New Feature
  Components: API
Affects Versions: 2.1.1
 Environment: HBase 2.1.1

Hadoop 2.8.4

Java 8
Reporter: Andrey Elenskiy


Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
column to be returned when columns are specified in a Scan or Get query. The 
result is always the first column in sorted order. I've attached a code 
snippet to reproduce the issue that can be converted into a test.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-03 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Tags: get, scan,   (was: get, scan)

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it's 
> likely a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-03 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Tags: get, scan, NEW_VERSION_BEHAVIOR  (was: get, scan, )

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it's 
> likely a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-03 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Issue Type: Bug  (was: New Feature)

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it's 
> likely a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-03 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Description: 
Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
column to be returned when columns are specified in a Scan or Get query. The 
result is always the first column in sorted order. I've attached a code 
snippet to reproduce the issue that can be converted into a test.

I've also validated with the hbase shell and the gohbase client, so it's 
likely a server-side issue.

  was:Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only 
one column to be returned when columns are specified in a Scan or Get query. 
The result is always the first column in sorted order. I've attached a code 
snippet to reproduce the issue that can be converted into a test.


> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: New Feature
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.
> I've also validated with the hbase shell and the gohbase client, so it's 
> likely a server-side issue.





[jira] [Updated] (HBASE-21545) NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns

2018-12-03 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21545:

Attachment: App.java

> NEW_VERSION_BEHAVIOR breaks Get/Scan with specified columns
> ---
>
> Key: HBASE-21545
> URL: https://issues.apache.org/jira/browse/HBASE-21545
> Project: HBase
>  Issue Type: New Feature
>  Components: API
>Affects Versions: 2.1.1
> Environment: HBase 2.1.1
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> Setting NEW_VERSION_BEHAVIOR => 'true' on a column family causes only one 
> column to be returned when columns are specified in a Scan or Get query. The 
> result is always the first column in sorted order. I've attached a code 
> snippet to reproduce the issue that can be converted into a test.





[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps

2018-12-03 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708100#comment-16708100
 ] 

Andrey Elenskiy commented on HBASE-21476:
-

> Please use {{git format-patch}} to create future patches.

(y)

> What happens if a client that doesn't support nanoseconds attempts to write 
> to a table that is configured for nanoseconds?

There's no error of any sort unless 
"hbase.hregion.keyvalue.timestamp.slop.millisecs" is specified; the value will 
simply be stored with the millisecond timestamp. It's up to the client to be 
careful here.
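As an illustration of the point above, here is a minimal sketch (assumed logic, not the actual HBase implementation) of how a slop check catches a nanosecond timestamp written to a millisecond table: interpreted as milliseconds, a nanosecond value looks like a date far in the future and exceeds any reasonable slop.

```java
public class SlopCheck {
    // Sketch of a slop check: a cell timestamp is accepted only if it is
    // no further in the future than now + slop (all values in millis).
    static boolean withinSlop(long cellTsMillis, long nowMillis, long slopMillis) {
        return cellTsMillis <= nowMillis + slopMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long slop = 60_000L; // hypothetical 1-minute slop

        // A genuine millisecond timestamp passes the check.
        System.out.println(withinSlop(now, now, slop));

        // The same instant in nanoseconds, misread as millis, fails:
        // it lands hundreds of thousands of years in the "future".
        System.out.println(withinSlop(now * 1_000_000L, now, slop));
    }
}
```

Without the slop option set, no such check runs, which matches the "no error of any sort" behavior described above.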

> Do we need to account for this table attribute when bulk loading?

Yes, that's a good point. Since bulk import writes HFiles directly, we should 
also account for this table attribute when adding KeyValues/Puts. Will update 
the patch.

> What about snapshots? do they retain information on wether their contents use 
> nanoseconds? Do tables cloned from a snapshot have to have the same 
> nanosecond config as the snapshot?

Yes: as snapshots include table attributes within `data.manifest`, a table 
cloned from a snapshot will also carry the NANOSECOND_TIMESTAMPS attribute.

> I see the WIP patches are starting to address MOB handling, but I don't see 
> it mentioned in the scope document at all.

It's mentioned at the end of the Technical Approach section; it's part of the 
effort to pass the TableDescriptor to compactions.

Will update the doc to clarify these points and also look into the test 
failures.
> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.





[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps

2018-11-30 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21476:

Attachment: nanosecond_timestamps_v2.patch

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> nanosecond_timestamps_v1.patch, nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.





[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps

2018-11-14 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687312#comment-16687312
 ] 

Andrey Elenskiy commented on HBASE-21476:
-

[~busbey] thanks for providing the examples of scope documents; those were 
helpful.

I've attached one to this issue as requested.

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> nanosecond_timestamps_v1.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.





[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps

2018-11-14 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21476:

Attachment: Apache HBase - Nanosecond Timestamps v1.pdf

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> nanosecond_timestamps_v1.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.





[jira] [Created] (HBASE-21476) Support for nanosecond timestamps

2018-11-13 Thread Andrey Elenskiy (JIRA)
Andrey Elenskiy created HBASE-21476:
---

 Summary: Support for nanosecond timestamps
 Key: HBASE-21476
 URL: https://issues.apache.org/jira/browse/HBASE-21476
 Project: HBase
  Issue Type: New Feature
Affects Versions: 2.1.1
Reporter: Andrey Elenskiy
Assignee: Andrey Elenskiy
 Attachments: nanosecond_timestamps_v1.patch

Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
handle timestamps with nanosecond precision. This is useful for applications 
that timestamp updates at the source with nanoseconds and still want features 
like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.

The attribute should be specified either on new tables or on existing tables 
which have timestamps only with nanosecond precision. There's no migration from 
milliseconds to nanoseconds for already existing tables. We could add this 
migration as part of compaction if you think that would be useful, but that 
would obviously make the change more complex.

I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
[java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
 to get time in nanoseconds which means it will only work with Java 8. The idea 
is to gradually replace all places where "EnvironmentEdge.currentTime()" is 
used to have HBase working purely with nanoseconds (which is a prerequisite for 
HBASE-14070). Also, I've refactored ScanInfo and PartitionedMobCompactor to 
expect TableDescriptor as an argument which makes code a little cleaner and 
easier to extend.

Couple more points:
- column family TTL (specified in seconds) and 
"hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options don't 
need to be changed, those are adjusted automatically.
- Per cell TTL needs to be scaled by clients accordingly after 
"NANOSECOND_TIMESTAMPS" table attribute is specified.

Looking for everyone's feedback to know if that's a worthwhile direction. Will 
add more comprehensive tests in a later patch.
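The description above can be sketched in code. The method name currentTimeNano() comes from the patch, but the bodies below are assumptions, not the patch's actual implementation: a nanosecond-precision wall clock built on java.time.Instant (Java 8+), plus the client-side per-cell TTL scaling mentioned in the second bullet.

```java
import java.time.Instant;

public class NanoTime {
    // Sketch of a nanosecond wall clock in the spirit of the proposed
    // EnvironmentEdge.currentTimeNano(): seconds and the nano-of-second
    // component of an Instant combined into nanos since the epoch.
    static long currentTimeNano() {
        Instant now = Instant.now();
        return now.getEpochSecond() * 1_000_000_000L + now.getNano();
    }

    // Per-cell TTLs are set by clients in milliseconds; on a hypothetical
    // NANOSECOND_TIMESTAMPS table the client itself would scale them,
    // as the description notes they are not adjusted automatically.
    static long scaleCellTtlToNanos(long ttlMillis) {
        return ttlMillis * 1_000_000L;
    }

    public static void main(String[] args) {
        System.out.println(currentTimeNano());
        System.out.println(scaleCellTtlToNanos(5_000L)); // 5 s -> 5_000_000_000 ns
    }
}
```

Column family TTL (seconds) and "hbase.hstore.time.to.purge.deletes" (milliseconds) would be scaled the same way, but server-side, which is why those settings need no client changes.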





[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each

2018-08-21 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587958#comment-16587958
 ] 

Andrey Elenskiy commented on HBASE-21032:
-

Even though we don't use read replicas, it seems like it's the same issue. The 
last row loaded from hbase:meta was:

18/08/21 19:46:21 INFO assignment.RegionStateStore: Load hbase:meta entry 
region=2ad4d95f7b9d9ba0d746b8da50a7f9a7, regionState=OPEN, 
lastHost=regionserver-0,16020,1534538897512, 
regionLocation=regionserver-3,16020,1533765879856, openSeqNum=172

Which, based on the encoded region name, is hbase:namespace. However, the 
regionserver is different.

I think I'm going to halt deploying the patched version for now. Instead, I've 
run the App.java that I attached, and it looks like it behaves as expected.

> ScanResponses contain only one cell each
> 
>
> Key: HBASE-21032
> URL: https://issues.apache.org/jira/browse/HBASE-21032
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, Scanners
>Affects Versions: 2.1.0
> Environment: HBase 2.1.0
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.1
>
> Attachments: App.java, HBASE-21032-v1.patch, HBASE-21032-v1.patch, 
> HBASE-21032.patch
>
>
> I have a long row with a bunch of columns that I'm scanning with 
> setAllowPartialResults(true). In the response I'm getting the first partial 
> ScanResponse of around 2MB with multiple cells, while all of the subsequent 
> ones contain one cell per ScanResponse. After digging more, I found that 
> each of those single-cell ScanResponse partials is preceded by a heartbeat 
> (zero cells). This results in two requests per cell to a regionserver.
> I've attached code to reproduce it on hbase version 2.1.0 (it works as 
> expected on 2.0.0 and 2.0.1).
> [^App.java]
> I'm fairly certain it's a server-side issue, as the 
> [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. 
> I have not tried to reproduce this with a multi-row scan.





[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each

2018-08-21 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587865#comment-16587865
 ] 

Andrey Elenskiy commented on HBASE-21032:
-

I'm hitting another issue I've had before with 2.1 when upgrading the dev 
cluster: the HBase master is stuck initializing because it can't reach the 
hbase:namespace region:

18/08/21 18:52:28 INFO client.RpcRetryingCallerImpl: Call exception, tries=32, 
retries=46, started=471534 ms ago, cancelled=false, 
msg=org.apache.hadoop.hbase.NotServingRegionException: 
hbase:namespace,,1508805323559.2ad4d95f7b9d9ba0d746b8da50a7f9a7. is not online 
on regionserver-0,16020,1534877035908
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3287)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3264)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1428)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2443)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
, details=row 'default' on table 'hbase:namespace' at 
region=hbase:namespace,,1508805323559.2ad4d95f7b9d9ba0d746b8da50a7f9a7., 
hostname=regionserver-0,16020,1534538897512, seqNum=172

And there's nothing about hbase:namespace in regionserver-0's logs. I don't 
really know how to work around this.

> ScanResponses contain only one cell each
> 
>
> Key: HBASE-21032
> URL: https://issues.apache.org/jira/browse/HBASE-21032
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, Scanners
>Affects Versions: 2.1.0
> Environment: HBase 2.1.0
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.1
>
> Attachments: App.java, HBASE-21032-v1.patch, HBASE-21032-v1.patch, 
> HBASE-21032.patch
>
>
> I have a long row with a bunch of columns that I'm scanning with 
> setAllowPartialResults(true). In the response I'm getting the first partial 
> ScanResponse of around 2MB with multiple cells, while all of the subsequent 
> ones contain one cell per ScanResponse. After digging more, I found that 
> each of those single-cell ScanResponse partials is preceded by a heartbeat 
> (zero cells). This results in two requests per cell to a regionserver.
> I've attached code to reproduce it on hbase version 2.1.0 (it works as 
> expected on 2.0.0 and 2.0.1).
> [^App.java]
> I'm fairly certain it's a server-side issue, as the 
> [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. 
> I have not tried to reproduce this with a multi-row scan.





[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each

2018-08-21 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587732#comment-16587732
 ] 

Andrey Elenskiy commented on HBASE-21032:
-

Sure, giving it a try by applying the patch to 
5a40eae63e290c8a12b1e7d4dd01fc98ba09573d on branch-2.1 and deploying it onto 
our dev cluster.

> ScanResponses contain only one cell each
> 
>
> Key: HBASE-21032
> URL: https://issues.apache.org/jira/browse/HBASE-21032
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, Scanners
>Affects Versions: 2.1.0
> Environment: HBase 2.1.0
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.1
>
> Attachments: App.java, HBASE-21032-v1.patch, HBASE-21032-v1.patch, 
> HBASE-21032.patch
>
>
> I have a long row with many columns that I'm scanning with 
> setAllowPartialResults(true). In the response, the first partial 
> ScanResponse is around 2MB with multiple cells, while every subsequent one 
> contains a single cell. After digging further, I found that each of those 
> single-cell ScanResponse partials is preceded by a heartbeat (zero cells). 
> This results in two requests per cell to a regionserver.
> I've attached code to reproduce it on HBase 2.1.0 (it works as expected on 
> 2.0.0 and 2.0.1).
> [^App.java]
> I'm fairly certain it's a server-side issue, as the 
> [gohbase|https://github.com/tsuna/gohbase] client has the same issue. I 
> have not tried to reproduce this with a multi-row scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21032) ScanResponses contain only one cell each

2018-08-10 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576525#comment-16576525
 ] 

Andrey Elenskiy commented on HBASE-21032:
-

Yes, it should always try to fit maxResultSize worth of cells into a partial 
ScanResponse. You can see the difference by running the code I provided: HBase 
2.0.0 returns only ~2-3 ScanResponses, while 2.1.0 returns ~260 (2x that if 
you account for heartbeats).
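The intended packing behavior can be sketched with a toy simulation (illustrative only: the `chunk` helper and the cell sizes are hypothetical, not HBase code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how a scanner should pack cells into partial ScanResponses:
// keep adding cells to the current response until maxResultSize is reached,
// rather than emitting one cell per response.
public class PartialScanChunking {

    // Greedily pack cell sizes (bytes) into chunks of at most maxResultSize.
    static List<List<Integer>> chunk(List<Integer> cellSizes, int maxResultSize) {
        List<List<Integer>> responses = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int currentBytes = 0;
        for (int size : cellSizes) {
            if (!current.isEmpty() && currentBytes + size > maxResultSize) {
                responses.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            responses.add(current);
        }
        return responses;
    }

    public static void main(String[] args) {
        // 600 cells of ~10 KB each against a 2 MB max result size should yield
        // a handful of ScanResponses, not ~600 single-cell ones.
        List<Integer> cells = new ArrayList<>();
        for (int i = 0; i < 600; i++) {
            cells.add(10 * 1024);
        }
        System.out.println(chunk(cells, 2 * 1024 * 1024).size()); // prints 3
    }
}
```

With the buggy behavior described in this issue, the equivalent count would be close to the number of cells rather than the number of size-limited chunks.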

> ScanResponses contain only one cell each
> 
>
> Key: HBASE-21032
> URL: https://issues.apache.org/jira/browse/HBASE-21032
> Project: HBase
>  Issue Type: Bug
>  Components: Scanners
>Affects Versions: 2.1.0
> Environment: HBase 2.1.0
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> I have a long row with many columns that I'm scanning with 
> setAllowPartialResults(true). In the response, the first partial 
> ScanResponse is around 2MB with multiple cells, while every subsequent one 
> contains a single cell. After digging further, I found that each of those 
> single-cell ScanResponse partials is preceded by a heartbeat (zero cells). 
> This results in two requests per cell to a regionserver.
> I've attached code to reproduce it on HBase 2.1.0 (it works as expected on 
> 2.0.0 and 2.0.1).
> [^App.java]
> I'm fairly certain it's a server-side issue, as the 
> [gohbase|https://github.com/tsuna/gohbase] client has the same issue. I 
> have not tried to reproduce this with a multi-row scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21032) ScanResponses contain only one cell each

2018-08-09 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21032:

Description: 
I have a long row with many columns that I'm scanning with 
setAllowPartialResults(true). In the response, the first partial ScanResponse 
is around 2MB with multiple cells, while every subsequent one contains a 
single cell. After digging further, I found that each of those single-cell 
ScanResponse partials is preceded by a heartbeat (zero cells). This results 
in two requests per cell to a regionserver.

I've attached code to reproduce it on HBase 2.1.0 (it works as expected on 
2.0.0 and 2.0.1).

[^App.java]

I'm fairly certain it's a server-side issue, as the 
[gohbase|https://github.com/tsuna/gohbase] client has the same issue. I have 
not tried to reproduce this with a multi-row scan.

  was:
I have a long row with a bunch of columns that I'm scanning with 
setAllowPartialResults(true). In the response I'm getting the first partial 
being around 2MB while all of the consequent ones being 1 column per partial. 
After digging more, I found that each of those single column partials are 
preceded by a heartbeat response (zero cells). This results in two requests 
per column to a regionserver.

I've attached code to reproduce it on hbase version 2.1.0 (it works as expected 
on 2.0.0 and 2.0.1).

[^App.java]

I'm fairly certain it's a serverside issue as 
[gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I 
have not tried to reproduce this with multi-row scan.


> ScanResponses contain only one cell each
> 
>
> Key: HBASE-21032
> URL: https://issues.apache.org/jira/browse/HBASE-21032
> Project: HBase
>  Issue Type: Bug
>  Components: Scanners
>Affects Versions: 2.1.0
> Environment: HBase 2.1.0
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> I have a long row with many columns that I'm scanning with 
> setAllowPartialResults(true). In the response, the first partial 
> ScanResponse is around 2MB with multiple cells, while every subsequent one 
> contains a single cell. After digging further, I found that each of those 
> single-cell ScanResponse partials is preceded by a heartbeat (zero cells). 
> This results in two requests per cell to a regionserver.
> I've attached code to reproduce it on HBase 2.1.0 (it works as expected on 
> 2.0.0 and 2.0.1).
> [^App.java]
> I'm fairly certain it's a server-side issue, as the 
> [gohbase|https://github.com/tsuna/gohbase] client has the same issue. I 
> have not tried to reproduce this with a multi-row scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21032) ScanResponses contain only one cell each

2018-08-09 Thread Andrey Elenskiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-21032:

Summary: ScanResponses contain only one cell each  (was: ScanResponse 
returns a partial result per cell)

> ScanResponses contain only one cell each
> 
>
> Key: HBASE-21032
> URL: https://issues.apache.org/jira/browse/HBASE-21032
> Project: HBase
>  Issue Type: Bug
>  Components: Scanners
>Affects Versions: 2.1.0
> Environment: HBase 2.1.0
> Hadoop 2.8.4
> Java 8
>Reporter: Andrey Elenskiy
>Priority: Major
> Attachments: App.java
>
>
> I have a long row with a bunch of columns that I'm scanning with 
> setAllowPartialResults(true). In the response I'm getting the first partial 
> being around 2MB while all of the consequent ones being 1 column per partial. 
> After digging more, I found that each of those single column partials are 
> preceded by a heartbeat response (zero cells). This results in two requests 
> per column to a regionserver.
> I've attached code to reproduce it on hbase version 2.1.0 (it works as 
> expected on 2.0.0 and 2.0.1).
> [^App.java]
> I'm fairly certain it's a serverside issue as 
> [gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I 
> have not tried to reproduce this with multi-row scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21032) ScanResponse returns a partial result per cell

2018-08-09 Thread Andrey Elenskiy (JIRA)
Andrey Elenskiy created HBASE-21032:
---

 Summary: ScanResponse returns a partial result per cell
 Key: HBASE-21032
 URL: https://issues.apache.org/jira/browse/HBASE-21032
 Project: HBase
  Issue Type: Bug
  Components: Scanners
Affects Versions: 2.1.0
 Environment: HBase 2.1.0

Hadoop 2.8.4

Java 8
Reporter: Andrey Elenskiy
 Attachments: App.java

I have a long row with a bunch of columns that I'm scanning with 
setAllowPartialResults(true). In the response I'm getting the first partial 
being around 2MB while all of the consequent ones being 1 column per partial. 
After digging more, I found that each of those single column partials are 
preceded by a heartbeat response (zero cells). This results in two requests 
per column to a regionserver.

I've attached code to reproduce it on hbase version 2.1.0 (it works as expected 
on 2.0.0 and 2.0.1).

[^App.java]

I'm fairly certain it's a serverside issue as 
[gohbase|https://github.com/tsuna/gohbase] client is having the same issue. I 
have not tried to reproduce this with multi-row scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-16110) AsyncFS WAL doesn't work with Hadoop 2.8+

2018-07-02 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530323#comment-16530323
 ] 

Andrey Elenskiy edited comment on HBASE-16110 at 7/2/18 7:14 PM:
-

Hello, we are running HBase 2.0.1 with the official Hadoop 2.8.4 jars and the 
Hadoop 2.8.4 client 
([http://central.maven.org/maven2/org/apache/hadoop/hadoop-client/2.8.4/]). We 
got the following exception on a regionserver, which brings it down:


{{ 18/07/02 18:51:06 WARN concurrent.DefaultPromise: An exception was thrown by 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete()}}
{{ java.lang.Error: Couldn't properly initialize access to HDFS internals. 
Please update your WAL Provider to not make use of the 'asyncfs' provider. See 
HBASE-16110 for more information.}}
{{     at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:268)}}
{{     at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.initialize(FanOutOneBlockAsyncDFSOutputHelper.java:661)}}
{{     at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.access$300(FanOutOneBlockAsyncDFSOutputHelper.java:118)}}
{{     at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:720)}}
{{     at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:715)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:638)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:676)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:552)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)}}
{{     at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)}}
{{     at java.lang.Thread.run(Thread.java:748)}}
{{ Caused by: java.lang.NoSuchMethodException: 
org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(org.apache.hadoop.fs.FileEncryptionInfo)}}
{{     at java.lang.Class.getDeclaredMethod(Class.java:2130)}}
{{     at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.createTransparentCryptoHelper(FanOutOneBlockAsyncDFSOutputSaslHelper.java:232)}}
{{     at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:262)}}
{{     ... 18 more}}

 

FYI, we don't have encryption enabled. Let me know if you need more info about 
our setup.


was (Author: timoha):
Hello, we are running HBase 2.0.1 with official Hadoop 2.8.4 jars and hadoop 
2.8.4 client 
(http://central.maven.org/maven2/org/apache/hadoop/hadoop-client/2.8.4/). Got 
the following exception on regionserver which brings it down:
```
18/07/02 18:51:06 WARN concurrent.DefaultPromise: An exception was thrown by 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete()
java.lang.Error: Couldn't properly initialize access to HDFS internals. Please 
update your WAL Provider to not make use of the 'asyncfs' provider. See 
HBASE-16110 for more information.
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:268)
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.initialize(FanOutOneBlockAsyncDFSOutputHelper.java:661)
    at 

[jira] [Commented] (HBASE-16110) AsyncFS WAL doesn't work with Hadoop 2.8+

2018-07-02 Thread Andrey Elenskiy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530323#comment-16530323
 ] 

Andrey Elenskiy commented on HBASE-16110:
-

Hello, we are running HBase 2.0.1 with official Hadoop 2.8.4 jars and hadoop 
2.8.4 client 
(http://central.maven.org/maven2/org/apache/hadoop/hadoop-client/2.8.4/). Got 
the following exception on regionserver which brings it down:
```
18/07/02 18:51:06 WARN concurrent.DefaultPromise: An exception was thrown by 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete()
java.lang.Error: Couldn't properly initialize access to HDFS internals. Please 
update your WAL Provider to not make use of the 'asyncfs' provider. See 
HBASE-16110 for more information.
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:268)
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.initialize(FanOutOneBlockAsyncDFSOutputHelper.java:661)
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.access$300(FanOutOneBlockAsyncDFSOutputHelper.java:118)
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:720)
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:715)
    at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
    at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
    at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
    at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
    at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
    at 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
    at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:638)
    at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:676)
    at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:552)
    at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)
    at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
    at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodException: 
org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(org.apache.hadoop.fs.FileEncryptionInfo)
    at java.lang.Class.getDeclaredMethod(Class.java:2130)
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.createTransparentCryptoHelper(FanOutOneBlockAsyncDFSOutputSaslHelper.java:232)
    at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:262)
    ... 18 more
```

FYI, we don't have encryption enabled.
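The failure pattern in the trace is a reflective probe: FanOutOneBlockAsyncDFSOutputSaslHelper looks up DFSClient internals via Class.getDeclaredMethod at class-initialization time, and when the running Hadoop version no longer declares the expected method, the lookup throws NoSuchMethodException. A minimal, self-contained sketch of that probing pattern (the `probe` wrapper is mine, not HBase's actual code):

```java
import java.lang.reflect.Method;

// Sketch of the reflective-probe pattern behind the error above: look up a
// method on a dependency's class at startup and detect, rather than crash on,
// a dependency version that no longer declares it.
public class ReflectionProbe {

    // Returns the Method if the class declares it, or null otherwise.
    static Method probe(Class<?> clazz, String name, Class<?>... paramTypes) {
        try {
            return clazz.getDeclaredMethod(name, paramTypes);
        } catch (NoSuchMethodException e) {
            // HBase's helper instead surfaces this as the java.lang.Error above.
            return null;
        }
    }

    public static void main(String[] args) {
        // java.lang.String obviously lacks the HDFS method name from the trace,
        // so the probe comes back empty.
        System.out.println(probe(String.class,
                "decryptEncryptedDataEncryptionKey") == null); // prints true
    }
}
```

Returning null (and falling back) is one design choice; the asyncfs helper deliberately fails hard instead, since it cannot write WALs safely without the internal hook.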

> AsyncFS WAL doesn't work with Hadoop 2.8+
> -
>
> Key: HBASE-16110
> URL: https://issues.apache.org/jira/browse/HBASE-16110
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 2.0.0
>Reporter: Sean Busbey
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: HBASE-16110-v1.patch, HBASE-16110.patch
>
>
> The async wal implementation doesn't work with Hadoop 2.8+. Fails compilation 
> and will fail running.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-16244) LocalHBaseCluster start timeout should be configurable

2017-08-23 Thread Andrey Elenskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138860#comment-16138860
 ] 

Andrey Elenskiy commented on HBASE-16244:
-

Any reason this fix didn't make it to 1.3.x? We are hitting this in our 
integration tests on 1.3.1 as well.

> LocalHBaseCluster start timeout should be configurable
> --
>
> Key: HBASE-16244
> URL: https://issues.apache.org/jira/browse/HBASE-16244
> Project: HBase
>  Issue Type: Bug
>  Components: hbase
>Affects Versions: 1.0.1.1
>Reporter: Siddharth Wagle
> Fix For: 2.0.0, 1.4.0, 0.98.21
>
> Attachments: HBASE-16244.patch
>
>
> *Scenario*:
> - Ambari metrics service uses HBase in standalone mode
> - On restart of AMS HBase, the Master gives up in 30 seconds due to a 
> hardcoded timeout in JVMClusterUtil
> {noformat}
> 2016-07-18 19:24:44,199 ERROR [main] master.HMasterCommandLine: Master exiting
> java.lang.RuntimeException: Master not active after 30 seconds
> at 
> org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:194)
> at 
> org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:445)
> at 
> org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:227)
> at 
> org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:139)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
> at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2526)
> {noformat}
> - On restart, the current Master waits to become active, and this triggers 
> the timeout; waiting slightly longer avoids the issue.
> - The timeout, it seems, was meant for unit tests.
> The attached patch allows the timeout to be configured via hbase-site and 
> sets it to 5 minutes for clusters started through HMasterCommandLine.
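With the patch applied, the timeout would be raised through hbase-site.xml. A sketch of such an override (the property key below is an assumption and must be verified against the patch actually deployed; the value is milliseconds):

```xml
<!-- Hypothetical override; confirm the exact key in the applied patch. -->
<property>
  <name>hbase.master.start.timeout.localHBaseCluster</name>
  <value>300000</value><!-- 5 minutes -->
</property>
```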



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-18066) Get with closest_row_before on "hbase:meta" can return empty Cell during region merge/split

2017-05-16 Thread Andrey Elenskiy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Elenskiy updated HBASE-18066:

Description: 
During region split/merge there's a brief period of time where doing a "Get" 
with "closest_row_before=true" on "hbase:meta" may return empty 
"GetResponse.result.cell" field even though parent, splitA and splitB regions 
are all in "hbase:meta". Both gohbase (https://github.com/tsuna/gohbase) and 
AsyncHBase (https://github.com/OpenTSDB/asynchbase) interprets this as 
"TableDoesNotExist", which is returned to the client.

Here's a gist that reproduces this problem: 
https://gist.github.com/Timoha/c7a236b768be9220e85e53e1ca53bf96. Note that you 
have to use older HTable client (I used 1.2.4) as current versions ignore 
`Get.setClosestRowBefore(bool)` option.

  was:
During region split/merge there's a brief period of time where doing a "Get" 
with "closest_row_before=true" on "hbase:meta" may return empty 
"GetResponse.result.cell" field even though parent, splitA and splitB regions 
are all in "hbase:meta". Both gohbase (https://github.com/tsuna/gohbase) and 
AsyncHBase (https://github.com/OpenTSDB/asynchbase) interprets this as 
"TableDoesNotExist" which is returned to the client.

Here's a gist that reproduces this problem: 
https://gist.github.com/Timoha/c7a236b768be9220e85e53e1ca53bf96. Note that you 
have to use older HTable client (I used 1.2.4) as current versions ignore 
`Get.setClosestRowBefore(bool)` option.


> Get with closest_row_before on "hbase:meta" can return empty Cell during 
> region merge/split
> ---
>
> Key: HBASE-18066
> URL: https://issues.apache.org/jira/browse/HBASE-18066
> Project: HBase
>  Issue Type: Bug
>  Components: hbase, regionserver
>Affects Versions: 1.3.1
> Environment: Linux (16.04.2), MacOS 10.11.6.
> Standalone and distributed HBase setup.
>Reporter: Andrey Elenskiy
>
> During a region split/merge there's a brief period of time where doing a "Get" 
> with "closest_row_before=true" on "hbase:meta" may return an empty 
> "GetResponse.result.cell" field even though the parent, splitA, and splitB 
> regions are all in "hbase:meta". Both gohbase (https://github.com/tsuna/gohbase) 
> and AsyncHBase (https://github.com/OpenTSDB/asynchbase) interpret this as 
> "TableDoesNotExist", which is returned to the client.
> Here's a gist that reproduces this problem: 
> https://gist.github.com/Timoha/c7a236b768be9220e85e53e1ca53bf96. Note that 
> you have to use an older HTable client (I used 1.2.4), as current versions 
> ignore the `Get.setClosestRowBefore(bool)` option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-18066) Get with closest_row_before on "hbase:meta" can return empty Cell during region merge/split

2017-05-16 Thread Andrey Elenskiy (JIRA)
Andrey Elenskiy created HBASE-18066:
---

 Summary: Get with closest_row_before on "hbase:meta" can return 
empty Cell during region merge/split
 Key: HBASE-18066
 URL: https://issues.apache.org/jira/browse/HBASE-18066
 Project: HBase
  Issue Type: Bug
  Components: hbase, regionserver
Affects Versions: 1.3.1
 Environment: Linux (16.04.2), MacOS 10.11.6.
Standalone and distributed HBase setup.
Reporter: Andrey Elenskiy


During region split/merge there's a brief period of time where doing a "Get" 
with "closest_row_before=true" on "hbase:meta" may return empty 
"GetResponse.result.cell" field even though parent, splitA and splitB regions 
are all in "hbase:meta". Both gohbase (https://github.com/tsuna/gohbase) and 
AsyncHBase (https://github.com/OpenTSDB/asynchbase) interpret this as 
"TableDoesNotExist", which is returned to the client.

Here's a gist that reproduces this problem: 
https://gist.github.com/Timoha/c7a236b768be9220e85e53e1ca53bf96. Note that you 
have to use older HTable client (I used 1.2.4) as current versions ignore 
`Get.setClosestRowBefore(bool)` option.
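Conceptually, a closest_row_before Get against hbase:meta is a floor lookup over the sorted region start rows, and during the split/merge window described above that lookup can transiently find nothing. A toy illustration of the lookup semantics (plain JDK code, not the HBase implementation; the map contents are made up):

```java
import java.util.Map;
import java.util.TreeMap;

// "closest_row_before" on hbase:meta is essentially a floor lookup on the
// sorted map of region start rows -> region info: find the greatest start
// row that is <= the requested row.
public class ClosestRowBefore {

    static String regionFor(TreeMap<String, String> metaStartRows, String row) {
        Map.Entry<String, String> e = metaStartRows.floorEntry(row);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> meta = new TreeMap<>();
        meta.put("", "region-1");      // region starting at the empty row
        meta.put("row-m", "region-2"); // region starting at "row-m"
        // A row between the two start keys resolves to the earlier region.
        System.out.println(regionFor(meta, "row-c")); // prints region-1
    }
}
```

The bug reported here is the server-side equivalent of the map momentarily yielding no entry for a row that is in fact covered.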



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)