[jira] [Updated] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21421:
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.0.3
   3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to branch-2.0+. Thanks for reviewing, [~Apache9].

> Do not kill RS if reportOnlineRegions fails
> ---
>
> Key: HBASE-21421
> URL: https://issues.apache.org/jira/browse/HBASE-21421
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 3.0.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21421.branch-2.0.001.patch, 
> HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch, 
> HBASE-21421.branch-2.0.004.patch
>
>
> In the periodic regionServerReport from RS to master, we call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has the 
> same state as the Master. If the RS holds a region which the Master thinks should 
> be on another RS, the Master will kill the RS.
> But the regionServerReport could be lagging (due to the network or something 
> else), so it may not represent the current state of the RegionServer. Besides, 
> when onlining a region we call reportRegionStateTransition and retry forever 
> until it is successfully reported to the master, so we can count on the 
> reportRegionStateTransition calls instead.
> I have encountered cases where regions are closed on the RS and 
> reportRegionStateTransition to the master succeeds. But later, a lagging 
> regionServerReport tells the master the region is online on the RS (which is 
> not true at that moment; the call may have been generated some time ago and 
> delayed by the network somehow), then the master thinks the region should be on 
> another RS and kills the RS, which it should not.
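For illustration only, here is a minimal sketch of the intended behavior; the method and field names below are assumptions, not the actual patch:

{code:java}
// Hypothetical sketch: when the (possibly stale) regionServerReport disagrees
// with the master's view, warn and move on instead of expiring the server.
private void checkOnlineRegionsReport(ServerStateNode serverNode, Set<byte[]> regionNames) {
  ServerName serverName = serverNode.getServerName();
  for (byte[] regionName : regionNames) {
    RegionStateNode regionNode = regionStates.getRegionStateNodeFromName(regionName);
    if (regionNode == null || serverName.equals(regionNode.getRegionLocation())) {
      continue; // report matches the master's view
    }
    // The report may have been generated before the region moved and delayed by
    // the network, so do not kill the RS here; reportRegionStateTransition is
    // retried until it succeeds and is the authoritative signal.
    LOG.warn("Reported online region {} is not on {} according to the master; "
        + "ignoring possibly stale regionServerReport", regionNode, serverName);
  }
}
{code}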



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676234#comment-16676234
 ] 

Allan Yang commented on HBASE-21440:


Yes, I think it is possible. But it should be a very rare case, since if an 
AssignProcedure is in the REGION_TRANSITION_FINISH state, it should have finished 
long ago, before the corresponding SCP begins to handle RIT. But, anyway, it 
can happen, e.g. the meta table is also on the crashed server, so the 
AssignProcedure is stuck there retrying to update meta, and while it is sleeping 
the SCP calls remoteCallFailed on this AssignProcedure.
I think we can do this: if the procedure found in handleRIT is an assign 
procedure and it is in REGION_TRANSITION_FINISH, we should not remove it from 
the regions-to-assign list.
We can count on the AssignProcedure's state, since if it has not entered the 
REGION_TRANSITION_FINISH state by the time of handleRIT, it won't have a chance 
later (the server is already dead).
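A rough sketch of that check (all names here, such as getProcedureFor and interruptViaRemoteCallFailed, are illustrative assumptions, not the actual ServerCrashProcedure code):

{code:java}
Iterator<RegionInfo> it = regionsToAssign.iterator();
while (it.hasNext()) {
  RegionInfo region = it.next();
  RegionTransitionProcedure rtp = getProcedureFor(region);  // in-flight RIT procedure, if any
  if (rtp == null) {
    continue;  // no RIT: the SCP will assign the region itself
  }
  if (rtp instanceof AssignProcedure
      && rtp.getTransitionState() == RegionTransitionState.REGION_TRANSITION_FINISH) {
    // The assign already "finished" against the now-dead server, so it will not
    // run again: keep the region in the SCP's own regions-to-assign list.
    continue;
  }
  // Otherwise rely on the running procedure (existing behavior): interrupt it
  // and drop the region from the SCP's assign list.
  interruptViaRemoteCallFailed(rtp);
  it.remove();
}
{code}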

> Assign procedure on the crashed server is not properly interrupted
> --
>
> Key: HBASE-21440
> URL: https://issues.apache.org/jira/browse/HBASE-21440
> Project: HBase
>  Issue Type: Bug
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Fix For: 2.0.2
>
>
> When a server crashes, its SCP checks if there is already a procedure 
> assigning a region on the crashed server. If it finds one, the SCP will just 
> interrupt the already running AssignProcedure by calling remoteCallFailed, 
> which internally just changes the region node state to OFFLINE and sends the 
> procedure back to the transition queue state for assignment with a new plan. 
> But, due to the race between the call to remoteCallFailed 
> and the current state of the already running assign 
> procedure (REGION_TRANSITION_FINISH: where the region is already opened), it 
> is possible that the assign procedure goes ahead and updates the regionStateNode 
> to OPEN on a crashed server. 
> As the SCP had already skipped this region for assignment, relying on the 
> existing assign procedure to do the right thing, this whole confusion leaves the 
> region in an inaccessible state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21439) StochasticLoadBalancer RegionLoads aren’t being used in RegionLoad cost functions

2018-11-05 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676182#comment-16676182
 ] 

stack commented on HBASE-21439:
---

bq. Will post a patch assuming we want to pursue the original intention...

Makes sense [~benlau]. Thanks.

Do we make this same mistake elsewhere in the codebase?

Thanks.

> StochasticLoadBalancer RegionLoads aren’t being used in RegionLoad cost 
> functions
> -
>
> Key: HBASE-21439
> URL: https://issues.apache.org/jira/browse/HBASE-21439
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer
>Affects Versions: 1.3.2.1, 2.0.2
>Reporter: Ben Lau
>Assignee: Ben Lau
>Priority: Major
>
> In StochasticLoadBalancer.updateRegionLoad() the region loads are being put 
> into the map with Bytes.toString(regionName).
> First, this is a problem because Bytes.toString() assumes that the byte array 
> is a UTF8 encoded String but there is no guarantee that regionName bytes are 
> legal UTF8.
> Secondly, in BaseLoadBalancer.registerRegion, we are reading the region loads 
> out of the load map not using Bytes.toString() but using 
> region.getRegionNameAsString() and region.getEncodedName().  So the load 
> balancer will not see or use any of the cluster's RegionLoad history.
> There are 2 primary ways to solve this issue, assuming we want to stay with 
> String keys for the load map (seems reasonable to aid debugging).  We can 
> either fix updateRegionLoad to store the regionName as a string properly or 
> we can update both the reader & writer to use a new common valid String 
> representation.
> Will post a patch assuming we want to pursue the original intention, i.e. 
> store regionNameAsAString for the loadmap key, but I'm open to fixing this a 
> different way.
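To make the mismatch concrete, here is a small self-contained example; it uses plain-Java stand-ins for Bytes.toString and the load map, and the hex-escaped region name string is only an illustrative rendering:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RegionLoadKeyMismatch {
  public static void main(String[] args) {
    // A region name is an arbitrary byte[]; its bytes need not be valid UTF-8.
    byte[] regionName = new byte[] { 't', ',', (byte) 0x81, (byte) 0xFE, ',', '1' };

    // Writer side (updateRegionLoad): keys the map by decoding the raw bytes as
    // UTF-8, which is effectively what Bytes.toString(regionName) does.
    Map<String, String> loads = new HashMap<>();
    loads.put(new String(regionName, StandardCharsets.UTF_8), "regionload-history");

    // Reader side (registerRegion): looks up with getRegionNameAsString(), which
    // hex-escapes non-printable bytes, so the key is a different String.
    String regionNameAsString = "t,\\x81\\xFE,1";  // illustrative rendering
    System.out.println(loads.get(regionNameAsString));  // prints: null
  }
}
{code}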



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21430) [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo

2018-11-05 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676164#comment-16676164
 ] 

stack commented on HBASE-21430:
---

I put a notice up on the dev list that I will change the gitbox setting so updates 
are instead mailed to issues@ rather than posted as JIRA comments. Here's the INFRA 
ticket: INFRA-16886

> [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo
> -
>
> Key: HBASE-21430
> URL: https://issues.apache.org/jira/browse/HBASE-21430
> Project: HBase
>  Issue Type: Bug
>  Components: hbase-connectors, spark
>Reporter: stack
>Assignee: stack
>Priority: Major
>
> Exploring moving the spark modules out of core hbase and into 
> hbase-connectors. Perhaps spark is deserving of its own repo (I think 
> [~busbey] was on about this) but meantime, experimenting w/ having it out in 
> hbase-connectors.
> Here is thread on spark integration 
> https://lists.apache.org/thread.html/fd74ef9b9da77abf794664f06ea19c839fb3d543647fb29115081683@%3Cdev.hbase.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21441) NPE if RS crashes between REFRESH_PEER_SYNC_REPLICATION_STATE_ON_RS_BEGIN and TRANSIT_PEER_NEW_SYNC_REPLICATION_STATE

2018-11-05 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21441:
--
Description: 
{noformat}
2018-11-06,12:55:25,980 WARN 
[RpcServer.default.FPBQ.Fifo.handler=251,queue=11,port=17100] 
org.apache.hadoop.hbase.master.replication.RefreshPeerProcedure: Refresh peer 
TestPeer for TRANSIT_SYNC_REPLICATION_STATE on 
c4-hadoop-tst-st54.bj,17200,1541479922465 failed
java.lang.NullPointerException via 
c4-hadoop-tst-st54.bj,17200,1541479922465:java.lang.NullPointerException: 
at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:124)
at 
org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2303)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2298)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13149)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
Caused by: java.lang.NullPointerException: 
at 
org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.peerSyncReplicationStateChange(SyncReplicationWALProvider.java:303)
at 
org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.transitSyncReplicationPeerState(PeerProcedureHandlerImpl.java:216)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:74)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:34)
at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:47)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

  was:
2018-11-06,12:55:25,980 WARN 
[RpcServer.default.FPBQ.Fifo.handler=251,queue=11,port=17100] 
org.apache.hadoop.hbase.master.replication.RefreshPeerProcedure: Refresh peer 
TestPeer for TRANSIT_SYNC_REPLICATION_STATE on 
c4-hadoop-tst-st54.bj,17200,1541479922465 failed
java.lang.NullPointerException via 
c4-hadoop-tst-st54.bj,17200,1541479922465:java.lang.NullPointerException: 
at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:124)
at 
org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2303)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2298)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13149)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
Caused by: java.lang.NullPointerException: 
at 
org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.peerSyncReplicationStateChange(SyncReplicationWALProvider.java:303)
at 
org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.transitSyncReplicationPeerState(PeerProcedureHandlerImpl.java:216)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:74)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:34)
at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:47)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


> NPE if RS crashes between REFRESH_PEER_

[jira] [Created] (HBASE-21441) NPE if RS crashes between REFRESH_PEER_SYNC_REPLICATION_STATE_ON_RS_BEGIN and TRANSIT_PEER_NEW_SYNC_REPLICATION_STATE

2018-11-05 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-21441:
-

 Summary: NPE if RS crashes between 
REFRESH_PEER_SYNC_REPLICATION_STATE_ON_RS_BEGIN and 
TRANSIT_PEER_NEW_SYNC_REPLICATION_STATE
 Key: HBASE-21441
 URL: https://issues.apache.org/jira/browse/HBASE-21441
 Project: HBase
  Issue Type: Sub-task
  Components: Replication
Reporter: Duo Zhang
 Fix For: 3.0.0


2018-11-06,12:55:25,980 WARN 
[RpcServer.default.FPBQ.Fifo.handler=251,queue=11,port=17100] 
org.apache.hadoop.hbase.master.replication.RefreshPeerProcedure: Refresh peer 
TestPeer for TRANSIT_SYNC_REPLICATION_STATE on 
c4-hadoop-tst-st54.bj,17200,1541479922465 failed
java.lang.NullPointerException via 
c4-hadoop-tst-st54.bj,17200,1541479922465:java.lang.NullPointerException: 
at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:124)
at 
org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2303)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2298)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13149)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
Caused by: java.lang.NullPointerException: 
at 
org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.peerSyncReplicationStateChange(SyncReplicationWALProvider.java:303)
at 
org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.transitSyncReplicationPeerState(PeerProcedureHandlerImpl.java:216)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:74)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:34)
at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:47)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-05 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676107#comment-16676107
 ] 

Hudson commented on HBASE-21423:


Results for branch branch-2.0
[build #1061 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch
>
>
> We have a higher priority for meta table procedures, but only at the queue level. 
> There is a case where the meta table is closed and an AssignProcedure (or RTSP 
> in branch-2+) is waiting to be executed, but at the same time all the 
> worker threads are executing procedures that need to write to the meta table; 
> then all the workers are stuck retrying the meta write, and no worker will take 
> the AP for meta.
> Though we have a mechanism that detects the stuck state and adds more 
> ''KeepAlive'' workers to the pool to resolve it, by then we have already been 
> stuck for a long time.
> This is a real case I encountered in ITBLL.
> So, I add one 'urgent worker' to the ProcedureExecutor which only takes meta 
> procedures (other workers can take meta procedures too), and this can resolve 
> this kind of stuck state.
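Conceptually, the urgent worker looks like the sketch below; this is a generic, self-contained illustration, not the actual ProcedureExecutor code:

{code:java}
import java.util.concurrent.BlockingQueue;

// One extra worker thread that only ever polls meta procedures, so a meta
// AssignProcedure can always make progress even when every general worker is
// blocked retrying a write to the (currently unavailable) meta table.
final class UrgentMetaWorker extends Thread {
  private final BlockingQueue<Runnable> metaProcedures;

  UrgentMetaWorker(BlockingQueue<Runnable> metaProcedures) {
    super("PEWorker-urgent-meta");
    this.metaProcedures = metaProcedures;
  }

  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        // General workers may also drain this queue; this worker never takes
        // anything else, so it can never be stuck on a non-meta procedure.
        metaProcedures.take().run();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
{code}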



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread liubangchen (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676090#comment-16676090
 ] 

liubangchen commented on HBASE-21381:
-

Not at all. If there is anything I can do, just say. Thanks. 

> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, 
> HBASE-21381-3.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
> {code}
> Backup and Restore uses DistCp to transfer files between clusters.
> Without the fix from HADOOP-15850, the transfer would fail.
> This issue is to document the hadoop versions which contain HADOOP-15850 so 
> that users of the Backup and Restore feature know which hadoop versions they can 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21381:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Thanks for the patch, liubang.

Thanks for the review, Wei-Chiu

> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, 
> HBASE-21381-3.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
> {code}
> Backup and Restore uses DistCp to transfer files between clusters.
> Without the fix from HADOOP-15850, the transfer would fail.
> This issue is to document the hadoop versions which contain HADOOP-15850 so 
> that users of the Backup and Restore feature know which hadoop versions they can 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676067#comment-16676067
 ] 

Wei-Chiu Chuang commented on HBASE-21381:
-

+1 (non-binding) Thanks!

> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, 
> HBASE-21381-3.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
> {code}
> Backup and Restore uses DistCp to transfer files between clusters.
> Without the fix from HADOOP-15850, the transfer would fail.
> This issue is to document the hadoop versions which contain HADOOP-15850 so 
> that users of the Backup and Restore feature know which hadoop versions they can 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Reid Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676063#comment-16676063
 ] 

Reid Chan commented on HBASE-21246:
---

I'm reading the diagrams, thanks for the work!

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation can 
> be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment, or the name of the stream 
> when the WAL is backed by a log stream.
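A minimal sketch of what such an interface could look like (illustrative only; the patches under review may differ):

{code:java}
public interface WALIdentity {
  /**
   * @return a name for this WAL: a file name when the WAL is backed by a
   *         distributed filesystem, or the stream name when it is backed by a
   *         log stream.
   */
  String getName();
}
{code}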



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21255) [acl] Refactor TablePermission into three classes (Global, Namespace, Table)

2018-11-05 Thread Reid Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676057#comment-16676057
 ] 

Reid Chan commented on HBASE-21255:
---

Thanks Sean!

Would you mind taking a look at {{ShadedAccessControlUtil.class}}? I think the 
checkstyle warnings for this class can be ignored and are acceptable, WDYT? 

> [acl] Refactor TablePermission into three classes (Global, Namespace, Table)
> 
>
> Key: HBASE-21255
> URL: https://issues.apache.org/jira/browse/HBASE-21255
> Project: HBase
>  Issue Type: Improvement
>Reporter: Reid Chan
>Assignee: Reid Chan
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21225.master.001.patch, 
> HBASE-21225.master.002.patch, HBASE-21255.master.003.patch, 
> HBASE-21255.master.004.patch, HBASE-21255.master.005.patch, 
> HBASE-21255.master.006.patch
>
>
> A TODO in {{TablePermission.java}}
> {code:java}
>   //TODO refactor this class
>   //we need to refacting this into three classes (Global, Table, Namespace)
> {code}
> Change Notes:
>  * Divide the original TablePermission into three classes: GlobalPermission, 
> NamespacePermission, TablePermission
>  * The new UserPermission consists of a user name and a permission that is one of 
> [Global, Namespace, Table]Permission.
>  * Rename TableAuthManager to AuthManager (it is IA.P), and rename some 
> methods for readability.
>  * Make PermissionCache thread safe, and change the ListMultiMap to a Set.
>  * The user cache and group cache in AuthManager are combined together.
>  * The wire proto is kept, so BC should be guaranteed.
>  * Fix HBASE-21390.
>  * Resolve a small {{TODO}}: the global entry should be handled differently in 
> AccessControlLists
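Roughly, the resulting class shapes could look like the following single-file sketch, based only on the change notes above; the fields are simplified assumptions:

{code:java}
abstract class Permission { /* holds the granted actions: READ, WRITE, ... */ }

class GlobalPermission extends Permission { }

class NamespacePermission extends Permission {
  String namespace;
}

class TablePermission extends Permission {
  String table;        // simplified; the real class would use TableName
  byte[] family;
  byte[] qualifier;
}

// A user name paired with exactly one of the permission types above.
class UserPermission {
  String user;
  Permission permission;
}
{code}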



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21347) Backport HBASE-21200 "Memstore flush doesn't finish because of seekToPreviousRow() in memstore scanner." to branch-1

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676052#comment-16676052
 ] 

Ted Yu commented on HBASE-21347:


lgtm

> Backport HBASE-21200 "Memstore flush doesn't finish because of 
> seekToPreviousRow() in memstore scanner." to branch-1
> 
>
> Key: HBASE-21347
> URL: https://issues.apache.org/jira/browse/HBASE-21347
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, Scanners
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Critical
> Attachments: HBASE-21347.branch-1.001.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676010#comment-16676010
 ] 

Hadoop QA commented on HBASE-21381:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
48s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  4m 
56s{color} | {color:blue} branch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  4m 
55s{color} | {color:blue} patch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
15s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 20m  8s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21381 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12947001/HBASE-21381-3.patch |
| Optional Tests |  dupname  asflicense  refguide  |
| uname | Linux 8d0b2729e890 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 01603278a3 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| refguide | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14960/artifact/patchprocess/branch-site/book.html
 |
| refguide | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14960/artifact/patchprocess/patch-site/book.html
 |
| Max. process+thread count | 97 (vs. ulimit of 1) |
| modules | C: . U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14960/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, 
> HBASE-21381-3.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)

[jira] [Commented] (HBASE-19953) Avoid calling post* hook when procedure fails

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676002#comment-16676002
 ] 

Allan Yang commented on HBASE-19953:


[~elserj], actually postDeleteNamespace still has the race condition you 
mentioned above, since the sync latch is released just after the 
DELETE_NAMESPACE_PREPARE state in DeleteNamespaceProcedure. So by the time 
postDeleteNamespace is called, the namespace may still exist...
But anyway, let me open another issue for it; the thing I want to solve most is 
the RPC timeout when modifying tables. 

> Avoid calling post* hook when procedure fails
> -
>
> Key: HBASE-19953
> URL: https://issues.apache.org/jira/browse/HBASE-19953
> Project: HBase
>  Issue Type: Bug
>  Components: master, proc-v2
>Reporter: Ramesh Mani
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.0-beta-2, 2.0.0
>
> Attachments: HBASE-19952.001.branch-2.patch, 
> HBASE-19953.002.branch-2.patch, HBASE-19953.003.branch-2.patch, 
> HBASE-19953.branch-2.0.addendum.patch
>
>
> Ramesh pointed out a case where I think we're mishandling some post\* 
> MasterObserver hooks. Specifically, I'm looking at the deleteNamespace.
> We synchronously execute the DeleteNamespace procedure. When the user 
> provides a namespace that isn't empty, the procedure does a rollback (which 
> is just a no-op), but this doesn't propagate an exception up to the 
> NonceProcedureRunnable in {{HMaster#deleteNamespace}}. It took Ramesh 
> pointing it out a bit better to me that the code executes a bit differently 
> than we actually expect.
> I think we need to double-check our post hooks and make sure we aren't 
> invoking them when the procedure actually failed. cc/ [~Apache9], [~stack].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21347) Backport HBASE-21200 "Memstore flush doesn't finish because of seekToPreviousRow() in memstore scanner." to branch-1

2018-11-05 Thread Toshihiro Suzuki (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675997#comment-16675997
 ] 

Toshihiro Suzuki commented on HBASE-21347:
--

Could you please review the patch? [~elserj] [~yuzhih...@gmail.com] [~apurtell]

> Backport HBASE-21200 "Memstore flush doesn't finish because of 
> seekToPreviousRow() in memstore scanner." to branch-1
> 
>
> Key: HBASE-21347
> URL: https://issues.apache.org/jira/browse/HBASE-21347
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, Scanners
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Critical
> Attachments: HBASE-21347.branch-1.001.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread liubangchen (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liubangchen updated HBASE-21381:

Attachment: HBASE-21381-3.patch

> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, 
> HBASE-21381-3.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
> {code}
> Backup and Restore uses DistCp to transfer files between clusters.
> Without the fix from HADOOP-15850, the transfer would fail.
> This issue is to document the hadoop versions which contain HADOOP-15850 so 
> that users of the Backup and Restore feature know which hadoop versions they can 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675995#comment-16675995
 ] 

Allan Yang commented on HBASE-21421:


OK, let me fix the checkstyle on commit

> Do not kill RS if reportOnlineRegions fails
> ---
>
> Key: HBASE-21421
> URL: https://issues.apache.org/jira/browse/HBASE-21421
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21421.branch-2.0.001.patch, 
> HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch, 
> HBASE-21421.branch-2.0.004.patch
>
>
> In the periodic regionServerReport from RS to master, we call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has the 
> same state as the Master. If the RS holds a region which the Master thinks should 
> be on another RS, the Master will kill the RS.
> But the regionServerReport could be lagging (due to the network or something 
> else), so it may not represent the current state of the RegionServer. Besides, 
> when onlining a region we call reportRegionStateTransition and retry forever 
> until it is successfully reported to the master, so we can count on the 
> reportRegionStateTransition calls instead.
> I have encountered cases where regions are closed on the RS and 
> reportRegionStateTransition to the master succeeds. But later, a lagging 
> regionServerReport tells the master the region is online on the RS (which is 
> not true at that moment; the call may have been generated some time ago and 
> delayed by the network somehow), then the master thinks the region should be on 
> another RS and kills the RS, which it should not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21314:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Pushed to branch-2.0+.

Thanks [~allan163] for reviewing.

> The implementation of BitSetNode is not efficient
> -
>
> Key: HBASE-21314
> URL: https://issues.apache.org/jira/browse/HBASE-21314
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, 
> HBASE-21314.patch
>
>
> MAX_NODE_SIZE is the same as BITS_PER_WORD, which means that we 
> can only have one word (long) for each BitSetNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible

2018-11-05 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21438:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.1.2
   2.0.3
   2.2.0
   3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to branch-2.0+. Thanks [~tedyu].

> TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
> --
>
> Key: HBASE-21438
> URL: https://issues.apache.org/jira/browse/HBASE-21438
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: 21438.v1.txt
>
>
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/
>  :
> {code}
> Mon Nov 05 04:52:13 UTC 2018, 
> RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class 
> org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and 
> have an empty constructor
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162)
>  at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249)
>  at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21247) Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers

2018-11-05 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675963#comment-16675963
 ] 

Duo Zhang commented on HBASE-21247:
---

If this is a bug then we should push it to all related branches?

> Custom WAL Provider cannot be specified by configuration whose value is 
> outside the enums in Providers
> --
>
> Key: HBASE-21247
> URL: https://issues.apache.org/jira/browse/HBASE-21247
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, 
> 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, 
> 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt
>
>
> Currently all the WAL Providers acceptable to hbase are specified in the 
> Providers enum of WALFactory.
> This restricts the ability to supply additional WAL Providers by 
> class name.
> This issue fixes the bug by allowing a new WAL Provider 
> class name to be specified via the config "hbase.wal.provider".
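One plausible shape of the fix, sketched under the assumption that the enum lookup simply falls back to reflective class loading (not necessarily the committed change; the example class name is hypothetical):

{code:java}
// Hypothetical fallback in WALFactory: if the configured value is not one of the
// built-in Providers enum constants, treat it as a WALProvider class name.
Class<? extends WALProvider> getProviderClass(Configuration conf, String key, String defaultValue) {
  String value = conf.get(key, defaultValue);
  try {
    return Providers.valueOf(value).clazz;  // built-in short name, e.g. "filesystem"
  } catch (IllegalArgumentException notABuiltIn) {
    // Fall back to a fully-qualified class name, e.g. "com.example.MyWALProvider".
    return conf.getClass(key, null, WALProvider.class);
  }
}
{code}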



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: two-pass-cleaner.v6.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt
>
>
> In a recent report from a customer, ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is some 
> in-progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
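One way to read the "two pass" idea suggested by the attached two-pass-cleaner patches is sketched below; the names and structure (snapshotCache.mayReference, getDeletableFiles) are assumptions for illustration only:

{code:java}
// Hypothetical two-pass check: only hand a file to the cleaner if it was also
// unreferenced in the previous round, so a file whose snapshot completed while
// the cache was refreshing gets a second look before deletion.
private Set<String> unreferencedInPreviousRound = new HashSet<>();

List<FileStatus> getDeletableFiles(Iterable<FileStatus> files) {
  List<FileStatus> deletable = new ArrayList<>();
  Set<String> unreferencedNow = new HashSet<>();
  for (FileStatus file : files) {
    String name = file.getPath().getName();
    if (snapshotCache.mayReference(name)) {   // hypothetical cache lookup
      continue;
    }
    unreferencedNow.add(name);
    if (unreferencedInPreviousRound.contains(name)) {
      deletable.add(file);  // unreferenced in two consecutive rounds
    }
  }
  unreferencedInPreviousRound = unreferencedNow;
  return deletable;
}
{code}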



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: (was: two-pass-cleaner.v6.txt)

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt
>
>
> In a recent report from a customer, ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is some 
> in-progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible

2018-11-05 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675949#comment-16675949
 ] 

Duo Zhang commented on HBASE-21438:
---

+1.

> TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
> --
>
> Key: HBASE-21438
> URL: https://issues.apache.org/jira/browse/HBASE-21438
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21438.v1.txt
>
>
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/
>  :
> {code}
> Mon Nov 05 04:52:13 UTC 2018, 
> RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class 
> org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and 
> have an empty constructor
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162)
>  at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249)
>  at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20952) Re-visit the WAL API

2018-11-05 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675944#comment-16675944
 ] 

Hudson commented on HBASE-20952:


Results for branch HBASE-20952
[build #40 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Re-visit the WAL API
> 
>
> Key: HBASE-20952
> URL: https://issues.apache.org/jira/browse/HBASE-20952
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Josh Elser
>Priority: Major
> Attachments: 20952.v1.txt
>
>
> Take a step back from the current WAL implementations and think about what an 
> HBase WAL API should look like. What are the primitive calls that we require 
> to guarantee durability of writes with a high degree of performance?
> The API needs to take the current implementations into consideration. We 
> should also have a mind for what is happening in the Ratis LogService (but 
> the LogService should not dictate what HBase's WAL API looks like RATIS-272).
> Other "systems" inside of HBase that use WALs are replication and 
> backup&restore. Replication has the use-case for "tail"'ing the WAL which we 
> should provide via our new API. B&R doesn't do anything fancy (IIRC). We 
> should make sure all consumers are generally going to be OK with the API we 
> create.
> The API may be "OK" (or OK in part). We need to also consider other methods 
> which were "bolted" on, such as {{AbstractFSWAL}} and 
> {{WALFileLengthProvider}}. Other corners of "WAL use" (like the 
> {{WALSplitter}}) should also be looked at so they use WAL APIs only.
> We also need to make sure that adequate interface audience and stability 
> annotations are chosen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675900#comment-16675900
 ] 

Ted Yu edited comment on HBASE-21387 at 11/6/18 12:47 AM:
--

In two-pass-cleaner.v6.txt, the reference to the previous round is changed to a 
Set.
The length of the file is needed by HFileCleaner.


was (Author: yuzhih...@gmail.com):
In two-pass-cleaner.v5.txt , the reference to previous round is changed to 
Set.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt
>
>
> In a recent report from a customer, ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: two-pass-cleaner.v6.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: (was: two-pass-cleaner.v5.txt)

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted

2018-11-05 Thread Ankit Singhal (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Singhal updated HBASE-21440:
--
Description: 
When the server crashes, it's SCP checks if there is already a procedure 
assigning the region on this crashed server. If we found one, SCP will just 
interrupt the already running AssignProcedure by calling remoteCallFailed which 
internally just changes the region node state to OFFLINE and send the procedure 
back with transition queue state for assignment with a new plan. 

But, due to the race condition between the calling of the remoteCallFailed and 
current state of the already running assign procedure(REGION_TRANSITION_FINISH: 
where the region is already opened), it is possible that assign procedure goes 
ahead in updating the regionStateNode to OPEN on a crashed server. 

As SCP had already skipped this region for assignment as it was relying on 
existing assign procedure to do the right thing, this whole confusion leads 
region to a not accessible state.

  was:
When the server crashes and it's SCP checks if there is already a procedure 
assigning the region on this crashed server. If we found one, SCP will just 
interrupt the already running AssignProcedure by calling remoteCallFailed which 
just changes the region node state to OFFLINE and send the procedure back with 
transition queue state for assignment with a new plan. 

But, due to the race condition between the calling of the remoteCallFailed and 
current state of the already running assign procedure(REGION_TRANSITION_FINISH: 
where the region is already opened), it is possible that assign procedure goes 
ahead in updating the regionStateNode to OPEN on a crashed server. 

As SCP had already skipped this region for assignment as it was relying on 
existing assign procedure to do the right thing, this whole confusion leads 
region to a not accessible state.


> Assign procedure on the crashed server is not properly interrupted
> --
>
> Key: HBASE-21440
> URL: https://issues.apache.org/jira/browse/HBASE-21440
> Project: HBase
>  Issue Type: Bug
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Fix For: 2.0.2
>
>
> When the server crashes, it's SCP checks if there is already a procedure 
> assigning the region on this crashed server. If we found one, SCP will just 
> interrupt the already running AssignProcedure by calling remoteCallFailed 
> which internally just changes the region node state to OFFLINE and send the 
> procedure back with transition queue state for assignment with a new plan. 
> But, due to the race condition between the calling of the remoteCallFailed 
> and current state of the already running assign 
> procedure(REGION_TRANSITION_FINISH: where the region is already opened), it 
> is possible that assign procedure goes ahead in updating the regionStateNode 
> to OPEN on a crashed server. 
> As SCP had already skipped this region for assignment as it was relying on 
> existing assign procedure to do the right thing, this whole confusion leads 
> region to a not accessible state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted

2018-11-05 Thread Ankit Singhal (JIRA)
Ankit Singhal created HBASE-21440:
-

 Summary: Assign procedure on the crashed server is not properly 
interrupted
 Key: HBASE-21440
 URL: https://issues.apache.org/jira/browse/HBASE-21440
 Project: HBase
  Issue Type: Bug
Reporter: Ankit Singhal
Assignee: Ankit Singhal
 Fix For: 2.0.2


When the server crashes and it's SCP checks if there is already a procedure 
assigning the region on this crashed server. If we found one, SCP will just 
interrupt the already running AssignProcedure by calling remoteCallFailed which 
just changes the region node state to OFFLINE and send the procedure back with 
transition queue state for assignment with a new plan. 

But, due to the race condition between the calling of the remoteCallFailed and 
current state of the already running assign procedure(REGION_TRANSITION_FINISH: 
where the region is already opened), it is possible that assign procedure goes 
ahead in updating the regionStateNode to OPEN on a crashed server. 

As SCP had already skipped this region for assignment as it was relying on 
existing assign procedure to do the right thing, this whole confusion leads 
region to a not accessible state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT

2018-11-05 Thread Jingyun Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675912#comment-16675912
 ] 

Jingyun Tian commented on HBASE-21437:
--

[~allan163] Thanks for your comment. My initial idea was to remove the procedure 
from the delay queue and add a bypass check when executing the procedure. 
Your idea avoids modifying the PE code. Let me try this out.
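
To make the idea above concrete, here is a minimal, illustrative sketch of the kind 
of guard being discussed; it is not the actual ProcedureExecutor code, and the class 
and method names are hypothetical. The point is only that a bypassed procedure which 
is resubmitted while still in WAITING_TIMEOUT should be normalized (or skipped) 
before the NOT RUNNABLE precondition fires.
{code}
// Hypothetical sketch, not the real ProcedureExecutor: it normalizes a bypassed
// procedure that is still in WAITING_TIMEOUT before the runnable-state check.
enum ProcState { RUNNABLE, WAITING_TIMEOUT, SUCCESS }

class SketchProcedure {
  private volatile ProcState state = ProcState.RUNNABLE;
  private volatile boolean bypass = false;

  ProcState getState() { return state; }
  void setState(ProcState s) { state = s; }
  boolean isBypass() { return bypass; }
  void bypass() { bypass = true; }
}

class SketchWorker {
  void execute(SketchProcedure proc) {
    if (proc.isBypass() && proc.getState() == ProcState.WAITING_TIMEOUT) {
      // The resubmitted, bypassed procedure still carries its old timeout state;
      // move it back to RUNNABLE instead of letting the precondition below fail.
      proc.setState(ProcState.RUNNABLE);
    }
    if (proc.getState() != ProcState.RUNNABLE) {
      throw new IllegalArgumentException("NOT RUNNABLE! state=" + proc.getState());
    }
    // ... execute one step, then mark the bypassed procedure finished ...
    proc.setState(ProcState.SUCCESS);
  }
}
{code}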

> Bypassed procedure throw IllegalArgumentException when its state is 
> WAITING_TIMEOUT
> ---
>
> Key: HBASE-21437
> URL: https://issues.apache.org/jira/browse/HBASE-21437
> Project: HBase
>  Issue Type: Bug
>Reporter: Jingyun Tian
>Assignee: Jingyun Tian
>Priority: Major
>
> {code}
> 2018-11-05,18:25:52,735 WARN 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating 
> UNNATURALLY null
> java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, 
> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, 
> bypass=true; TransitRegionStateProcedure table=test_fail
> over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN
> at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948)
> 2018-11-05,18:25:52,736 TRACE 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated.
> {code}
> Since when we bypass a WAITING_TIMEOUT procedure and resubmit it, its state 
> is still WAITING_TIMEOUT, so when the executor runs this procedure, it will 
> throw an exception and cause the worker to terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675900#comment-16675900
 ] 

Ted Yu commented on HBASE-21387:


In two-pass-cleaner.v5.txt, the reference to the previous round is changed to a 
Set.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v5.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: two-pass-cleaner.v5.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v5.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21432) [hbase-connectors] Add Apache Yetus integration for hbase-connectors repository

2018-11-05 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675886#comment-16675886
 ] 

Sean Busbey commented on HBASE-21432:
-

How about just PRs? Right now yetus has no notion of multiple repositories 
corresponding to a single JIRA tracker, so if we want patches on JIRA to work 
we'll need to add our own logic for it.

By comparison, a job to periodically check for open PRs and then test them 
should be much simpler.

> [hbase-connectors] Add Apache Yetus integration for hbase-connectors 
> repository 
> 
>
> Key: HBASE-21432
> URL: https://issues.apache.org/jira/browse/HBASE-21432
> Project: HBase
>  Issue Type: Task
>  Components: build, hbase-connectors
>Affects Versions: connector-1.0.0
>Reporter: Peter Somogyi
>Priority: Major
>
> Add automated testing for pull requests and patch files created for 
> hbase-connectors repository. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21439) StochasticLoadBalancer RegionLoads aren’t being used in RegionLoad cost functions

2018-11-05 Thread Ben Lau (JIRA)
Ben Lau created HBASE-21439:
---

 Summary: StochasticLoadBalancer RegionLoads aren’t being used in 
RegionLoad cost functions
 Key: HBASE-21439
 URL: https://issues.apache.org/jira/browse/HBASE-21439
 Project: HBase
  Issue Type: Bug
  Components: Balancer
Affects Versions: 2.0.2, 1.3.2.1
Reporter: Ben Lau
Assignee: Ben Lau


In StochasticLoadBalancer.updateRegionLoad() the region loads are being put 
into the map with Bytes.toString(regionName).

First, this is a problem because Bytes.toString() assumes that the byte array 
is a UTF-8 encoded String, but there is no guarantee that the regionName bytes 
are legal UTF-8.

Secondly, in BaseLoadBalancer.registerRegion, we are reading the region loads 
out of the load map not using Bytes.toString() but using 
region.getRegionNameAsString() and region.getEncodedName().  So the load 
balancer will not see or use any of the cluster's RegionLoad history.

There are 2 primary ways to solve this issue, assuming we want to stay with 
String keys for the load map (seems reasonable to aid debugging).  We can 
either fix updateRegionLoad to store the regionName as a string properly or we 
can update both the reader & writer to use a new common valid String 
representation.

Will post a patch assuming we want to pursue the original intention, i.e. store 
the region name as a String for the load map key, but I'm open to fixing this a 
different way.
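
As a rough illustration of the fix direction described above (not the actual 
balancer code; the container and helper names here are made up), the essential 
point is that the writer and the reader of the load map must agree on one canonical 
String key, e.g. the region name as a String on both sides:
{code}
// Illustrative only; the Deque-based history and helper names are hypothetical.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class RegionLoadHistory {
  private final Map<String, Deque<Long>> loadsByRegion = new HashMap<>();

  // Writer side: key by the canonical region-name String rather than
  // Bytes.toString(regionName), which assumes the raw bytes are valid UTF-8.
  void updateRegionLoad(String regionNameAsString, long requestCount) {
    loadsByRegion.computeIfAbsent(regionNameAsString, k -> new ArrayDeque<>())
        .addLast(requestCount);
  }

  // Reader side: use the same canonical key so the history is actually found.
  Deque<Long> getLoadHistory(String regionNameAsString) {
    return loadsByRegion.getOrDefault(regionNameAsString, new ArrayDeque<>());
  }
}
{code}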




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675871#comment-16675871
 ] 

Josh Elser commented on HBASE-21387:


{quote}
bq.where a snapshot was "orphaned" and prevent file cleaning from happening

I think by "orphaned" you are talking about not just two iterations for cleaner 
chore but many iterations.
In that case, the situation in the current code base would prevent cleaning 
hfiles referenced, as well.
{quote}

Yes, that's the situation I mean. Perhaps we shouldn't be overly concerned 
about this one. I certainly think it is the cleaner approach (and something we 
can more easily reason about).

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675869#comment-16675869
 ] 

Josh Elser commented on HBASE-21246:


[~stack], [~reidchan], see Ted's attached images  
!replication-src-creates-wal-reader.jpg!  !wal-splitter-reader.jpg!  
!wal-splitter-writer.jpg! for ReplicationSource reading, WALSplitter reading, 
and WALSplitter writing, respectively. I'm worried we've swung too far back in 
the other direction (too simple), but I am also at a loss as to what to suggest 
Ted provide.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.
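
For readers following along, here is a minimal sketch of what an interface along 
these lines might look like, based only on the getName method mentioned in the 
description; the FileWALIdentity class and its Path field are hypothetical 
illustrations, and the real interface in the attached patches may differ.
{code}
// Hypothetical sketch; only getName() comes from the description above.
import org.apache.hadoop.fs.Path;

interface WALIdentity {
  /** Filename for a filesystem-backed WAL, or the stream name otherwise. */
  String getName();
}

class FileWALIdentity implements WALIdentity {
  private final Path walPath;

  FileWALIdentity(Path walPath) {
    this.walPath = walPath;
  }

  @Override
  public String getName() {
    return walPath.getName();
  }
}
{code}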



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675867#comment-16675867
 ] 

Ted Yu commented on HBASE-21387:


bq. holding onto file names in memory

We don't need to continue referencing FileStatus from the previous pass. Path 
(or String) for each file would be sufficient.

bq. where a snapshot was "orphaned" and prevent file cleaning from happening

I think by "orphaned" you are talking about not just two iterations for cleaner 
chore but many iterations.
In that case, the situation in the current code base would prevent cleaning 
hfiles referenced, as well.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21436) Getting OOM frequently if hold many regions

2018-11-05 Thread Vladimir Rodionov (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675865#comment-16675865
 ] 

Vladimir Rodionov commented on HBASE-21436:
---

Hi, I do not see anything worth fixing here. First of all, 2MB per region is 
overhead the customer must accept and live with. Second of all, 3K regions per 
RS is too high for a 4GB heap size; this goes against best practices and 
recommendations.






>  Getting OOM frequently if hold many regions
> 
>
> Key: HBASE-21436
> URL: https://issues.apache.org/jira/browse/HBASE-21436
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 1.4.8, 2.1.1, 2.0.2
>Reporter: Zephyr Guo
>Priority: Major
> Attachments: HBASE-21436-UT.patch
>
>
> Recently, some feedback reached me from a customer who complained about 
> NotServingRegionException being thrown at intervals. I examined his cluster and 
> found quite a lot of OOM logs there, but throughput was at a quite low level. 
> In this customer's case, each RS has 3k regions and a heap size of 4G. I dumped 
> the heap when OOM took place, and found that a lot of Chunk objects (as many as 
> 1700) were there.
> Eventually, piecing all this evidence together, I came to the conclusion 
> that:
>  * The root cause is that global flush is triggered by the size of all 
> memstores, rather than the size of all chunks.
>  * A chunk is always allocated for each region, even if we only write a little 
> data to the region.
> And in this case, a total of 3.4G of memory was consumed by 1700 chunks, 
> although throughput was very low.
>  Although 3K regions is too much for an RS with 4G of memory, it is still wise 
> to improve RS stability in such a scenario (in fact, most customers buy a 
> small-size HBase on the cloud side).
>  
> I provide a patch (containing only a UT) to reproduce this case (just send a 
> batch).
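
A back-of-the-envelope sketch of the memory accounting described in the report 
above (not HBase code; it simply multiplies the observed chunk count by a 2 MB 
MSLAB chunk size, assuming the chunk size config is left at its default):
{code}
// Not HBase code: just the arithmetic from the report above.
public class ChunkOverheadEstimate {
  public static void main(String[] args) {
    long chunkSize = 2L * 1024 * 1024;        // assumed default MSLAB chunk size (2 MB)
    int chunkCount = 1700;                    // Chunk objects seen in the heap dump
    long heapSize = 4L * 1024 * 1024 * 1024;  // 4 GB regionserver heap

    long chunkMemory = chunkSize * chunkCount;
    System.out.printf("Chunks hold ~%.1f GB, about %.0f%% of the %d GB heap%n",
        chunkMemory / (1024.0 * 1024 * 1024), 100.0 * chunkMemory / heapSize,
        heapSize >> 30);
  }
}
{code}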



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675857#comment-16675857
 ] 

Josh Elser commented on HBASE-21387:


{quote}
Please take a look at two-pass-cleaner.v4.txt where cleaner chore keeps track 
of the files deemed cleanable from previous iteration.
Only files deemed cleanable from previous and current iterations would be 
deleted.
{quote}

I think that would help, but I'm not sure if that's the best way to go about 
it. We're holding onto file names in memory (which could get big) with your 
two-pass-cleaner.v4 patch. If we made a change to not clean files while there 
is an in-progress snapshot, we would be in trouble if we ever got into a 
situation where a snapshot was "orphaned" and prevented file cleaning from 
happening.

I'm not sure which I think is better (or if there's even still something better 
out there..)

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675805#comment-16675805
 ] 

Ted Yu commented on HBASE-21387:


Solving the in progress snapshot race condition is tricky.

Please take a look at two-pass-cleaner.v4.txt, where the cleaner chore keeps 
track of the files deemed cleanable from the previous iteration.
Only files deemed cleanable in both the previous and current iterations would be 
deleted.

This is a bigger change.
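
A minimal sketch of the two-pass idea, assuming the cleaner chore obtains a set of 
candidate files on each pass; this is not the code in two-pass-cleaner.v4.txt, only 
the core intersection logic it describes: a file is deleted only if it was deemed 
cleanable in both the previous and the current pass, which gives an in-progress 
snapshot one extra cycle to publish its references.
{code}
// Sketch of the two-pass intersection only; not the attached patch.
import java.util.HashSet;
import java.util.Set;

class TwoPassCleaner {
  private Set<String> previousCandidates = new HashSet<>();

  /** Returns the files that are safe to delete in the current pass. */
  Set<String> filesToDelete(Set<String> currentCandidates) {
    Set<String> deletable = new HashSet<>(currentCandidates);
    deletable.retainAll(previousCandidates);          // keep second-time candidates only
    previousCandidates = new HashSet<>(currentCandidates);
    return deletable;
  }
}
{code}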

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: two-pass-cleaner.v4.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Status: Open  (was: Patch Available)

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675799#comment-16675799
 ] 

Josh Elser commented on HBASE-21387:


Ok, cool. Thanks for confirming, Ted.

I can appreciate where your fix is coming from, but I'm still skeptical that this 
is a complete fix.

In the current implementation of SnapshotFileCache#getSnapshotsInProgress(..), 
we acquire the lock on the in-progress snapshot before listing the files for 
it. That means the call to getSnapshotsInProgress(..) will block until the 
operation is complete (both for online and offline snapshot generation). So, we 
should never have a case where we read a snapshot's files while it's in the 
process of being written.

However, it does seem like there could be a case where a snapshot we knew to 
be in-progress finishes before the SnapshotFileCleaner "wakes up" (e.g. after 
TakeSnapshotHandler.completeSnapshot(..) is invoked). We may miss this newly 
created snapshot (which was in-progress when we started 
SnapshotFileCache#getUnreferencedFiles but is complete when we finish it).

The bigger problem is that there is still the potential for the 
SnapshotFileCleaner to go to sleep during {{getSnapshotsInProgress}} (in the 
current code or your patch) and miss a newly started snapshot. I do not think 
we can safely identify the files to retain for snapshots without precluding the 
submission of new snapshots. I think your v8 patch improves this situation but 
does not completely solve it.

WDYT?



> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675762#comment-16675762
 ] 

Ted Yu commented on HBASE-21381:


liubang:
2.7.x and 2.8.x should be added as supported Hadoop releases since they were 
not affected by HADOOP-11794.

> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
> {code}
> Backup and Restore uses DistCp to transfer files between clusters.
> Without the fix from HADOOP-15850, the transfer would fail.
> This issue is to document the hadoop versions which contain HADOOP-15850 so 
> that user of Backup and Restore feature knows which hadoop versions they can 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675738#comment-16675738
 ] 

Ted Yu commented on HBASE-21387:


Thanks for giving the timeline, Josh.

The scenario you described is the race condition I am solving with patch v8.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675711#comment-16675711
 ] 

Josh Elser commented on HBASE-21387:


Looking through v8... I started off still struggling to understand the race 
condition, but I think I see it now.

At time T0, we are checking if F1 is referenced. At time T1, there is a 
snapshot S1 in progress that is referencing a file F1. refreshCache() is 
called, but no completed snapshot references F1. At T2, the snapshot S1, which 
references F1, completes. At T3, we check in-progress snapshots and S1 is not 
included. Thus, F1 is marked as unreferenced even though S1 references it. This 
is what you are saying is the issue, Ted?

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible

2018-11-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675645#comment-16675645
 ] 

Hadoop QA commented on HBASE-21438:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue}  0m  
2s{color} | {color:blue} The patch file was not named according to hbase's 
naming conventions. Please see 
https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for 
instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange}  
0m  0s{color} | {color:orange} The patch doesn't appear to include any new or 
modified tests. Please justify why no new tests are needed for this patch. Also 
please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
42s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 4s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 1s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 49s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
20s{color} | {color:green} hbase-procedure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 34m 10s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21438 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946949/21438.v1.txt |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 090dcd9ba515 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 7395ffac44 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |

[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675576#comment-16675576
 ] 

Wei-Chiu Chuang commented on HBASE-21381:
-

Hadoop 2.7.x and 2.8.x should also work with B&R, I suppose?

> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
> {code}
> Backup and Restore uses DistCp to transfer files between clusters.
> Without the fix from HADOOP-15850, the transfer would fail.
> This issue is to document the hadoop versions which contain HADOOP-15850 so 
> that users of the Backup and Restore feature know which hadoop versions they 
> can use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21438:
---
Attachment: 21438.v1.txt

> TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
> --
>
> Key: HBASE-21438
> URL: https://issues.apache.org/jira/browse/HBASE-21438
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21438.v1.txt
>
>
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/
>  :
> {code}
> Mon Nov 05 04:52:13 UTC 2018, 
> RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class 
> org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and 
> have an empty constructor
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162)
>  at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249)
>  at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21438:
---
Status: Patch Available  (was: Open)

> TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
> --
>
> Key: HBASE-21438
> URL: https://issues.apache.org/jira/browse/HBASE-21438
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21438.v1.txt
>
>
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/
>  :
> {code}
> Mon Nov 05 04:52:13 UTC 2018, 
> RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class 
> org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and 
> have an empty constructor
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162)
>  at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249)
>  at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675573#comment-16675573
 ] 

Ted Yu commented on HBASE-21438:


Ran TestAdmin2 with the patch; it passed.

> TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
> --
>
> Key: HBASE-21438
> URL: https://issues.apache.org/jira/browse/HBASE-21438
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21438.v1.txt
>
>
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/
>  :
> {code}
> Mon Nov 05 04:52:13 UTC 2018, 
> RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: 
> org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class 
> org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and 
> have an empty constructor
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162)
>  at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249)
>  at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible

2018-11-05 Thread Ted Yu (JIRA)
Ted Yu created HBASE-21438:
--

 Summary: TestAdmin2#testGetProcedures fails due to FailedProcedure 
inaccessible
 Key: HBASE-21438
 URL: https://issues.apache.org/jira/browse/HBASE-21438
 Project: HBase
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Ted Yu


From 
https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/
 :
{code}
Mon Nov 05 04:52:13 UTC 2018, RpcRetryingCaller{globalStartTime=1541393533029, 
pause=250, maxAttempts=7}, 
org.apache.hadoop.hbase.procedure2.BadProcedureException: 
org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class 
org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and have 
an empty constructor
 at 
org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162)
 at 
org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249)
 at 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675530#comment-16675530
 ] 

Ted Yu commented on HBASE-21381:


lgtm

> Document the hadoop versions using which backup and restore feature works
> -
>
> Key: HBASE-21381
> URL: https://issues.apache.org/jira/browse/HBASE-21381
> Project: HBase
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: liubangchen
>Priority: Major
> Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch
>
>
> HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally 
> tried to concatenate the files being DistCp'ed to target cluster (though the 
> files are independent).
> Following is the log snippet of the failed concatenation attempt:
> {code}
> 2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): 
> job_local1795473782_0004
> java.io.IOException: Inconsistent sequence file: current chunk file 
> org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/
>
> 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
>  length = 5100 aclEntries  = null, xAttrs = null} doesnt match prior entry 
> org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-
>
> 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
>  length = 5142 aclEntries = null, xAttrs = null}
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
>   at 
> org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
> {code}
> Backup and Restore uses DistCp to transfer files between clusters.
> Without the fix from HADOOP-15850, the transfer would fail.
> This issue is to document the hadoop versions which contain HADOOP-15850 so 
> that users of the Backup and Restore feature know which hadoop versions they 
> can use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675528#comment-16675528
 ] 

Ted Yu edited comment on HBASE-21246 at 11/5/18 6:20 PM:
-

In patch v24, I dropped the static methods from WALFactory - they were used 
only in test code.

I also removed the reference to AbstractFSWALProvider in WALFactory, since the 
Reader creation is done by the provider.


was (Author: yuzhih...@gmail.com):
In patch v24, I dropped the static methods from WALFactory - they were used 
only in test code.

I also removed the reference to AbstractFSWALProvider, since the Reader 
creation is done by the provider.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment, or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675528#comment-16675528
 ] 

Ted Yu commented on HBASE-21246:


In patch v24, I dropped the static methods from WALFactory - they were used 
only in test code.

I also removed the reference to AbstractFSWALProvider, since the Reader 
creation is done by the provider.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment, or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: 21246.24.txt

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment, or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19953) Avoid calling post* hook when procedure fails

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675476#comment-16675476
 ] 

Josh Elser commented on HBASE-19953:


[~allan163], I understand the RPC timeout issue that you bring up, but I think 
the right solution is to fix how we execute postHooks.

If we applied your patch now, it would introduce a regression on the original 
issue for which I created this change. Specifically, that issue is that when a 2.x 
client (which has procv2 support) submits an operation that doesn't use a 
blocking lock, there is a race condition between the execution of the postHook 
and the procedure for the operation itself.

Specifically, the original problem was that an Apache Ranger-owned 
postDeleteNamespace hook was expecting that the HBase namespace would not exist 
when it was invoked. In reality, the namespace still existed because the 
DeleteNamespaceProcedure had not yet run when we used the non-blocking latch. 
Your patch would reintroduce this problem directly.

So, there are two goals: one (that you raised) is that we want clients to not 
have to hold open an RPC for a procedure to complete. The second (that I raised 
here) is that we must not call a post\* hook before the actual operation is 
completed.

I think the way to solve both of these is to push execution of the post\* hook 
into the procedure itself. That way, we do not execute the hook too "soon" and 
the client can just poll the procedure, waiting for completion (as we want).
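
A minimal sketch of that shape (invented names only, not the real master or 
procedure classes): the post\* hook becomes the final state of the procedure's 
own state machine, so it cannot fire before the operation has actually 
completed, and the client simply polls the procedure id.

{code:java}
// Sketch only: Env, the state names and the hook method are stand-ins, not
// the real HMaster/MasterProcedureEnv classes.
class DeleteNamespaceProcedureSketch {

  interface Env {
    void deleteNamespaceFromMeta(String namespace);
    void deleteNamespaceDirectories(String namespace);
    void postDeleteNamespaceHook(String namespace); // stands in for the MasterObserver hook
  }

  enum State { DELETE_META, DELETE_DIRS, RUN_POST_HOOK, DONE }

  private final String namespace;
  private State state = State.DELETE_META;

  DeleteNamespaceProcedureSketch(String namespace) {
    this.namespace = namespace;
  }

  /** One call per executor step; returns true once the procedure is finished. */
  boolean executeStep(Env env) {
    switch (state) {
      case DELETE_META:
        env.deleteNamespaceFromMeta(namespace);
        state = State.DELETE_DIRS;
        return false;
      case DELETE_DIRS:
        env.deleteNamespaceDirectories(namespace);
        state = State.RUN_POST_HOOK;
        return false;
      case RUN_POST_HOOK:
        // The observer hook fires only after the real work has succeeded.
        env.postDeleteNamespaceHook(namespace);
        state = State.DONE;
        return true;
      default:
        return true;
    }
  }
}
{code}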

> Avoid calling post* hook when procedure fails
> -
>
> Key: HBASE-19953
> URL: https://issues.apache.org/jira/browse/HBASE-19953
> Project: HBase
>  Issue Type: Bug
>  Components: master, proc-v2
>Reporter: Ramesh Mani
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.0-beta-2, 2.0.0
>
> Attachments: HBASE-19952.001.branch-2.patch, 
> HBASE-19953.002.branch-2.patch, HBASE-19953.003.branch-2.patch, 
> HBASE-19953.branch-2.0.addendum.patch
>
>
> Ramesh pointed out a case where I think we're mishandling some post\* 
> MasterObserver hooks. Specifically, I'm looking at the deleteNamespace.
> We synchronously execute the DeleteNamespace procedure. When the user 
> provides a namespace that isn't empty, the procedure does a rollback (which 
> is just a no-op), but this doesn't propagate an exception up to the 
> NonceProcedureRunnable in {{HMaster#deleteNamespace}}. It took Ramesh 
> spelling it out for me to see that the code executes a bit differently than 
> we actually expect.
> I think we need to double-check our post hooks and make sure we aren't 
> invoking them when the procedure actually failed. cc/ [~Apache9], [~stack].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21255) [acl] Refactor TablePermission into three classes (Global, Namespace, Table)

2018-11-05 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675484#comment-16675484
 ] 

Sean Busbey commented on HBASE-21255:
-

the precommit failures were probably the openjdk / surefire conflict that got 
handled in HBASE-21417 (and it looks like they're clear now).

please fix checkstyle.

> [acl] Refactor TablePermission into three classes (Global, Namespace, Table)
> 
>
> Key: HBASE-21255
> URL: https://issues.apache.org/jira/browse/HBASE-21255
> Project: HBase
>  Issue Type: Improvement
>Reporter: Reid Chan
>Assignee: Reid Chan
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21225.master.001.patch, 
> HBASE-21225.master.002.patch, HBASE-21255.master.003.patch, 
> HBASE-21255.master.004.patch, HBASE-21255.master.005.patch, 
> HBASE-21255.master.006.patch
>
>
> A TODO in {{TablePermission.java}}
> {code:java}
>   //TODO refactor this class
>   //we need to refacting this into three classes (Global, Table, Namespace)
> {code}
> Change Notes:
>  * Divide the original TablePermission into three classes: GlobalPermission, 
> NamespacePermission, TablePermission.
>  * New UserPermission consists of a user name and a permission in one of 
> [Global, Namespace, Table]Permission.
>  * Rename TableAuthManager to AuthManager (it is IA.P), and rename some 
> methods for readability.
>  * Make PermissionCache thread safe, and change the ListMultiMap to a Set.
>  * User cache and group cache in AuthManager are combined together.
>  * Wire proto is kept, so BC should be guaranteed.
>  * Fix HBASE-21390.
>  * Resolve a small {{TODO}}: global entries should be handled differently in 
> AccessControlLists.
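
A rough sketch of the split described in the change notes above (illustrative 
names and fields only, not the real classes in 
org.apache.hadoop.hbase.security.access):

{code:java}
import java.util.EnumSet;

// Base carries the granted actions; each subclass adds its scope.
abstract class PermissionSketch {
  enum Action { READ, WRITE, EXEC, CREATE, ADMIN }

  protected final EnumSet<Action> actions;

  protected PermissionSketch(EnumSet<Action> actions) {
    this.actions = actions;
  }
}

// Global scope: actions only.
class GlobalPermissionSketch extends PermissionSketch {
  GlobalPermissionSketch(EnumSet<Action> actions) {
    super(actions);
  }
}

// Namespace scope: namespace name plus actions.
class NamespacePermissionSketch extends PermissionSketch {
  private final String namespace;

  NamespacePermissionSketch(String namespace, EnumSet<Action> actions) {
    super(actions);
    this.namespace = namespace;
  }
}

// Table scope: table plus optional family/qualifier, plus actions.
class TablePermissionSketch extends PermissionSketch {
  private final String table;      // stand-in for TableName
  private final byte[] family;
  private final byte[] qualifier;

  TablePermissionSketch(String table, byte[] family, byte[] qualifier,
      EnumSet<Action> actions) {
    super(actions);
    this.table = table;
    this.family = family;
    this.qualifier = qualifier;
  }
}

// UserPermission pairs a user name with exactly one scoped permission.
class UserPermissionSketch {
  private final String user;
  private final PermissionSketch permission;

  UserPermissionSketch(String user, PermissionSketch permission) {
    this.user = user;
    this.permission = permission;
  }
}
{code}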



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675458#comment-16675458
 ] 

Ted Yu commented on HBASE-21246:


replication-src-creates-wal-reader.jpg shows how the WAL Reader is created for 
the replication source:

ReplicationSource calls walProvider#getWalStream, which returns a WALEntryStream.
AbstractWALEntryStream#createReader calls WALProvider#createReader.

wal-splitter-reader.jpg shows how the WAL Reader is created for log splitting:

WALSplitter#getReader calls WALProvider#createReader.
Below WALProvider, AbstractFSWALProvider and DisabledWALProvider are shown, 
which implement the WALProvider interface.
AsyncFSWALProvider and FSHLogProvider extend AbstractFSWALProvider.

wal-splitter-writer.jpg shows how the WAL Writer is created for log splitting.
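
Put roughly, both consumers end up going through the provider's single reader 
entry point. A simplified sketch with stand-in signatures (not the exact ones 
in the patch):

{code:java}
import java.io.IOException;

// Stand-in interfaces for illustration only.
interface WALIdentitySketch {
  String getName(); // a filename on a filesystem, or a stream name
}

interface WALReaderSketch extends AutoCloseable {
  Object next() throws IOException; // next WAL entry, or null at end of WAL
}

interface WALProviderSketch {
  WALReaderSketch createReader(WALIdentitySketch id) throws IOException;
}

// Replication source side: the entry stream asks the provider for a reader
// (AbstractWALEntryStream#createReader -> WALProvider#createReader).
class WALEntryStreamSketch {
  private final WALProviderSketch provider;

  WALEntryStreamSketch(WALProviderSketch provider) {
    this.provider = provider;
  }

  WALReaderSketch openReader(WALIdentitySketch current) throws IOException {
    return provider.createReader(current);
  }
}

// Log splitting side: the splitter uses the same provider entry point
// (WALSplitter#getReader -> WALProvider#createReader).
class WALSplitterSketch {
  private final WALProviderSketch provider;

  WALSplitterSketch(WALProviderSketch provider) {
    this.provider = provider;
  }

  WALReaderSketch getReader(WALIdentitySketch wal) throws IOException {
    return provider.createReader(wal);
  }
}
{code}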





> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment, or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: replication-src-creates-wal-reader.jpg

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment, or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: wal-splitter-writer.jpg
wal-splitter-reader.jpg

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment, or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on

2018-11-05 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675433#comment-16675433
 ] 

Hudson commented on HBASE-21395:


Results for branch branch-2.1
[build #580 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Abort split/merge procedure if there is a table procedure of the same table 
> going on
> 
>
> Key: HBASE-21395
> URL: https://issues.apache.org/jira/browse/HBASE-21395
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21395.branch-2.0.001.patch, 
> HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, 
> HBASE-21395.branch-2.0.004.patch
>
>
> In my ITBLL runs, I often see that if a split/merge procedure and a table 
> procedure (like ModifyTableProcedure) happen at the same time, race 
> conditions between these two kinds of procedures can cause serious problems, 
> e.g. the split/merged parent is brought back online by the table procedure, 
> or the split/merged region makes the whole table procedure roll back.
> Talked with [~Apache9] offline today; this kind of problem is solved in 
> branch-2+ since there is a fence so that only one RTSP can run against a 
> single region at the same time.
> To keep out of the mess in branch-2.0 and branch-2.1, I added a simple safety 
> fence in the split/merge procedure: if there is a table procedure going on 
> against the same table, then abort the split/merge procedure. Aborting the 
> split/merge procedure at the beginning of its execution is no big deal 
> compared with the mess it would otherwise cause...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-05 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675434#comment-16675434
 ] 

Hudson commented on HBASE-21423:


Results for branch branch-2.1
[build #580 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch
>
>
> We give meta table procedures a higher priority, but only at the queue level. 
> There is a case where the meta table is closed and an AssignProcedure (or 
> RTSP in branch-2+) is waiting to be executed, but at the same time all the 
> worker threads are executing procedures that need to write to the meta table; 
> then all the workers get stuck retrying the meta writes, and no worker will 
> take the AP for meta.
> Though we have a mechanism that detects the stall and adds more ''KeepAlive'' 
> workers to the pool to resolve it, by then we have already been stuck for a 
> long time.
> This is a real case I encountered in ITBLL.
> So, I add one 'urgent' worker to the ProcedureExecutor which only takes meta 
> procedures (other workers can take meta procedures too), which resolves this 
> kind of stall.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19953) Avoid calling post* hook when procedure fails

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675435#comment-16675435
 ] 

Josh Elser commented on HBASE-19953:


{quote}
After a careful think, I think we don't need to revert this(I've already 
changed the comment above), we only need to turn ModifyTable to a async 
op(Which is the only sync DDL for 2.x client now). Josh Elser you can see 
HMaster.truncateTable. We also use ProcedurePrepareLatch.createLatch(2, 0) to 
make sure the 2.x client won't sync wait here.
Uploaded a addendum to clarify my point.
{quote}

Thanks, Allan. I appreciate the clarification -- I felt bad lobbing a "-1" onto 
this as I did. Glad we can talk it out. The addendum is very helpful and I 
appreciate that too.

I need to refresh my head around the original problem and think about the 
solution you have suggested. I think it makes sense to me, but I need to make 
sure things still work as they did in 2.0.0 :). I'll do this now before I get 
pulled in another direction.

On that note, since this did go out in 2.0.0, can you make this change under a 
different Jira issue instead of as an addendum to this one? That would remove 
potential confusion about 2.0.3 having it but not 2.0.1 and 2.0.2.

> Avoid calling post* hook when procedure fails
> -
>
> Key: HBASE-19953
> URL: https://issues.apache.org/jira/browse/HBASE-19953
> Project: HBase
>  Issue Type: Bug
>  Components: master, proc-v2
>Reporter: Ramesh Mani
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.0-beta-2, 2.0.0
>
> Attachments: HBASE-19952.001.branch-2.patch, 
> HBASE-19953.002.branch-2.patch, HBASE-19953.003.branch-2.patch, 
> HBASE-19953.branch-2.0.addendum.patch
>
>
> Ramesh pointed out a case where I think we're mishandling some post\* 
> MasterObserver hooks. Specifically, I'm looking at the deleteNamespace.
> We synchronously execute the DeleteNamespace procedure. When the user 
> provides a namespace that isn't empty, the procedure does a rollback (which 
> is just a no-op), but this doesn't propagate an exception up to the 
> NonceProcedureRunnable in {{HMaster#deleteNamespace}}. It took Ramesh 
> spelling it out for me to see that the code executes a bit differently than 
> we actually expect.
> I think we need to double-check our post hooks and make sure we aren't 
> invoking them when the procedure actually failed. cc/ [~Apache9], [~stack].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675431#comment-16675431
 ] 

Hadoop QA commented on HBASE-21421:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2.0 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
 2s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
46s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
55s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
11s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} branch-2.0 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
10s{color} | {color:red} hbase-server: The patch generated 1 new + 25 unchanged 
- 0 fixed = 26 total (was 25) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
58s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 14s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}118m 
21s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
22s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m 30s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-21421 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946919/HBASE-21421.branch-2.0.004.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 21b01692edfa 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git revision | branch-2.0 / d4233f207d |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14958/artifact/patchprocess/diff-checkstyle-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14958/testReport/ |
| Max. process+thread count | 3957 (vs. ulimit of 1) |
| modules | C: hba

[jira] [Updated] (HBASE-21247) Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers

2018-11-05 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21247:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks for the review, Sean and Josh.

> Custom WAL Provider cannot be specified by configuration whose value is 
> outside the enums in Providers
> --
>
> Key: HBASE-21247
> URL: https://issues.apache.org/jira/browse/HBASE-21247
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, 
> 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, 
> 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt
>
>
> Currently all the WAL Providers acceptable to hbase are specified in the 
> Providers enum of WALFactory.
> This restricts the ability for additional WAL Providers to be supplied by 
> class name.
> This issue fixes the bug by allowing the specification of a new WAL Provider 
> class name using the config "hbase.wal.provider".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on

2018-11-05 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675398#comment-16675398
 ] 

Hudson commented on HBASE-21395:


Results for branch branch-2.0
[build #1060 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Abort split/merge procedure if there is a table procedure of the same table 
> going on
> 
>
> Key: HBASE-21395
> URL: https://issues.apache.org/jira/browse/HBASE-21395
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21395.branch-2.0.001.patch, 
> HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, 
> HBASE-21395.branch-2.0.004.patch
>
>
> In my ITBLL runs, I often see that if a split/merge procedure and a table 
> procedure (like ModifyTableProcedure) happen at the same time, race 
> conditions between these two kinds of procedures can cause serious problems, 
> e.g. the split/merged parent is brought back online by the table procedure, 
> or the split/merged region makes the whole table procedure roll back.
> Talked with [~Apache9] offline today; this kind of problem is solved in 
> branch-2+ since there is a fence so that only one RTSP can run against a 
> single region at the same time.
> To keep out of the mess in branch-2.0 and branch-2.1, I added a simple safety 
> fence in the split/merge procedure: if there is a table procedure going on 
> against the same table, then abort the split/merge procedure. Aborting the 
> split/merge procedure at the beginning of its execution is no big deal 
> compared with the mess it would otherwise cause...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-05 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675399#comment-16675399
 ] 

Hudson commented on HBASE-21423:


Results for branch branch-2.0
[build #1060 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch
>
>
> We give meta table procedures a higher priority, but only at the queue level. 
> There is a case where the meta table is closed and an AssignProcedure (or 
> RTSP in branch-2+) is waiting to be executed, but at the same time all the 
> worker threads are executing procedures that need to write to the meta table; 
> then all the workers get stuck retrying the meta writes, and no worker will 
> take the AP for meta.
> Though we have a mechanism that detects the stall and adds more ''KeepAlive'' 
> workers to the pool to resolve it, by then we have already been stuck for a 
> long time.
> This is a real case I encountered in ITBLL.
> So, I add one 'urgent' worker to the ProcedureExecutor which only takes meta 
> procedures (other workers can take meta procedures too), which resolves this 
> kind of stall.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20952) Re-visit the WAL API

2018-11-05 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675392#comment-16675392
 ] 

Josh Elser commented on HBASE-20952:


{quote}
Don't see much movement on the feature branch for this. I'd like to disable the 
nightly test job for it until development picks up again.

Please shout if you'd prefer it stay on. If so, please plan to clean up 
failures.
{quote}

"Shout". Will refresh from master.
 

> Re-visit the WAL API
> 
>
> Key: HBASE-20952
> URL: https://issues.apache.org/jira/browse/HBASE-20952
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Josh Elser
>Priority: Major
> Attachments: 20952.v1.txt
>
>
> Take a step back from the current WAL implementations and think about what an 
> HBase WAL API should look like. What are the primitive calls that we require 
> to guarantee durability of writes with a high degree of performance?
> The API needs to take the current implementations into consideration. We 
> should also have a mind for what is happening in the Ratis LogService (but 
> the LogService should not dictate what HBase's WAL API looks like RATIS-272).
> Other "systems" inside of HBase that use WALs are replication and 
> backup&restore. Replication has the use-case for "tail"'ing the WAL which we 
> should provide via our new API. B&R doesn't do anything fancy (IIRC). We 
> should make sure all consumers are generally going to be OK with the API we 
> create.
> The API may be "OK" (or OK in a part). We need to also consider other methods 
> which were "bolted" on such as {{AbstractFSWAL}} and 
> {{WALFileLengthProvider}}. Other corners of "WAL use" (like the 
> {{WALSplitter}}) should also be looked at to use WAL-APIs only.
> We also need to make sure that adequate interface audience and stability 
> annotations are chosen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21247) Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers

2018-11-05 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675358#comment-16675358
 ] 

Sean Busbey commented on HBASE-21247:
-

+1

nit:
{code}
  public Class getProviderClass(String key,
      Class defaultVal) {
    if (conf.get(key) == null) {
      return conf.getClass(key, defaultVal, WALProvider.class);
    }
{code}

I think this extra call to {{getClass}} instead of just returning 
{{defaultVal}} (since we already know {{key}} doesn't map to a value) is 
confusing, but I suppose this is _technically_ more correct in case we have 
some kind of special {{Configuration}} implementation that has its own opinion 
about how fallback to the passed default class works.
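
For comparison, the simpler shape would be to return the default directly when 
the key is unset. A sketch only, not the committed code:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.wal.WALProvider;

class ProviderLookupSketch {
  private final Configuration conf;

  ProviderLookupSketch(Configuration conf) {
    this.conf = conf;
  }

  // If the key is unset, hand back the caller's default directly; only go
  // through Configuration#getClass when a value is actually present.
  Class<? extends WALProvider> getProviderClass(String key,
      Class<? extends WALProvider> defaultVal) {
    if (conf.get(key) == null) {
      return defaultVal;
    }
    return conf.getClass(key, defaultVal, WALProvider.class);
  }
}
{code}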

> Custom WAL Provider cannot be specified by configuration whose value is 
> outside the enums in Providers
> --
>
> Key: HBASE-21247
> URL: https://issues.apache.org/jira/browse/HBASE-21247
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, 
> 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, 
> 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt
>
>
> Currently all the WAL Providers acceptable to hbase are specified in the 
> Providers enum of WALFactory.
> This restricts the ability for additional WAL Providers to be supplied by 
> class name.
> This issue fixes the bug by allowing the specification of a new WAL Provider 
> class name using the config "hbase.wal.provider".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20952) Re-visit the WAL API

2018-11-05 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675317#comment-16675317
 ] 

Sean Busbey commented on HBASE-20952:
-

Don't see much movement on the feature branch for this. I'd like to disable the 
nightly test job for it until development picks up again.

Please shout if you'd prefer it stay on. If so, please plan to clean up 
failures.

> Re-visit the WAL API
> 
>
> Key: HBASE-20952
> URL: https://issues.apache.org/jira/browse/HBASE-20952
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Josh Elser
>Priority: Major
> Attachments: 20952.v1.txt
>
>
> Take a step back from the current WAL implementations and think about what an 
> HBase WAL API should look like. What are the primitive calls that we require 
> to guarantee durability of writes with a high degree of performance?
> The API needs to take the current implementations into consideration. We 
> should also have a mind for what is happening in the Ratis LogService (but 
> the LogService should not dictate what HBase's WAL API looks like RATIS-272).
> Other "systems" inside of HBase that use WALs are replication and 
> backup&restore. Replication has the use-case for "tail"'ing the WAL which we 
> should provide via our new API. B&R doesn't do anything fancy (IIRC). We 
> should make sure all consumers are generally going to be OK with the API we 
> create.
> The API may be "OK" (or OK in a part). We need to also consider other methods 
> which were "bolted" on such as {{AbstractFSWAL}} and 
> {{WALFileLengthProvider}}. Other corners of "WAL use" (like the 
> {{WALSplitter}}) should also be looked at to use WAL-APIs only.
> We also need to make sure that adequate interface audience and stability 
> annotations are chosen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675301#comment-16675301
 ] 

Hadoop QA commented on HBASE-21421:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2.0 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
43s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
43s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
10s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
57s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
38s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} branch-2.0 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
52s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
14s{color} | {color:red} hbase-server: The patch generated 1 new + 25 unchanged 
- 0 fixed = 26 total (was 25) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
14s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 13s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}119m 
23s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
20s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}155m 34s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-21421 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946904/HBASE-21421.branch-2.0.003.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 6890e9934077 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 
14:43:09 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | branch-2.0 / d4233f207d |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14956/artifact/patchprocess/diff-checkstyle-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14956/testReport/ |
| Max. process+thread count | 4353 (vs. ulimit of 1) |
| modules | C: hbase-

[jira] [Commented] (HBASE-21430) [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo

2018-11-05 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675265#comment-16675265
 ] 

Sean Busbey commented on HBASE-21430:
-

just send them to the existing {{issues@}} list.

I do think its own repo would give us more options around how we maintain 
compatibility with various spark release lines. I'm up for whatever folks want 
to do if they're doing the work though. :)

> [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo
> -
>
> Key: HBASE-21430
> URL: https://issues.apache.org/jira/browse/HBASE-21430
> Project: HBase
>  Issue Type: Bug
>  Components: hbase-connectors, spark
>Reporter: stack
>Assignee: stack
>Priority: Major
>
> Exploring moving the spark modules out of core hbase and into 
> hbase-connectors. Perhaps spark is deserving of its own repo (I think 
> [~busbey] was on about this) but meantime, experimenting w/ having it out in 
> hbase-connectors.
> Here is thread on spark integration 
> https://lists.apache.org/thread.html/fd74ef9b9da77abf794664f06ea19c839fb3d543647fb29115081683@%3Cdev.hbase.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page

2018-11-05 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675254#comment-16675254
 ] 

Sean Busbey commented on HBASE-21427:
-

Related or not, I agree that nightly should be run once a day. The flaky job, 
though, loses a lot of utility if it isn't run very often. By definition these 
are tests that need multiple runs to surface failures, so multiple parallel 
catch-up runs are a feature rather than a bug.

> Make it so flakie's and nightlies build rates are configurable from jenkins 
> Config page
> ---
>
> Key: HBASE-21427
> URL: https://issues.apache.org/jira/browse/HBASE-21427
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Reporter: stack
>Priority: Major
>
> Request is that rather than have to change code whenever we want to change 
> build rates, would be easier doing change in jenkins Config screen.
> My guess is easy enough to do... .
> ^^
> [~busbey] Lets chat (or I could bang my head)
> See HBASE-21424 for a bit of background.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page

2018-11-05 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675250#comment-16675250
 ] 

Sean Busbey commented on HBASE-21427:
-

IIRC it was not easy to do this because of how multi-branch pipelines work in 
jenkins.

What if we moved this stuff into its own repo, something like {{hbase-ci}}? We 
could probably remove most of {{dev-support}}. We'd need to refactor things a 
bit since the job would have to manage which branches are tested itself, but 
we'd gain the ability to throttle ourselves project-wide, as opposed to now 
where it's per-branch.

I went the per-branch route originally because it's what the jenkins docs 
encourage and it means we automagically have an equivalent amount of testing of 
new feature branches as the main branch. But I don't think folks are aware of 
the amount of testing done by default when they make a branch. Centralizing 
things would flip this around and make folks opt-in to it. We'd need to 
document that as a part of docs about "how to make a feature branch".

> Make it so flakie's and nightlies build rates are configurable from jenkins 
> Config page
> ---
>
> Key: HBASE-21427
> URL: https://issues.apache.org/jira/browse/HBASE-21427
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Reporter: stack
>Priority: Major
>
> Request is that rather than have to change code whenever we want to change 
> build rates, would be easier doing change in jenkins Config screen.
> My guess is easy enough to do... .
> ^^
> [~busbey] Lets chat (or I could bang my head)
> See HBASE-21424 for a bit of background.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675249#comment-16675249
 ] 

Hadoop QA commented on HBASE-21314:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
17s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
12s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
25s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
14s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
12s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
10m 45s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
14s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
19s{color} | {color:green} hbase-procedure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 36m 12s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21314 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946913/HBASE-21314-v1.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 19f8bdee45dc 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / c8574ba3c5 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14957/testReport/ |
| Max. process+thread count | 286 (vs. ulimit of 1) |
| modules | C: hbase-procedure U: hbase-procedure |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14957/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> The i

[jira] [Comment Edited] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675244#comment-16675244
 ] 

Allan Yang edited comment on HBASE-21437 at 11/5/18 2:33 PM:
-

Maybe we can't resubmit a WAIT_TIMEOUT procedure directly when bypassing, since 
it should be woken by the timeout. Let the setTimeoutFailure method turn it 
into the RUNNABLE state.


was (Author: allan163):
Maybe we can't resubmit a WAIT_TIMEOUT procedure when bypassing, since it 
should be wakend by the timeout. Let the setTimeoutFailure method turn it into 
RUNNABLE state.

> Bypassed procedure throw IllegalArgumentException when its state is 
> WAITING_TIMEOUT
> ---
>
> Key: HBASE-21437
> URL: https://issues.apache.org/jira/browse/HBASE-21437
> Project: HBase
>  Issue Type: Bug
>Reporter: Jingyun Tian
>Assignee: Jingyun Tian
>Priority: Major
>
> {code}
> 2018-11-05,18:25:52,735 WARN 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating 
> UNNATURALLY null
> java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, 
> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, 
> bypass=true; TransitRegionStateProcedure table=test_fail
> over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN
> at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948)
> 2018-11-05,18:25:52,736 TRACE 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated.
> {code}
> Since when we bypassed a WAITING_TIMEOUT procedure and resubmit it, its state 
> is still WAITING_TIMEOUT, then when executor run this procedure, it will 
> throw exception and cause worker terminated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675244#comment-16675244
 ] 

Allan Yang commented on HBASE-21437:


Maybe we can't resubmit a WAIT_TIMEOUT procedure when bypassing, since it 
should be woken by the timeout. Let the setTimeoutFailure method turn it into 
the RUNNABLE state.
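
For context, a minimal self-contained sketch of the idea above (toy classes, 
not HBase's actual Procedure/ProcedureExecutor API): when a bypassed procedure 
is woken by its timeout, it is flipped back to RUNNABLE before being handed to 
a worker, so the worker's "must be RUNNABLE" precondition no longer trips.
{code}
enum ProcState { RUNNABLE, WAITING_TIMEOUT, FINISHED }

class ToyProcedure {
  ProcState state = ProcState.WAITING_TIMEOUT;
  boolean bypassed = true;

  // Analogue of the suggested setTimeoutFailure() behaviour: for a bypassed
  // procedure the timeout does not mean failure, it just makes it runnable.
  boolean onTimeout() {
    if (bypassed) {
      state = ProcState.RUNNABLE;
      return true; // caller adds the procedure back to the run queue
    }
    return false;  // normal timeout-failure handling
  }
}

class ToyExecutor {
  void execute(ToyProcedure p) {
    // Same precondition that threw IllegalArgumentException in the report.
    if (p.state != ProcState.RUNNABLE) {
      throw new IllegalArgumentException("NOT RUNNABLE! state=" + p.state);
    }
    p.state = ProcState.FINISHED; // run one step in this toy model
  }
}
{code}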

> Bypassed procedure throw IllegalArgumentException when its state is 
> WAITING_TIMEOUT
> ---
>
> Key: HBASE-21437
> URL: https://issues.apache.org/jira/browse/HBASE-21437
> Project: HBase
>  Issue Type: Bug
>Reporter: Jingyun Tian
>Assignee: Jingyun Tian
>Priority: Major
>
> {code}
> 2018-11-05,18:25:52,735 WARN 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating 
> UNNATURALLY null
> java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, 
> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, 
> bypass=true; TransitRegionStateProcedure table=test_fail
> over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN
> at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948)
> 2018-11-05,18:25:52,736 TRACE 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated.
> {code}
> Since when we bypassed a WAITING_TIMEOUT procedure and resubmit it, its state 
> is still WAITING_TIMEOUT, then when executor run this procedure, it will 
> throw exception and cause worker terminated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page

2018-11-05 Thread Sean Busbey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-21427:

Issue Type: Improvement  (was: Bug)

> Make it so flakie's and nightlies build rates are configurable from jenkins 
> Config page
> ---
>
> Key: HBASE-21427
> URL: https://issues.apache.org/jira/browse/HBASE-21427
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Reporter: stack
>Priority: Major
>
> Request is that rather than have to change code whenever we want to change 
> build rates, would be easier doing change in jenkins Config screen.
> My guess is easy enough to do... .
> ^^
> [~busbey] Lets chat (or I could bang my head)
> See HBASE-21424 for a bit of background.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page

2018-11-05 Thread Sean Busbey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-21427:

Component/s: test

> Make it so flakie's and nightlies build rates are configurable from jenkins 
> Config page
> ---
>
> Key: HBASE-21427
> URL: https://issues.apache.org/jira/browse/HBASE-21427
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Reporter: stack
>Priority: Major
>
> Request is that rather than have to change code whenever we want to change 
> build rates, would be easier doing change in jenkins Config screen.
> My guess is easy enough to do... .
> ^^
> [~busbey] Lets chat (or I could bang my head)
> See HBASE-21424 for a bit of background.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21421:
---
Attachment: HBASE-21421.branch-2.0.004.patch

> Do not kill RS if reportOnlineRegions fails
> ---
>
> Key: HBASE-21421
> URL: https://issues.apache.org/jira/browse/HBASE-21421
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21421.branch-2.0.001.patch, 
> HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch, 
> HBASE-21421.branch-2.0.004.patch
>
>
> In the periodic regionServerReport from RS to master, we will call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has a 
> same state with Master. If RS holds a region which master think should be on 
> another RS, the Master will kill the RS.
> But, the regionServerReport could be lagging(due to network or something), 
> which can't represent the current state of RegionServer. Besides, we will 
> call reportRegionStateTransition and try forever until it successfully 
> reported to master  when online a region. We can count on 
> reportRegionStateTransition calls.
> I have encountered cases that the regions are closed on the RS and  
> reportRegionStateTransition to master successfully. But later, a lagging 
> regionServerReport tells the master the region is online on the RS(Which is 
> not at the moment, this call may generated some time ago and delayed by 
> network somehow), the the master think the region should be on another RS, 
> and kill the RS, which should not be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21314:
---
Comment: was deleted

(was: Got it, +1 for the V3 patch. )

> The implementation of BitSetNode is not efficient
> -
>
> Key: HBASE-21314
> URL: https://issues.apache.org/jira/browse/HBASE-21314
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, 
> HBASE-21314.patch
>
>
> As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we 
> could only have one word(long) for each BitSetNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675200#comment-16675200
 ] 

Allan Yang commented on HBASE-21314:


Got it, +1 for the V3 patch. 

> The implementation of BitSetNode is not efficient
> -
>
> Key: HBASE-21314
> URL: https://issues.apache.org/jira/browse/HBASE-21314
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, 
> HBASE-21314.patch
>
>
> As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we 
> could only have one word(long) for each BitSetNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT

2018-11-05 Thread Jingyun Tian (JIRA)
Jingyun Tian created HBASE-21437:


 Summary: Bypassed procedure throw IllegalArgumentException when 
its state is WAITING_TIMEOUT
 Key: HBASE-21437
 URL: https://issues.apache.org/jira/browse/HBASE-21437
 Project: HBase
  Issue Type: Bug
Reporter: Jingyun Tian
Assignee: Jingyun Tian


{code}
2018-11-05,18:25:52,735 WARN 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating 
UNNATURALLY null
java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, 
state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, bypass=true; 
TransitRegionStateProcedure table=test_fail
over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN
at 
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948)
2018-11-05,18:25:52,736 TRACE 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated.
{code}

When we bypass a WAITING_TIMEOUT procedure and resubmit it, its state is still 
WAITING_TIMEOUT, so when the executor runs the procedure it throws the 
exception above and the worker terminates.






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675194#comment-16675194
 ] 

Allan Yang commented on HBASE-21314:


Got it, +1 for the V3 patch. 

> The implementation of BitSetNode is not efficient
> -
>
> Key: HBASE-21314
> URL: https://issues.apache.org/jira/browse/HBASE-21314
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, 
> HBASE-21314.patch
>
>
> As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we 
> could only have one word(long) for each BitSetNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675176#comment-16675176
 ] 

Duo Zhang commented on HBASE-21314:
---

Added comments in isDeleted and isModified.

And for the Math.abs question, I replied on RB:
{quote}
Because the BitSetNode can grow in both directions. If it grows to the left you 
should check against the end, while if it grows to the right you should check 
against the start. So it is not a simple abs.
{quote}
Thanks.
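
For illustration only (hypothetical names and limit, not the actual BitSetNode 
code), the point is that the containment check depends on which side of the 
node the candidate id falls on:
{code}
final class NodeRange {
  private static final long MAX_NODE_SIZE = 64; // assumed limit for the sketch
  private final long start; // inclusive
  private final long end;   // exclusive

  NodeRange(long start, long end) {
    this.start = start;
    this.end = end;
  }

  // Growing to the right is bounded by the distance from start; growing to
  // the left is bounded by the distance from end, so a single
  // Math.abs(procId - start) test is not sufficient.
  boolean canGrowTo(long procId) {
    return procId >= start ? procId - start < MAX_NODE_SIZE
                           : end - procId <= MAX_NODE_SIZE;
  }
}
{code}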

> The implementation of BitSetNode is not efficient
> -
>
> Key: HBASE-21314
> URL: https://issues.apache.org/jira/browse/HBASE-21314
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, 
> HBASE-21314.patch
>
>
> As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we 
> could only have one word(long) for each BitSetNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21314:
--
Attachment: HBASE-21314-v1.patch

> The implementation of BitSetNode is not efficient
> -
>
> Key: HBASE-21314
> URL: https://issues.apache.org/jira/browse/HBASE-21314
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, 
> HBASE-21314.patch
>
>
> As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we 
> could only have one word(long) for each BitSetNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21420) Use procedure event to wake up the SyncReplicationReplayWALProcedures which wait for worker

2018-11-05 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21420:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Pushed to master.

Thanks [~zghaobac] for reviewing.

> Use procedure event to wake up the SyncReplicationReplayWALProcedures which 
> wait for worker
> ---
>
> Key: HBASE-21420
> URL: https://issues.apache.org/jira/browse/HBASE-21420
> Project: HBase
>  Issue Type: Sub-task
>  Components: Replication
>Reporter: Guanghao Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21420-v1.patch, HBASE-21420-v2.patch, 
> HBASE-21420.patch
>
>
> Now if a SyncReplicationReplayWALProcedure failed to get a worker, it will 
> sleep backoff and retry. So when the finished 
> SyncReplicationReplayWALProcedure release a new worker, it will take a long 
> time to run and get the worker to run.
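
A rough, self-contained sketch of the wait/wake idea (hypothetical names, not 
HBase's actual ProcedureEvent API): procedures that fail to get a worker are 
parked on an event, and releasing a worker wakes all of them at once instead 
of each one sleeping out its own backoff.
{code}
import java.util.ArrayList;
import java.util.List;

class WorkerFreeEvent {
  private final List<String> suspended = new ArrayList<>();

  // Called by a procedure that could not acquire a replay worker.
  synchronized void suspend(String procName) {
    suspended.add(procName);
  }

  // Called when a finished procedure releases its worker: every parked
  // procedure is handed back to the scheduler immediately.
  synchronized List<String> wake() {
    List<String> toReschedule = new ArrayList<>(suspended);
    suspended.clear();
    return toReschedule;
  }
}
{code}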



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675156#comment-16675156
 ] 

Duo Zhang commented on HBASE-21421:
---

{code}
LOG.warn("Failed to checkOnlineRegionsReport, maybe due to network log, "
{code}
Change 'log' to 'lag'. And please remove the empty '@throw Exception'; I think 
it will cause a checkstyle warning.

No other problems. +1 after you fix these issues.

> Do not kill RS if reportOnlineRegions fails
> ---
>
> Key: HBASE-21421
> URL: https://issues.apache.org/jira/browse/HBASE-21421
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21421.branch-2.0.001.patch, 
> HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch
>
>
> In the periodic regionServerReport from RS to master, we will call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has a 
> same state with Master. If RS holds a region which master think should be on 
> another RS, the Master will kill the RS.
> But, the regionServerReport could be lagging(due to network or something), 
> which can't represent the current state of RegionServer. Besides, we will 
> call reportRegionStateTransition and try forever until it successfully 
> reported to master  when online a region. We can count on 
> reportRegionStateTransition calls.
> I have encountered cases that the regions are closed on the RS and  
> reportRegionStateTransition to master successfully. But later, a lagging 
> regionServerReport tells the master the region is online on the RS(Which is 
> not at the moment, this call may generated some time ago and delayed by 
> network somehow), the the master think the region should be on another RS, 
> and kill the RS, which should not be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21420) Use procedure event to wake up the SyncReplicationReplayWALProcedures which wait for worker

2018-11-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675120#comment-16675120
 ] 

Hadoop QA commented on HBASE-21420:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
28s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
42s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
31s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
21s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
8s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  3m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
19s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
11m 37s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green}  
2m  4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
38s{color} | {color:green} hbase-protocol-shaded in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
26s{color} | {color:green} hbase-replication in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
40s{color} | {color:green} hbase-procedure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}162m  
1s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}229m 33s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21420 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946885/HBASE-21420-v2

[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675100#comment-16675100
 ] 

Allan Yang commented on HBASE-21314:


{code}
-return Math.abs(procId - start) < MAX_NODE_SIZE;
{code}
Why not use abs? 
Can you add a comment in isDeleted and isModified here:
{code}
modified[wordIndex] & (1L << bitmapIndex)
{code}
We were all confused by the left shift before (the shift amount can be more than 64).
The patch looks good to me.
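
For readers following along, a small self-contained example of the indexing 
(toy class, not the real BitSetNode): in Java a long shift only uses the low 
6 bits of the shift amount, so (1L << 70) silently equals (1L << 6), which is 
why the bare shift was confusing before the explicit wordIndex split.
{code}
class TinyBitSet {
  private static final int BITS_PER_WORD = 64;
  private final long[] modified = new long[4]; // toy capacity: 256 bits

  void setModified(int bit) {
    int wordIndex = bit / BITS_PER_WORD;    // which long holds this bit
    int bitmapIndex = bit % BITS_PER_WORD;  // position inside that long, always < 64
    modified[wordIndex] |= (1L << bitmapIndex);
  }

  boolean isModified(int bit) {
    int wordIndex = bit / BITS_PER_WORD;
    int bitmapIndex = bit % BITS_PER_WORD;
    return (modified[wordIndex] & (1L << bitmapIndex)) != 0;
  }
}
{code}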

> The implementation of BitSetNode is not efficient
> -
>
> Key: HBASE-21314
> URL: https://issues.apache.org/jira/browse/HBASE-21314
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21314.patch, HBASE-21314.patch
>
>
> As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we 
> could only have one word(long) for each BitSetNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675087#comment-16675087
 ] 

Allan Yang commented on HBASE-21421:


[~Apache9], uploaded a V2 to address your advice, thanks.

> Do not kill RS if reportOnlineRegions fails
> ---
>
> Key: HBASE-21421
> URL: https://issues.apache.org/jira/browse/HBASE-21421
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21421.branch-2.0.001.patch, 
> HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch
>
>
> In the periodic regionServerReport from RS to master, we will call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has a 
> same state with Master. If RS holds a region which master think should be on 
> another RS, the Master will kill the RS.
> But, the regionServerReport could be lagging(due to network or something), 
> which can't represent the current state of RegionServer. Besides, we will 
> call reportRegionStateTransition and try forever until it successfully 
> reported to master  when online a region. We can count on 
> reportRegionStateTransition calls.
> I have encountered cases that the regions are closed on the RS and  
> reportRegionStateTransition to master successfully. But later, a lagging 
> regionServerReport tells the master the region is online on the RS(Which is 
> not at the moment, this call may generated some time ago and delayed by 
> network somehow), the the master think the region should be on another RS, 
> and kill the RS, which should not be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21421:
---
Attachment: HBASE-21421.branch-2.0.003.patch

> Do not kill RS if reportOnlineRegions fails
> ---
>
> Key: HBASE-21421
> URL: https://issues.apache.org/jira/browse/HBASE-21421
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21421.branch-2.0.001.patch, 
> HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch
>
>
> In the periodic regionServerReport from RS to master, we will call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has a 
> same state with Master. If RS holds a region which master think should be on 
> another RS, the Master will kill the RS.
> But, the regionServerReport could be lagging(due to network or something), 
> which can't represent the current state of RegionServer. Besides, we will 
> call reportRegionStateTransition and try forever until it successfully 
> reported to master  when online a region. We can count on 
> reportRegionStateTransition calls.
> I have encountered cases that the regions are closed on the RS and  
> reportRegionStateTransition to master successfully. But later, a lagging 
> regionServerReport tells the master the region is online on the RS(Which is 
> not at the moment, this call may generated some time ago and delayed by 
> network somehow), the the master think the region should be on another RS, 
> and kill the RS, which should not be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675082#comment-16675082
 ] 

Allan Yang commented on HBASE-21423:


Pushed to branch-2.1 and branch-2.0. Thanks for reviewing, [~Apache9].

> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch
>
>
> We have higher priority for meta table procedures, but only in queue level. 
> There is a case that the meta table is closed and a AssignProcedure(or RTSP 
> in branch-2+) is waiting there to be executed, but at the same time, all the 
> Work threads are executing procedures need to write to meta table, then all 
> the worker will be stuck and retry for writing meta, no worker will take the 
> AP for meta.
> Though we have a mechanism that will detect stuck and adding more 
> ''KeepAlive'' workers to the pool to resolve the stuck. It is already stuck a 
> long time.
> This is a real case I encountered in ITBLL.
> So, I add one 'Urgent work' to the ProceudureExecutor, which only take meta 
> procedures(other workers can take meta procedures too), which can resolve 
> this kind of stuck.
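
As a sketch of the 'urgent worker' idea (toy code with hypothetical names, not 
the real ProcedureExecutor): one extra worker only ever polls the meta queue, 
so a meta assignment can always make progress even when every general worker 
is stuck retrying meta writes.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ToyScheduler {
  final BlockingQueue<Runnable> metaQueue = new LinkedBlockingQueue<>();
  final BlockingQueue<Runnable> generalQueue = new LinkedBlockingQueue<>();

  // The urgent worker never touches the general queue.
  Thread startUrgentWorker() {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          metaQueue.take().run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "urgent-meta-worker");
    t.start();
    return t;
  }

  // General workers prefer meta work when present, otherwise take anything.
  Thread startGeneralWorker(int id) {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          Runnable task = metaQueue.poll();
          if (task == null) {
            task = generalQueue.take();
          }
          task.run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "general-worker-" + id);
    t.start();
    return t;
  }
}
{code}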



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-05 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21423:
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.0.3
   Status: Resolved  (was: Patch Available)

> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch
>
>
> We have higher priority for meta table procedures, but only in queue level. 
> There is a case that the meta table is closed and a AssignProcedure(or RTSP 
> in branch-2+) is waiting there to be executed, but at the same time, all the 
> Work threads are executing procedures need to write to meta table, then all 
> the worker will be stuck and retry for writing meta, no worker will take the 
> AP for meta.
> Though we have a mechanism that will detect stuck and adding more 
> ''KeepAlive'' workers to the pool to resolve the stuck. It is already stuck a 
> long time.
> This is a real case I encountered in ITBLL.
> So, I add one 'Urgent work' to the ProceudureExecutor, which only take meta 
> procedures(other workers can take meta procedures too), which can resolve 
> this kind of stuck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21255) [acl] Refactor TablePermission into three classes (Global, Namespace, Table)

2018-11-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675051#comment-16675051
 ] 

Hadoop QA commented on HBASE-21255:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 8 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
28s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
15s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
7s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
24s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} The patch passed checkstyle in hbase-common {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
33s{color} | {color:red} hbase-client: The patch generated 6 new + 79 unchanged 
- 30 fixed = 85 total (was 109) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
12s{color} | {color:green} hbase-server: The patch generated 0 new + 101 
unchanged - 53 fixed = 101 total (was 154) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
12s{color} | {color:green} The patch passed checkstyle in hbase-rsgroup {color} 
|
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
14s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
11m  4s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
31s{color} | {color:green} hbase-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
8s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}150m 
16s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
26s{color} | {color:green} hbase-rsgroup in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} 

[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-05 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675050#comment-16675050
 ] 

Duo Zhang commented on HBASE-21421:
---

Please log the full stack trace instead of just e.getMessage() (not your 
fault), and let's remove the code instead of commenting it out?
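
As a hedged illustration of both review points (a toy wrapper, not the actual 
patch; the slf4j Logger usage is the standard way to get the full stack trace 
into the log):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ReportHandlerSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ReportHandlerSketch.class);

  // Hypothetical stand-in for the master-side handling of a region server report.
  void handleReport(Runnable checkOnlineRegionsReport) {
    try {
      checkOnlineRegionsReport.run();
    } catch (Exception e) {
      // Passing the Throwable as the last argument makes slf4j log the full
      // stack trace, not just e.getMessage(); and the old kill logic should
      // be deleted rather than left commented out.
      LOG.warn("Failed to checkOnlineRegionsReport, maybe due to network lag, ignoring", e);
    }
  }
}
{code}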

> Do not kill RS if reportOnlineRegions fails
> ---
>
> Key: HBASE-21421
> URL: https://issues.apache.org/jira/browse/HBASE-21421
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21421.branch-2.0.001.patch, 
> HBASE-21421.branch-2.0.002.patch
>
>
> In the periodic regionServerReport from RS to master, we will call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has a 
> same state with Master. If RS holds a region which master think should be on 
> another RS, the Master will kill the RS.
> But, the regionServerReport could be lagging(due to network or something), 
> which can't represent the current state of RegionServer. Besides, we will 
> call reportRegionStateTransition and try forever until it successfully 
> reported to master  when online a region. We can count on 
> reportRegionStateTransition calls.
> I have encountered cases that the regions are closed on the RS and  
> reportRegionStateTransition to master successfully. But later, a lagging 
> regionServerReport tells the master the region is online on the RS(Which is 
> not at the moment, this call may generated some time ago and delayed by 
> network somehow), the the master think the region should be on another RS, 
> and kill the RS, which should not be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on

2018-11-05 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675047#comment-16675047
 ] 

Allan Yang commented on HBASE-21395:


Pushed to branch-2.0 and branch-2.1, thanks for reviewing, [~stack],[~xucang].

> Abort split/merge procedure if there is a table procedure of the same table 
> going on
> 
>
> Key: HBASE-21395
> URL: https://issues.apache.org/jira/browse/HBASE-21395
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21395.branch-2.0.001.patch, 
> HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, 
> HBASE-21395.branch-2.0.004.patch
>
>
> In my ITBLL, I often see that if split/merge procedure and table 
> procedure(like ModifyTableProcedure) happen at the same time, and since there 
> some race conditions between these two kind of procedures,  causing some 
> serious problems. e.g. the split/merged parent is bought on line by the table 
> procedure or the split merged region making the whole table procedure 
> rollback.
> Talked with [~Apache9] offline today, this kind of problem was solved in 
> branch-2+ since There is a fence that only one RTSP can agianst a single 
> region at the same time.
> To keep out of the mess in branch-2.0 and branch-2.1, I added a simple safe 
> fence in the split/merge procedure: If there is a table procedure going on 
> against the same table, then abort the split/merge procedure. Aborting the 
> split/merge procedure at the beginning of the execution is no big deal, 
> compared with the mess it will cause...
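
A minimal sketch of the fence described above (toy types and hypothetical 
names; the real patch works against the master's procedure list): before 
starting a split/merge, abort if any unfinished table procedure targets the 
same table.
{code}
import java.util.List;

final class SplitMergeFence {
  // Minimal stand-in for "is there an unfinished table procedure on this table?"
  static void checkNoTableProcedure(List<ToyTableProcedure> running, String table) {
    for (ToyTableProcedure p : running) {
      if (!p.finished && p.tableName.equals(table)) {
        // Aborting here, before any region state has changed, is cheap.
        throw new IllegalStateException(
            "A table procedure is running against " + table + ", abort split/merge");
      }
    }
  }
}

final class ToyTableProcedure {
  final String tableName;
  boolean finished;
  ToyTableProcedure(String tableName) { this.tableName = tableName; }
}
{code}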



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on

2018-11-05 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21395:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Abort split/merge procedure if there is a table procedure of the same table 
> going on
> 
>
> Key: HBASE-21395
> URL: https://issues.apache.org/jira/browse/HBASE-21395
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21395.branch-2.0.001.patch, 
> HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, 
> HBASE-21395.branch-2.0.004.patch
>
>
> In my ITBLL, I often see that if split/merge procedure and table 
> procedure(like ModifyTableProcedure) happen at the same time, and since there 
> some race conditions between these two kind of procedures,  causing some 
> serious problems. e.g. the split/merged parent is bought on line by the table 
> procedure or the split merged region making the whole table procedure 
> rollback.
> Talked with [~Apache9] offline today, this kind of problem was solved in 
> branch-2+ since There is a fence that only one RTSP can agianst a single 
> region at the same time.
> To keep out of the mess in branch-2.0 and branch-2.1, I added a simple safe 
> fence in the split/merge procedure: If there is a table procedure going on 
> against the same table, then abort the split/merge procedure. Aborting the 
> split/merge procedure at the beginning of the execution is no big deal, 
> compared with the mess it will cause...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

