[jira] [Updated] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang updated HBASE-21421: --- Resolution: Fixed Fix Version/s: 2.1.2 2.0.3 3.0.0 Status: Resolved (was: Patch Available) Pushed to branch-2.0+, thanks for reviewing, [~Apache9]. > Do not kill RS if reportOnlineRegions fails > --- > > Key: HBASE-21421 > URL: https://issues.apache.org/jira/browse/HBASE-21421 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 3.0.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21421.branch-2.0.001.patch, > HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch, > HBASE-21421.branch-2.0.004.patch > > > In the periodic regionServerReport from RS to master, we call > master.getAssignmentManager().reportOnlineRegions() to make sure the RS has the > same state as the Master. If the RS holds a region which the master thinks should be on > another RS, the Master will kill the RS. > But the regionServerReport could be lagging (due to the network or something else), > so it may not represent the current state of the RegionServer. Besides, when onlining a > region we call reportRegionStateTransition and retry forever until it is successfully > reported to the master, so we can rely on the reportRegionStateTransition calls instead. > I have encountered cases where regions were closed on the RS and > reportRegionStateTransition reached the master successfully, but later a lagging > regionServerReport told the master the region was online on that RS (which was > no longer true; the report may have been generated some time ago and delayed by the > network somehow), so the master thought the region should be on another RS > and killed the RS, which it should not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
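The guard described above can be sketched outside HBase as a tiny stale-report filter (all names here are hypothetical, not actual HBase API): the master records the timestamp of the last reportRegionStateTransition it accepted per region, and simply ignores any regionServerReport older than that instead of killing the RS.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: rather than killing an RS whose (possibly delayed)
// regionServerReport disagrees with the master, ignore reports that are
// older than the last accepted reportRegionStateTransition for the region.
public class StaleReportGuard {
    private final Map<String, Long> lastTransitionTs = new ConcurrentHashMap<>();

    /** Record the timestamp of a successfully applied state transition. */
    public void onRegionStateTransition(String regionName, long ts) {
        lastTransitionTs.merge(regionName, ts, Math::max);
    }

    /** True if the report is at least as fresh as the last transition. */
    public boolean acceptReport(String regionName, long reportTs) {
        return reportTs >= lastTransitionTs.getOrDefault(regionName, 0L);
    }
}
```

A report generated before the region was closed on the RS would then be dropped rather than treated as proof that the RS holds a region it should not.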
[jira] [Commented] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted
[ https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676234#comment-16676234 ] Allan Yang commented on HBASE-21440: Yes, I think it is possible, but it should be a very rare case: if an AssignProcedure is in the REGION_TRANSITION_FINISH state, it should have finished long ago, before the corresponding SCP begins to handle RIT. But it can still happen, e.g. when the meta table is also on the crashed server: the AssignProcedure is stuck there retrying to update meta, and while it sleeps, the SCP calls remoteCallFailed on it. I think we can do this: if the procedure found in handleRIT is an AssignProcedure and it is in REGION_TRANSITION_FINISH, we should not remove it from the regions-to-assign list. We can rely on the AssignProcedure's state, since if it has not entered the REGION_TRANSITION_FINISH state by the time of handleRIT, it won't have a chance later (the server is already dead). > Assign procedure on the crashed server is not properly interrupted > -- > > Key: HBASE-21440 > URL: https://issues.apache.org/jira/browse/HBASE-21440 > Project: HBase > Issue Type: Bug >Reporter: Ankit Singhal >Assignee: Ankit Singhal >Priority: Major > Fix For: 2.0.2 > > > When a server crashes, its SCP checks whether there is already a procedure > assigning a region on the crashed server. If it finds one, the SCP just > interrupts the already running AssignProcedure by calling remoteCallFailed, > which internally changes the region node state to OFFLINE and sends the > procedure back through the transition queue for assignment with a new plan. 
> But due to a race between the remoteCallFailed call and the current state of the > already running assign procedure (REGION_TRANSITION_FINISH, where the region is > already opened), the assign procedure may go ahead and update the regionStateNode > to OPEN on a crashed server. > Since the SCP had already skipped this region for assignment, relying on the > existing assign procedure to do the right thing, this confusion leaves the > region in an inaccessible state.
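The check proposed in the comment above can be reduced to a small predicate (the names below are illustrative stand-ins, not the real HBase classes): while SCP walks the regions in transition, any racing assign procedure already in REGION_TRANSITION_FINISH cannot be safely interrupted, so its region stays on SCP's own assign list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HandleRitSketch {
    // Stand-in for AssignProcedure's transition states (illustrative only).
    public enum TransitionState { REGION_TRANSITION_QUEUE, REGION_TRANSITION_DISPATCH, REGION_TRANSITION_FINISH }

    /**
     * Given the racing assign procedures found in handleRIT (region -> state),
     * return the regions SCP must keep on its own assign list: a procedure
     * already in REGION_TRANSITION_FINISH may still mark the region OPEN on
     * the dead server, so SCP cannot rely on interrupting it.
     */
    public static List<String> regionsScpMustKeep(Map<String, TransitionState> found) {
        List<String> keep = new ArrayList<>();
        for (Map.Entry<String, TransitionState> e : found.entrySet()) {
            if (e.getValue() == TransitionState.REGION_TRANSITION_FINISH) {
                keep.add(e.getKey());
            }
        }
        return keep;
    }
}
```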
[jira] [Commented] (HBASE-21439) StochasticLoadBalancer RegionLoads aren’t being used in RegionLoad cost functions
[ https://issues.apache.org/jira/browse/HBASE-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676182#comment-16676182 ] stack commented on HBASE-21439: --- bq. Will post a patch assuming we want to pursue the original intention... Makes sense [~benlau]. Thanks. Do we make this same mistake elsewhere in the codebase? Thanks. > StochasticLoadBalancer RegionLoads aren’t being used in RegionLoad cost > functions > - > > Key: HBASE-21439 > URL: https://issues.apache.org/jira/browse/HBASE-21439 > Project: HBase > Issue Type: Bug > Components: Balancer >Affects Versions: 1.3.2.1, 2.0.2 >Reporter: Ben Lau >Assignee: Ben Lau >Priority: Major > > In StochasticLoadBalancer.updateRegionLoad() the region loads are being put > into the map with Bytes.toString(regionName). > First, this is a problem because Bytes.toString() assumes that the byte array > is a UTF8-encoded String, but there is no guarantee that the regionName bytes are > legal UTF8. > Secondly, in BaseLoadBalancer.registerRegion, we are reading the region loads > out of the load map not with Bytes.toString() but with > region.getRegionNameAsString() and region.getEncodedName(). So the load > balancer will not see or use any of the cluster's RegionLoad history. > There are 2 primary ways to solve this issue, assuming we want to stay with > String keys for the load map (which seems reasonable, as it aids debugging). We can > either fix updateRegionLoad to store the regionName as a String properly, or > we can update both the reader & writer to use a new common valid String > representation. > Will post a patch assuming we want to pursue the original intention, i.e. > store the regionName as a String for the load-map key, but I'm open to fixing this a > different way.
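The writer/reader mismatch described above can be reproduced in isolation. Here `new String(bytes, UTF_8)` stands in for `Bytes.toString()` (both decode as UTF-8, replacing illegal bytes with U+FFFD), and the reader-side key is a hypothetical rendering of `getRegionNameAsString()`; the point is only that the two keys differ, so every lookup misses.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RegionLoadKeyMismatch {
    public static void main(String[] args) {
        // A region name is arbitrary bytes; this one is not valid UTF-8.
        byte[] regionName = { 't', 'b', 'l', ',', (byte) 0x80, (byte) 0xFF, ',', '1' };

        // Writer side (updateRegionLoad): UTF-8 decoding replaces the two
        // illegal bytes with U+FFFD, producing a lossy key.
        String writerKey = new String(regionName, StandardCharsets.UTF_8);

        Map<String, String> loads = new HashMap<>();
        loads.put(writerKey, "regionLoad history");

        // Reader side (registerRegion): a different string form of the same
        // name (hypothetical escaped rendering), so the lookup finds nothing.
        String readerKey = "tbl,\\x80\\xFF,1";
        System.out.println(loads.containsKey(readerKey)); // false: load history is never used
    }
}
```

Once the two illegal bytes are replaced by U+FFFD on the writer side, no exact string form of the original name can ever hit that map entry again.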
[jira] [Commented] (HBASE-21430) [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo
[ https://issues.apache.org/jira/browse/HBASE-21430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676164#comment-16676164 ] stack commented on HBASE-21430: --- I put a notice up on the dev list that we will change gitbox so updates are mailed to issues@ rather than posted as JIRA comments. Here's the INFRA issue: INFRA-16886 > [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo > - > > Key: HBASE-21430 > URL: https://issues.apache.org/jira/browse/HBASE-21430 > Project: HBase > Issue Type: Bug > Components: hbase-connectors, spark >Reporter: stack >Assignee: stack >Priority: Major > > Exploring moving the spark modules out of core hbase and into > hbase-connectors. Perhaps spark is deserving of its own repo (I think > [~busbey] was on about this), but in the meantime, experimenting w/ having it out in > hbase-connectors. > Here is the thread on spark integration: > https://lists.apache.org/thread.html/fd74ef9b9da77abf794664f06ea19c839fb3d543647fb29115081683@%3Cdev.hbase.apache.org%3E
[jira] [Updated] (HBASE-21441) NPE if RS crashes between REFRESH_PEER_SYNC_REPLICATION_STATE_ON_RS_BEGIN and TRANSIT_PEER_NEW_SYNC_REPLICATION_STATE
[ https://issues.apache.org/jira/browse/HBASE-21441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21441: -- Description: {noformat} 2018-11-06,12:55:25,980 WARN [RpcServer.default.FPBQ.Fifo.handler=251,queue=11,port=17100] org.apache.hadoop.hbase.master.replication.RefreshPeerProcedure: Refresh peer TestPeer for TRANSIT_SYNC_REPLICATION_STATE on c4-hadoop-tst-st54.bj,17200,1541479922465 failed java.lang.NullPointerException via c4-hadoop-tst-st54.bj,17200,1541479922465:java.lang.NullPointerException: at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:124) at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2303) at java.util.ArrayList.forEach(ArrayList.java:1249) at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2298) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13149) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) Caused by: java.lang.NullPointerException: at org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.peerSyncReplicationStateChange(SyncReplicationWALProvider.java:303) at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.transitSyncReplicationPeerState(PeerProcedureHandlerImpl.java:216) at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:74) at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:34) at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:47) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} was: 2018-11-06,12:55:25,980 WARN [RpcServer.default.FPBQ.Fifo.handler=251,queue=11,port=17100] org.apache.hadoop.hbase.master.replication.RefreshPeerProcedure: Refresh peer TestPeer for TRANSIT_SYNC_REPLICATION_STATE on c4-hadoop-tst-st54.bj,17200,1541479922465 failed java.lang.NullPointerException via c4-hadoop-tst-st54.bj,17200,1541479922465:java.lang.NullPointerException: at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:124) at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2303) at java.util.ArrayList.forEach(ArrayList.java:1249) at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2298) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13149) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) Caused by: java.lang.NullPointerException: at org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.peerSyncReplicationStateChange(SyncReplicationWALProvider.java:303) at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.transitSyncReplicationPeerState(PeerProcedureHandlerImpl.java:216) at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:74) at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:34) at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:47) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > NPE if RS crashes between REFRESH_PEER_
[jira] [Created] (HBASE-21441) NPE if RS crashes between REFRESH_PEER_SYNC_REPLICATION_STATE_ON_RS_BEGIN and TRANSIT_PEER_NEW_SYNC_REPLICATION_STATE
Duo Zhang created HBASE-21441: - Summary: NPE if RS crashes between REFRESH_PEER_SYNC_REPLICATION_STATE_ON_RS_BEGIN and TRANSIT_PEER_NEW_SYNC_REPLICATION_STATE Key: HBASE-21441 URL: https://issues.apache.org/jira/browse/HBASE-21441 Project: HBase Issue Type: Sub-task Components: Replication Reporter: Duo Zhang Fix For: 3.0.0 2018-11-06,12:55:25,980 WARN [RpcServer.default.FPBQ.Fifo.handler=251,queue=11,port=17100] org.apache.hadoop.hbase.master.replication.RefreshPeerProcedure: Refresh peer TestPeer for TRANSIT_SYNC_REPLICATION_STATE on c4-hadoop-tst-st54.bj,17200,1541479922465 failed java.lang.NullPointerException via c4-hadoop-tst-st54.bj,17200,1541479922465:java.lang.NullPointerException: at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:124) at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2303) at java.util.ArrayList.forEach(ArrayList.java:1249) at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2298) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13149) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) Caused by: java.lang.NullPointerException: at org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.peerSyncReplicationStateChange(SyncReplicationWALProvider.java:303) at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.transitSyncReplicationPeerState(PeerProcedureHandlerImpl.java:216) at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:74) at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:34) at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:47) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676107#comment-16676107 ] Hudson commented on HBASE-21423: Results for branch branch-2.0 [build #1061 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1061//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch > > > We give meta table procedures higher priority, but only at the queue level. > There is a case where the meta table is closed and an AssignProcedure (or RTSP > in branch-2+) is waiting to be executed, but at the same time all the > worker threads are executing procedures that need to write to the meta table; then > all the workers are stuck retrying the meta writes, and no worker will take the > AP for meta. 
> Though we have a mechanism that detects the stall and adds more > ''KeepAlive'' workers to the pool to resolve it, by then the executor has already been > stuck for a long time. > This is a real case I encountered in ITBLL. > So I added one 'urgent worker' to the ProcedureExecutor, which takes only meta > procedures (other workers can take meta procedures too), and that resolves > this kind of stuck state.
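The "urgent worker" idea above can be sketched with plain executors (class and thread names are hypothetical, not the actual ProcedureExecutor internals): one dedicated thread drains only the meta queue, so a meta assign can always make progress even when every general worker is blocked retrying a meta write.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class UrgentMetaWorkerSketch {
    // Two queues: general procedures, and procedures touching meta.
    final BlockingQueue<Runnable> generalQueue = new LinkedBlockingQueue<>();
    final BlockingQueue<Runnable> metaQueue = new LinkedBlockingQueue<>();

    /** Start the one urgent worker; it never takes from the general queue. */
    public Thread startUrgentWorker() {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    metaQueue.take().run();
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // shut down quietly
            }
        }, "PEWorker-urgent-meta");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

General workers would poll both queues; the invariant is only that at least one worker can never be occupied by a non-meta procedure.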
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676090#comment-16676090 ] liubangchen commented on HBASE-21381: - Not at all. If there is anything else I can do, just say. Thanks. > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task >Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Fix For: 3.0.0 > > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, > HBASE-21381-3.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer 
files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. > This issue is to document the hadoop versions which contain HADOOP-15850 so > that users of the Backup and Restore feature know which hadoop versions they can > use.
[jira] [Updated] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21381: --- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0 Status: Resolved (was: Patch Available) Thanks for the patch, liubang. Thanks for the review, Wei-Chiu > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task >Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Fix For: 3.0.0 > > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, > HBASE-21381-3.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > 
{code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. > This issue is to document the hadoop versions which contain HADOOP-15850 so > that users of the Backup and Restore feature know which hadoop versions they can > use.
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676067#comment-16676067 ] Wei-Chiu Chuang commented on HBASE-21381: - +1 (non-binding) Thanks! > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task >Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, > HBASE-21381-3.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer files between clusters. 
> Without the fix from HADOOP-15850, the transfer would fail. > This issue is to document the hadoop versions which contain HADOOP-15850 so > that users of the Backup and Restore feature know which hadoop versions they can > use.
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676063#comment-16676063 ] Reid Chan commented on HBASE-21246: --- I'm reading the diagrams, thanks for the work! > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, > 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, > 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, > 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, > wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, > wal-splitter-writer.jpg > > > We are introducing the WALIdentity interface so that the WAL representation can > be decoupled from the distributed filesystem. > The interface provides a getName method whose return value can represent the > filename in a distributed filesystem environment, or the name of the stream > when the WAL is backed by a log stream.
[jira] [Commented] (HBASE-21255) [acl] Refactor TablePermission into three classes (Global, Namespace, Table)
[ https://issues.apache.org/jira/browse/HBASE-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676057#comment-16676057 ] Reid Chan commented on HBASE-21255: --- Thanks Sean! Would you mind taking a look at {{ShadedAccessControlUtil.class}}? I think the checkstyle warnings for this class can be ignored, WDYT? > [acl] Refactor TablePermission into three classes (Global, Namespace, Table) > > > Key: HBASE-21255 > URL: https://issues.apache.org/jira/browse/HBASE-21255 > Project: HBase > Issue Type: Improvement >Reporter: Reid Chan >Assignee: Reid Chan >Priority: Major > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21225.master.001.patch, > HBASE-21225.master.002.patch, HBASE-21255.master.003.patch, > HBASE-21255.master.004.patch, HBASE-21255.master.005.patch, > HBASE-21255.master.006.patch > > > A TODO in {{TablePermission.java}} > {code:java} > //TODO refactor this class > //we need to refacting this into three classes (Global, Table, Namespace) > {code} > Change Notes: > * Divide the original TablePermission into three classes: GlobalPermission, > NamespacePermission, TablePermission. > * The new UserPermission consists of a user name and a permission in one of > [Global, Namespace, Table]Permission. > * Rename TableAuthManager to AuthManager (it is IA.P), and rename some > methods for readability. > * Make PermissionCache thread safe, and change the ListMultiMap to a Set. > * The user cache and group cache in AuthManager are combined together. > * The wire proto is kept, so BC should be guaranteed. > * Fix HBASE-21390. > * Resolve a small {{TODO}}: global entries should be handled differently in > AccessControlLists.
[jira] [Commented] (HBASE-21347) Backport HBASE-21200 "Memstore flush doesn't finish because of seekToPreviousRow() in memstore scanner." to branch-1
[ https://issues.apache.org/jira/browse/HBASE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676052#comment-16676052 ] Ted Yu commented on HBASE-21347: lgtm > Backport HBASE-21200 "Memstore flush doesn't finish because of > seekToPreviousRow() in memstore scanner." to branch-1 > > > Key: HBASE-21347 > URL: https://issues.apache.org/jira/browse/HBASE-21347 > Project: HBase > Issue Type: Sub-task > Components: backport, Scanners >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Critical > Attachments: HBASE-21347.branch-1.001.patch > > > Backport parent issue to branch-1.
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676010#comment-16676010 ] Hadoop QA commented on HBASE-21381: --- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 48s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} refguide {color} | {color:blue} 4m 56s{color} | {color:blue} branch has no errors when building the reference guide. See footer for rendered docs, which you should manually inspect. {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:blue}0{color} | {color:blue} refguide {color} | {color:blue} 4m 55s{color} | {color:blue} patch has no errors when building the reference guide. See footer for rendered docs, which you should manually inspect. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 20m 8s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b | | JIRA Issue | HBASE-21381 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947001/HBASE-21381-3.patch | | Optional Tests | dupname asflicense refguide | | uname | Linux 8d0b2729e890 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | master / 01603278a3 | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | refguide | https://builds.apache.org/job/PreCommit-HBASE-Build/14960/artifact/patchprocess/branch-site/book.html | | refguide | https://builds.apache.org/job/PreCommit-HBASE-Build/14960/artifact/patchprocess/patch-site/book.html | | Max. process+thread count | 97 (vs. ulimit of 1) | | modules | C: . U: . | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/14960/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task >Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, > HBASE-21381-3.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). 
> Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
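The "Inconsistent sequence file" failure above happens because CopyCommitter concatenated entries that were never chunks of the same source file. A minimal sketch of the continuity check that HADOOP-15850 introduces (hypothetical names, not the actual DistCp code): two listing entries should only be concatenated when they refer to the same source path and the second chunk starts exactly where the first one ended.

```java
// Illustrative sketch only; names do not match the real CopyCommitter code.
public class ChunkCheck {
    // An entry continues the previous one only if it is the same source file
    // and its offset picks up where the previous chunk's data ended.
    static boolean isContinuation(String prevPath, long prevOffset, long prevLength,
                                  String currPath, long currOffset) {
        return prevPath.equals(currPath) && currOffset == prevOffset + prevLength;
    }
}
```

In the log above, the two entries have different file names and independent lengths (5100 vs 5142), so a check like this would refuse to concatenate them.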
[jira] [Commented] (HBASE-19953) Avoid calling post* hook when procedure fails
[ https://issues.apache.org/jira/browse/HBASE-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676002#comment-16676002 ] Allan Yang commented on HBASE-19953: [~elserj], actually postDeleteNamespace still has the race condition you mentioned above, since the sync latch is released just after the DELETE_NAMESPACE_PREPARE state in DeleteNamespaceProcedure. So by the time postDeleteNamespace is called, the namespace may still exist... But anyway, let me open another issue for it; the thing I want to solve most is the RPC timeout when modifying tables. > Avoid calling post* hook when procedure fails > - > > Key: HBASE-19953 > URL: https://issues.apache.org/jira/browse/HBASE-19953 > Project: HBase > Issue Type: Bug > Components: master, proc-v2 >Reporter: Ramesh Mani >Assignee: Josh Elser >Priority: Critical > Fix For: 2.0.0-beta-2, 2.0.0 > > Attachments: HBASE-19952.001.branch-2.patch, > HBASE-19953.002.branch-2.patch, HBASE-19953.003.branch-2.patch, > HBASE-19953.branch-2.0.addendum.patch > > > Ramesh pointed out a case where I think we're mishandling some post\* > MasterObserver hooks. Specifically, I'm looking at deleteNamespace. > We synchronously execute the DeleteNamespace procedure. When the user > provides a namespace that isn't empty, the procedure does a rollback (which > is just a no-op), but this doesn't propagate an exception up to the > NonceProcedureRunnable in {{HMaster#deleteNamespace}}. It took Ramesh > pointing it out for me to see that the code executes a bit differently > than we actually expect. > I think we need to double-check our post hooks and make sure we aren't > invoking them when the procedure actually failed. cc/ [~Apache9], [~stack]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
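The ordering problem described in HBASE-19953 can be illustrated with a small sketch (all names here are assumptions, not the real HBase API): the post* coprocessor hook should run only once the procedure's outcome is known to be a success, and a failure should propagate instead.

```java
// Illustrative guard only; not HMaster's actual control flow.
public class PostHookGuard {
    interface Hook { void run(); }

    // Returns true and runs the post hook only when the procedure succeeded;
    // on failure the hook is skipped so callers can surface the error instead.
    static boolean runWithPostHook(boolean procedureSucceeded, Hook postHook) {
        if (!procedureSucceeded) {
            return false;          // propagate failure; never invoke the hook
        }
        postHook.run();            // reached only on success
        return true;
    }
}
```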
[jira] [Commented] (HBASE-21347) Backport HBASE-21200 "Memstore flush doesn't finish because of seekToPreviousRow() in memstore scanner." to branch-1
[ https://issues.apache.org/jira/browse/HBASE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675997#comment-16675997 ] Toshihiro Suzuki commented on HBASE-21347: -- Could you please review the patch? [~elserj] [~yuzhih...@gmail.com] [~apurtell] > Backport HBASE-21200 "Memstore flush doesn't finish because of > seekToPreviousRow() in memstore scanner." to branch-1 > > > Key: HBASE-21347 > URL: https://issues.apache.org/jira/browse/HBASE-21347 > Project: HBase > Issue Type: Sub-task > Components: backport, Scanners >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Critical > Attachments: HBASE-21347.branch-1.001.patch > > > Backport parent issue to branch-1. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liubangchen updated HBASE-21381: Attachment: HBASE-21381-3.patch > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task >Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, > HBASE-21381-3.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. 
> This issue is to document the hadoop versions which contain HADOOP-15850 so > that users of the Backup and Restore feature know which hadoop versions they can > use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675995#comment-16675995 ] Allan Yang commented on HBASE-21421: OK, let me fix the checkstyle issue on commit. > Do not kill RS if reportOnlineRegions fails > --- > > Key: HBASE-21421 > URL: https://issues.apache.org/jira/browse/HBASE-21421 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Attachments: HBASE-21421.branch-2.0.001.patch, > HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch, > HBASE-21421.branch-2.0.004.patch > > > In the periodic regionServerReport from the RS to the master, we call > master.getAssignmentManager().reportOnlineRegions() to make sure the RS has the > same state as the Master. If the RS holds a region which the master thinks should > be on another RS, the Master will kill the RS. > But the regionServerReport could be lagging (due to network delay or similar) and > may not represent the current state of the RegionServer. Besides, when onlining a > region we call reportRegionStateTransition and retry forever until it is > successfully reported to the master, so we can count on > reportRegionStateTransition calls. > I have encountered cases where regions were closed on the RS and > reportRegionStateTransition reached the master successfully, but later a lagging > regionServerReport told the master the region was online on the RS (which it was > not at that moment; the call may have been generated some time earlier and > delayed by the network), so the master thought the region should be on another > RS and killed the RS, which it should not have done. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
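The intent behind HBASE-21421 can be sketched in a few lines (a hedged illustration, not the actual patch): a regionServerReport generated before the last confirmed reportRegionStateTransition for a region cannot be trusted, so the master should not kill the RS based on it.

```java
// Illustrative only; the real fix lives in the AssignmentManager.
public class StaleReportFilter {
    // A report created before the last authoritative state transition may
    // describe a region state that no longer exists; ignore it in that case.
    static boolean shouldActOnReport(long reportCreatedTs, long lastTransitionTs) {
        return reportCreatedTs >= lastTransitionTs;
    }
}
```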
[jira] [Updated] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21314: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to branch-2.0+. Thanks [~allan163] for reviewing. > The implementation of BitSetNode is not efficient > - > > Key: HBASE-21314 > URL: https://issues.apache.org/jira/browse/HBASE-21314 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, > HBASE-21314.patch > > > MAX_NODE_SIZE is the same as BITS_PER_WORD, which means that we can have > only one word (long) for each BitSetNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
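The inefficiency described above — one 64-bit word per node — suggests a node backed by several words. A minimal sketch of that direction (illustrative code, not the actual BitSetNode):

```java
// Illustrative multi-word bit set; the real BitSetNode tracks procedure ids.
public class MultiWordBitSet {
    private final long[] words;              // each long covers 64 bits

    MultiWordBitSet(int nWords) { words = new long[nWords]; }

    void set(int bit)    { words[bit >>> 6] |= 1L << (bit & 63); }
    boolean get(int bit) { return (words[bit >>> 6] & (1L << (bit & 63))) != 0; }
}
```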
[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21438: -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.1.2 2.0.3 2.2.0 3.0.0 Status: Resolved (was: Patch Available) Pushed to branch-2.0+. Thanks [~tedyu]. > TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible > -- > > Key: HBASE-21438 > URL: https://issues.apache.org/jira/browse/HBASE-21438 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: 21438.v1.txt > > > From > https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/ > : > {code} > Mon Nov 05 04:52:13 UTC 2018, > RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, > org.apache.hadoop.hbase.procedure2.BadProcedureException: > org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class > org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and > have an empty constructor > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162) > at > org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
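The BadProcedureException in the stack trace above comes from a reflection check along these lines — a sketch of the idea, not ProcedureUtil's actual code: a procedure class must be public and expose a no-arg constructor so it can be re-instantiated from its serialized form.

```java
import java.lang.reflect.Modifier;

// Illustrative version of an "accessible with an empty constructor" check.
public class ClassCheck {
    static boolean isUsable(Class<?> clazz) {
        try {
            clazz.getDeclaredConstructor();   // throws if no empty constructor
        } catch (NoSuchMethodException e) {
            return false;
        }
        return Modifier.isPublic(clazz.getModifiers());
    }
}
```

FailedProcedure failed this kind of check; the fix is to make the class (and its no-arg constructor) accessible.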
[jira] [Commented] (HBASE-21247) Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675963#comment-16675963 ] Duo Zhang commented on HBASE-21247: --- If this is a bug then we should push it to all related branches? > Custom WAL Provider cannot be specified by configuration whose value is > outside the enums in Providers > -- > > Key: HBASE-21247 > URL: https://issues.apache.org/jira/browse/HBASE-21247 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0 > > Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, > 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, > 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt > > > Currently all the WAL Providers acceptable to hbase are specified in > Providers enum of WALFactory. > This restricts the ability for additional WAL Providers to be supplied - by > class name. > This issue fixes the bug by allowing the specification of new WAL Provider > class name using the config "hbase.wal.provider". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
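The fix described in HBASE-21247 amounts to an enum-first, class-name-fallback lookup. A hedged sketch (the enum values here are stand-ins, not HBase's actual Providers entries):

```java
// Illustrative resolver; not WALFactory's real code.
public class ProviderResolver {
    enum Known { FILESYSTEM, ASYNCFS }   // stand-ins for the Providers enum

    // A known enum name wins; otherwise the configured value is treated as a
    // fully qualified class name, with a default when neither resolves.
    static String resolve(String value, String defaultClassName) {
        for (Known k : Known.values()) {
            if (k.name().equalsIgnoreCase(value)) return "builtin:" + k.name();
        }
        try {
            return Class.forName(value).getName();
        } catch (ClassNotFoundException e) {
            return defaultClassName;
        }
    }
}
```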
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v6.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
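The "two pass" cleaner idea explored in the attached patches can be sketched as follows (an illustration of the concept, not the patch itself): a file only becomes deletable after it has been found unreferenced in two consecutive rounds, closing the window in which an in-progress snapshot completes between the cache refresh and the cleaner's check.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative two-pass filter for an archive cleaner.
public class TwoPassCleaner {
    private Set<String> previousRound = new HashSet<>();

    // Only files unreferenced in BOTH the previous and current round are
    // returned as deletable; everything else waits one more round.
    Set<String> filterDeletable(Set<String> unreferencedNow) {
        Set<String> deletable = new HashSet<>(unreferencedNow);
        deletable.retainAll(previousRound);
        previousRound = unreferencedNow;   // remember for the next round
        return deletable;
    }
}
```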
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: (was: two-pass-cleaner.v6.txt) > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675949#comment-16675949 ] Duo Zhang commented on HBASE-21438: --- +1. > TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible > -- > > Key: HBASE-21438 > URL: https://issues.apache.org/jira/browse/HBASE-21438 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Attachments: 21438.v1.txt > > > From > https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/ > : > {code} > Mon Nov 05 04:52:13 UTC 2018, > RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, > org.apache.hadoop.hbase.procedure2.BadProcedureException: > org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class > org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and > have an empty constructor > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162) > at > org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20952) Re-visit the WAL API
[ https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675944#comment-16675944 ] Hudson commented on HBASE-20952: Results for branch HBASE-20952 [build #40 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/40//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Re-visit the WAL API > > > Key: HBASE-20952 > URL: https://issues.apache.org/jira/browse/HBASE-20952 > Project: HBase > Issue Type: Improvement > Components: wal >Reporter: Josh Elser >Priority: Major > Attachments: 20952.v1.txt > > > Take a step back from the current WAL implementations and think about what an > HBase WAL API should look like. What are the primitive calls that we require > to guarantee durability of writes with a high degree of performance? > The API needs to take the current implementations into consideration. We > should also have a mind for what is happening in the Ratis LogService (but > the LogService should not dictate what HBase's WAL API looks like RATIS-272). > Other "systems" inside of HBase that use WALs are replication and > backup&restore. Replication has the use-case for "tail"'ing the WAL which we > should provide via our new API. 
B&R doesn't do anything fancy (IIRC). We > should make sure all consumers are generally going to be OK with the API we > create. > The API may be "OK" (or OK in a part). We need to also consider other methods > which were "bolted" on such as {{AbstractFSWAL}} and > {{WALFileLengthProvider}}. Other corners of "WAL use" (like the > {{WALSplitter}} should also be looked at to use WAL-APIs only). > We also need to make sure that adequate interface audience and stability > annotations are chosen. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675900#comment-16675900 ] Ted Yu edited comment on HBASE-21387 at 11/6/18 12:47 AM: -- In two-pass-cleaner.v6.txt , the reference to previous round is changed to Set. The length of file is needed by HFileCleaner was (Author: yuzhih...@gmail.com): In two-pass-cleaner.v5.txt , the reference to previous round is changed to Set. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. 
> Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v6.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: (was: two-pass-cleaner.v5.txt) > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted
[ https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Singhal updated HBASE-21440: -- Description: When the server crashes, it's SCP checks if there is already a procedure assigning the region on this crashed server. If we found one, SCP will just interrupt the already running AssignProcedure by calling remoteCallFailed which internally just changes the region node state to OFFLINE and send the procedure back with transition queue state for assignment with a new plan. But, due to the race condition between the calling of the remoteCallFailed and current state of the already running assign procedure(REGION_TRANSITION_FINISH: where the region is already opened), it is possible that assign procedure goes ahead in updating the regionStateNode to OPEN on a crashed server. As SCP had already skipped this region for assignment as it was relying on existing assign procedure to do the right thing, this whole confusion leads region to a not accessible state. was: When the server crashes and it's SCP checks if there is already a procedure assigning the region on this crashed server. If we found one, SCP will just interrupt the already running AssignProcedure by calling remoteCallFailed which just changes the region node state to OFFLINE and send the procedure back with transition queue state for assignment with a new plan. But, due to the race condition between the calling of the remoteCallFailed and current state of the already running assign procedure(REGION_TRANSITION_FINISH: where the region is already opened), it is possible that assign procedure goes ahead in updating the regionStateNode to OPEN on a crashed server. As SCP had already skipped this region for assignment as it was relying on existing assign procedure to do the right thing, this whole confusion leads region to a not accessible state. 
> Assign procedure on the crashed server is not properly interrupted > -- > > Key: HBASE-21440 > URL: https://issues.apache.org/jira/browse/HBASE-21440 > Project: HBase > Issue Type: Bug >Reporter: Ankit Singhal >Assignee: Ankit Singhal >Priority: Major > Fix For: 2.0.2 > > > When the server crashes, its SCP checks if there is already a procedure > assigning the region on this crashed server. If it finds one, the SCP will just > interrupt the already running AssignProcedure by calling remoteCallFailed, > which internally just changes the region node state to OFFLINE and sends the > procedure back, with transition queue state, for assignment with a new plan. > But, due to the race between the call to remoteCallFailed and the current > state of the already running assign > procedure (REGION_TRANSITION_FINISH: where the region is already opened), it > is possible that the assign procedure goes ahead and updates the regionStateNode > to OPEN on a crashed server. > As the SCP had already skipped this region for assignment, relying on the > existing assign procedure to do the right thing, this confusion leaves the > region in an inaccessible state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
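The race becomes impossible if the OPEN transition is a compare-and-set rather than a blind write: a late "open finished" from the AssignProcedure then cannot clobber the OFFLINE state installed by the SCP. A minimal sketch (not HBase's actual RegionStateNode):

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative CAS-based region state; names are assumptions.
public class RegionStateNode {
    enum State { OPENING, OPEN, OFFLINE }
    private final AtomicReference<State> state = new AtomicReference<>(State.OPENING);

    // Succeeds only if the region is still OPENING; fails if the SCP's
    // remoteCallFailed already moved the node to OFFLINE.
    boolean transitionToOpen() { return state.compareAndSet(State.OPENING, State.OPEN); }
    void markOffline()         { state.set(State.OFFLINE); }
    State get()                { return state.get(); }
}
```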
[jira] [Created] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted
Ankit Singhal created HBASE-21440: - Summary: Assign procedure on the crashed server is not properly interrupted Key: HBASE-21440 URL: https://issues.apache.org/jira/browse/HBASE-21440 Project: HBase Issue Type: Bug Reporter: Ankit Singhal Assignee: Ankit Singhal Fix For: 2.0.2 When a server crashes, its SCP checks whether there is already a procedure assigning the region on the crashed server. If it finds one, the SCP interrupts the already running AssignProcedure by calling remoteCallFailed, which just changes the region node state to OFFLINE and sends the procedure back through the transition queue for assignment with a new plan. But due to the race between the remoteCallFailed call and the current state of the already running assign procedure (REGION_TRANSITION_FINISH, where the region is already opened), it is possible that the assign procedure goes ahead and updates the regionStateNode to OPEN on a crashed server. As the SCP had already skipped this region for assignment, relying on the existing assign procedure to do the right thing, this confusion leaves the region in an inaccessible state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
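The race described above can be pictured in a few lines of plain Java. This is only an illustrative sketch, not HBase's actual code: the class `AssignRaceSketch`, the `RegionStateNode`/`remoteCallFailed`/`finishGuarded` names, and in particular the dead-server re-check used as the "fix" are all assumptions made for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the HBASE-21440 race: an SCP interrupts an assign by pushing
// the region node back to OFFLINE, but an unguarded finish step can still
// mark the node OPEN on the crashed server. None of these names are HBase's
// real API; the dead-server guard is an illustrative assumption.
class AssignRaceSketch {
    enum State { OFFLINE, OPENING, OPEN }

    static class RegionStateNode {
        volatile State state = State.OPENING;
        volatile String server;
        RegionStateNode(String server) { this.server = server; }
    }

    // Servers the master currently considers dead (filled in by the SCP).
    static final Set<String> deadServers = ConcurrentHashMap.newKeySet();

    // What remoteCallFailed effectively does: push the node back to OFFLINE.
    static void remoteCallFailed(RegionStateNode node) {
        node.state = State.OFFLINE;
    }

    // Unguarded finish step: blindly marks OPEN, losing the SCP interrupt.
    static void finishUnguarded(RegionStateNode node) {
        node.state = State.OPEN;
    }

    // Guarded finish step: refuse to mark OPEN on a server known to be dead.
    static boolean finishGuarded(RegionStateNode node) {
        if (deadServers.contains(node.server)) {
            return false; // leave the node for the SCP to reassign
        }
        node.state = State.OPEN;
        return true;
    }

    public static void main(String[] args) {
        RegionStateNode node = new RegionStateNode("rs1");
        deadServers.add("rs1");   // rs1 crashed; its SCP starts
        remoteCallFailed(node);   // SCP interrupts the assign
        finishUnguarded(node);    // race: the assign still marks OPEN
        System.out.println("unguarded state: " + node.state);

        RegionStateNode node2 = new RegionStateNode("rs1");
        remoteCallFailed(node2);
        System.out.println("guarded opened: " + finishGuarded(node2)
                + ", state: " + node2.state);
    }
}
```

Run standalone, the unguarded path ends with the node OPEN on a dead server while the guarded path leaves it OFFLINE for reassignment.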
[jira] [Commented] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT
[ https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675912#comment-16675912 ] Jingyun Tian commented on HBASE-21437: -- [~allan163] Thanks for your comment. My initial idea was to remove the procedure from the delay queue and add a bypass condition check when executing a procedure. Your idea avoids modifying the PE code. Let me try this out. > Bypassed procedure throw IllegalArgumentException when its state is > WAITING_TIMEOUT > --- > > Key: HBASE-21437 > URL: https://issues.apache.org/jira/browse/HBASE-21437 > Project: HBase > Issue Type: Bug >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Major > > {code} > 2018-11-05,18:25:52,735 WARN > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating > UNNATURALLY null > java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, > state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, > bypass=true; TransitRegionStateProcedure table=test_fail > over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN > at > org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948) > 2018-11-05,18:25:52,736 TRACE > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated. > {code} > Since when we bypass a WAITING_TIMEOUT procedure and resubmit it, its state > is still WAITING_TIMEOUT; when the executor runs this procedure, it throws an > exception and the worker terminates. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
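The failure mode and the kind of guard being discussed can be sketched as follows. This is a toy model, not the real ProcedureExecutor: the class and method names, and the specific choice of normalizing a bypassed procedure's state before execution, are illustrative assumptions rather than the actual patch.

```java
// Toy model of HBASE-21437: a bypassed procedure resubmitted while still in
// WAITING_TIMEOUT trips a NOT RUNNABLE precondition and kills the worker.
// The "tolerant" variant sketches one possible guard: treat a bypassed,
// non-runnable procedure as runnable instead of failing. Names are
// hypothetical, not HBase's real API.
class BypassGuardSketch {
    enum ProcState { RUNNABLE, WAITING_TIMEOUT, SUCCESS }

    static class Proc {
        ProcState state;
        boolean bypass;
        Proc(ProcState state, boolean bypass) {
            this.state = state;
            this.bypass = bypass;
        }
    }

    // Old behavior: a strict precondition that terminates the worker thread.
    static void execStrict(Proc p) {
        if (p.state != ProcState.RUNNABLE) {
            throw new IllegalArgumentException("NOT RUNNABLE! state=" + p.state);
        }
        p.state = ProcState.SUCCESS;
    }

    // Sketched guard: a bypassed procedure in a non-runnable state is
    // normalized to RUNNABLE and executed instead of tripping the check.
    static boolean execTolerant(Proc p) {
        if (p.state != ProcState.RUNNABLE) {
            if (!p.bypass) {
                throw new IllegalArgumentException("NOT RUNNABLE! state=" + p.state);
            }
            p.state = ProcState.RUNNABLE; // resubmitted bypassed proc: normalize
        }
        p.state = ProcState.SUCCESS;
        return true;
    }
}
```

With the strict check, a `bypass=true` procedure stuck in `WAITING_TIMEOUT` throws exactly the `IllegalArgumentException` shown in the log above; the tolerant variant lets it complete.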
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675900#comment-16675900 ] Ted Yu commented on HBASE-21387: In two-pass-cleaner.v5.txt , the reference to previous round is changed to Set. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v5.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). 
> Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v5.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v5.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21432) [hbase-connectors] Add Apache Yetus integration for hbase-connectors repository
[ https://issues.apache.org/jira/browse/HBASE-21432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675886#comment-16675886 ] Sean Busbey commented on HBASE-21432: - How about just PRs? Right now yetus has no notion of multiple repositories corresponding to a single JIRA tracker, so if we want patches on JIRA to work we'll need to add our own logic for it. By comparison, a job to periodically check for open PRs and then test them should be much simpler. > [hbase-connectors] Add Apache Yetus integration for hbase-connectors > repository > > > Key: HBASE-21432 > URL: https://issues.apache.org/jira/browse/HBASE-21432 > Project: HBase > Issue Type: Task > Components: build, hbase-connectors >Affects Versions: connector-1.0.0 >Reporter: Peter Somogyi >Priority: Major > > Add automated testing for pull requests and patch files created for > hbase-connectors repository. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21439) StochasticLoadBalancer RegionLoads aren’t being used in RegionLoad cost functions
Ben Lau created HBASE-21439: --- Summary: StochasticLoadBalancer RegionLoads aren’t being used in RegionLoad cost functions Key: HBASE-21439 URL: https://issues.apache.org/jira/browse/HBASE-21439 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 2.0.2, 1.3.2.1 Reporter: Ben Lau Assignee: Ben Lau In StochasticLoadBalancer.updateRegionLoad() the region loads are being put into the map with Bytes.toString(regionName). First, this is a problem because Bytes.toString() assumes that the byte array is a UTF8 encoded String but there is no guarantee that regionName bytes are legal UTF8. Secondly, in BaseLoadBalancer.registerRegion, we are reading the region loads out of the load map not using Bytes.toString() but using region.getRegionNameAsString() and region.getEncodedName(). So the load balancer will not see or use any of the cluster's RegionLoad history. There are two primary ways to solve this issue, assuming we want to stay with String keys for the load map (seems reasonable to aid debugging). We can either fix updateRegionLoad to store the regionName as a string properly or we can update both the reader & writer to use a new common valid String representation. Will post a patch assuming we want to pursue the original intention, i.e. store the region name as a proper String for the load map key, but I'm open to fixing this a different way. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
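The first problem — decoding arbitrary region-name bytes as UTF-8 — can be demonstrated with a few bytes of plain Java, independent of HBase. The byte values below are arbitrary examples of illegal UTF-8 sequences; the point is that Java's decoder substitutes U+FFFD for malformed input, so the decode is lossy and distinct byte arrays can collapse to the same String key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Why keying a map with new String(bytes, UTF_8) on arbitrary bytes is
// unsafe: invalid UTF-8 sequences are replaced with U+FFFD, so the original
// bytes cannot be recovered and different byte arrays can produce the same
// String key.
class Utf8KeySketch {
    public static void main(String[] args) {
        byte[] nameA = {(byte) 0xC3, (byte) 0x28, 0x41}; // 0xC3 lead byte with no continuation
        byte[] nameB = {(byte) 0xE2, (byte) 0x28, 0x41}; // a different invalid sequence

        String keyA = new String(nameA, StandardCharsets.UTF_8);
        String keyB = new String(nameB, StandardCharsets.UTF_8);

        // Decoding replaced the bad byte with U+FFFD, so it is not reversible...
        byte[] roundTrip = keyA.getBytes(StandardCharsets.UTF_8);
        System.out.println("lossy:   " + !Arrays.equals(nameA, roundTrip));
        // ...and the two distinct names now share one map key.
        System.out.println("collide: " + keyA.equals(keyB));
    }
}
```

Both lines print `true`: two different region-name byte arrays would land on the same load-map entry, and neither matches what the readers compute from `getRegionNameAsString()`.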
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675871#comment-16675871 ] Josh Elser commented on HBASE-21387: {quote} bq.where a snapshot was "orphaned" and prevent file cleaning from happening I think by "orphaned" you are talking about not just two iterations for cleaner chore but many iterations. In that case, the situation in the current code base would prevent cleaning hfiles referenced, as well. {quote} Yes, that's the situation I mean. Perhaps we shouldn't be overly concerned about this one. I certainly think it is the cleaner approach (and something we can more easily reason about). > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). 
> There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675869#comment-16675869 ] Josh Elser commented on HBASE-21246: [~stack], [~reidchan], see Ted's attached images !replication-src-creates-wal-reader.jpg! !wal-splitter-reader.jpg! !wal-splitter-writer.jpg! for ReplicationSource reading, WALSplitter reading, and WALSplitter writing, respectively. I'm worried we've swung too far back in the other direction (too simple), but I also am at a loss of what to suggest Ted provide, too. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, > 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, > 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, > 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, > wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, > wal-splitter-writer.jpg > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
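From the quoted description, the interface boils down to something like the following minimal sketch. Only `getName()` is taken from the issue text; the two implementing classes and their names are purely illustrative assumptions, not part of the proposed patch.

```java
// Minimal sketch of the WALIdentity idea: decouple "which WAL is this" from
// the filesystem. getName() comes from the issue description; the concrete
// implementations below are hypothetical examples.
interface WALIdentity {
    /**
     * The file name when the WAL lives on a distributed filesystem, or the
     * stream name when the WAL is backed by a log stream.
     */
    String getName();
}

// Hypothetical filesystem-backed identity.
class FsWALIdentity implements WALIdentity {
    private final String path;
    FsWALIdentity(String path) { this.path = path; }
    @Override public String getName() { return path; }
}

// Hypothetical stream-backed identity.
class StreamWALIdentity implements WALIdentity {
    private final String streamName;
    StreamWALIdentity(String streamName) { this.streamName = streamName; }
    @Override public String getName() { return streamName; }
}
```

Callers such as a replication source or WAL splitter would then operate on `WALIdentity` values instead of filesystem `Path`s, which is the decoupling the attached diagrams illustrate.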
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675867#comment-16675867 ] Ted Yu commented on HBASE-21387: bq. holding onto file names in memory We don't need to continue referencing FileStatus from the previous pass. Path (or String) for each file would be sufficient. bq. where a snapshot was "orphaned" and prevent file cleaning from happening I think by "orphaned" you are talking about not just two iterations for cleaner chore but many iterations. In that case, the situation in the current code base would prevent cleaning hfiles referenced, as well. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). 
> There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21436) Getting OOM frequently if hold many regions
[ https://issues.apache.org/jira/browse/HBASE-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675865#comment-16675865 ] Vladimir Rodionov commented on HBASE-21436: --- Hi, I do not see anything worth fixing here. First of all, 2MB per Region is the overhead a customer must accept and live with. Second of all, 3K regions per RS is too high for a 4GB heap size - this goes against best practices and recommendations. > Getting OOM frequently if hold many regions > > > Key: HBASE-21436 > URL: https://issues.apache.org/jira/browse/HBASE-21436 > Project: HBase > Issue Type: Improvement > Components: regionserver >Affects Versions: 1.4.8, 2.1.1, 2.0.2 >Reporter: Zephyr Guo >Priority: Major > Attachments: HBASE-21436-UT.patch > > > Recently, some feedback reached me from a customer who complained about > NotServingRegionException thrown at intervals. I examined his cluster and > found quite a lot of OOM logs there, but throughput was at quite a low level. > In this customer's case, each RS has 3k regions and a heap size of > 4G. I dumped the heap when the OOM took place, and found that a lot of Chunk objects > (as many as 1700) were there. > Eventually, piecing all this evidence together, I came to the conclusion > that: > * The root cause is that global flush is triggered by the size of all memstores, > rather than the size of all chunks. > * A chunk is always allocated for each region, even if we only write a little data > to the region. > And in this case, a total of 3.4G of memory was consumed by 1700 chunks, > although throughput was very low. > Although 3K regions is too many for an RS with 4G of memory, it is still wise to > improve RS stability in such a scenario (in fact, most customers buy a small-sized > HBase on the cloud side). > > I provide a patch (containing only a UT) to reproduce this case (just send a > batch). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
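The arithmetic behind the report can be checked in a few lines: with one chunk allocated per actively written region, the chunk count — not the memstore data size — sets the memory floor. This is a back-of-envelope sketch using only the numbers quoted in the issue (2MB chunks, 1700 chunks seen in the heap dump, a 4GB heap); the class and method names are made up for the example.

```java
// Back-of-envelope check of the HBASE-21436 numbers: 1700 live 2 MB chunks
// consume ~3.3 GiB of a 4 GiB heap regardless of how little data each
// region's memstore actually holds. Illustrative sketch, not HBase code.
class ChunkOverheadSketch {
    static final long CHUNK_SIZE = 2L * 1024 * 1024; // 2 MB chunk, as quoted

    /** Memory floor imposed by one chunk per actively written region. */
    static long chunkFloorBytes(int chunksAllocated) {
        return chunksAllocated * CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long floor = chunkFloorBytes(1700);       // chunks seen in the heap dump
        long heap = 4L * 1024 * 1024 * 1024;      // 4 GB region server heap
        System.out.printf("chunk floor = %d MB of %d MB heap%n",
                floor >> 20, heap >> 20);
    }
}
```

This prints `chunk floor = 3400 MB of 4096 MB heap` — matching the ~3.4G the reporter found consumed by chunks while a memstore-size-based global flush trigger never fires.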
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675857#comment-16675857 ] Josh Elser commented on HBASE-21387: {quote} Please take a look at two-pass-cleaner.v4.txt where cleaner chore keeps track of the files deemed cleanable from previous iteration. Only files deemed cleanable from previous and current iterations would be deleted. {quote} I think that would help, but I'm not sure if that's the best way to go about it. We're holding onto file names in memory (which could get big) with your two-pass-cleaner.v4 patch. If we made a change to not clean files while there is an in-progress snapshot, we would be in trouble if we ever got into a situation where a snapshot was "orphaned" and prevent file cleaning from happening. I'm not sure which I think is better (or if there's even still something better out there..) > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. 
> {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675805#comment-16675805 ] Ted Yu commented on HBASE-21387: Solving the in progress snapshot race condition is tricky. Please take a look at two-pass-cleaner.v4.txt where cleaner chore keeps track of the files deemed cleanable from previous iteration. Only files deemed cleanable from previous and current iterations would be deleted. This is a bigger change. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. 
> Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
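The two-pass scheme described in the comment above can be sketched in a few lines of plain Java: a file is deleted only when it was deemed unreferenced by both the previous and the current cleaner iteration, so a file that briefly looks unreferenced while a snapshot is completing survives one extra round. This is an illustration of the idea, not the attached two-pass-cleaner.v4.txt patch; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the two-pass cleaner idea: remember what looked cleanable last
// round and only delete the intersection with what looks cleanable now.
// Illustrative only; not the code in the attached patch.
class TwoPassCleanerSketch {
    private Set<String> previouslyCleanable = new HashSet<>();

    /** Returns the files safe to delete this round; defers the rest. */
    Set<String> filesToDelete(Set<String> cleanableNow) {
        Set<String> deletable = new HashSet<>(cleanableNow);
        deletable.retainAll(previouslyCleanable); // must look cleanable twice
        previouslyCleanable = new HashSet<>(cleanableNow);
        return deletable;
    }
}
```

On the first round nothing is deleted; a file flagged in two consecutive rounds is. The trade-off discussed in the later comments follows directly: the cleaner must hold the previous round's file names (paths, not FileStatus objects) in memory between iterations.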
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v4.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Status: Open (was: Patch Available) > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. 
So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675799#comment-16675799 ] Josh Elser commented on HBASE-21387: Ok, cool. Thanks for confirming, Ted. I can appreciate where your fix is coming from, but I'm still not convinced it is a complete fix. In the current implementation of SnapshotFileCache#getSnapshotsInProgress(..), we acquire the lock on the in-progress snapshot before listing the files for it. That means the call to getSnapshotsInProgress(..) will block until the operation is complete (both for online and offline snapshot generation). So, we should never have a case where we read a snapshot's files while it's in the process of being written. However, it does seem like there could be a case where a snapshot we knew to be in-progress finishes before the SnapshotFileCleaner "wakes up" (e.g. after TakeSnapshotHandler.completeSnapshot(..) is invoked). We may miss this newly created snapshot (which was in-progress when we started SnapshotFileCache#getUnreferencedFiles but is complete when we finish it). The bigger problem is that there is still the potential for the SnapshotFileCleaner to go to sleep during {{getSnapshotsInProgress}} (in the current code or your patch) and miss a newly started snapshot. I do not think we can safely identify the files to retain for snapshots without precluding the submission of new snapshots. I think your v8 patch improves this situation but does not completely solve it. WDYT? 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675762#comment-16675762 ] Ted Yu commented on HBASE-21381: liubang: 2.7.x and 2.8.y should be added as supported hadoop releases since they were not affected by HADOOP-11794 > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task >Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to 
transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. > This issue is to document the hadoop versions which contain HADOOP-15850 so > that users of the Backup and Restore feature know which hadoop versions they > can use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
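The failed concatenation in the log above can be understood with a simplified model of the consistency check: chunks being concatenated must come from the same source file and line up end-to-start, which independent HFiles never do. The class below, its fields, and its message text are a hypothetical sketch, not the actual CopyCommitter code:

```java
// Toy model: verify that a sequence of "chunks" really are contiguous pieces
// of one source file. Two unrelated HFiles (like the 5100- and 5142-byte
// entries in the log above) fail the check, mirroring the reported IOException.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ChunkConcatCheck {
    public static final class Chunk {
        final String sourceFile;
        final long offset;
        final long length;
        public Chunk(String sourceFile, long offset, long length) {
            this.sourceFile = sourceFile;
            this.offset = offset;
            this.length = length;
        }
        @Override public String toString() {
            return sourceFile + " length = " + length;
        }
    }

    // Chunks must share a source file and be contiguous (end of one = start of next).
    public static void verifyContiguous(List<Chunk> chunks) throws IOException {
        for (int i = 1; i < chunks.size(); i++) {
            Chunk prev = chunks.get(i - 1);
            Chunk cur = chunks.get(i);
            if (!cur.sourceFile.equals(prev.sourceFile)
                    || cur.offset != prev.offset + prev.length) {
                throw new IOException("Inconsistent sequence file: current chunk file "
                        + cur + " doesnt match prior entry " + prev);
            }
        }
    }

    public static void main(String[] args) {
        // Independent files, not chunks of one file: the check rejects them.
        List<Chunk> independent = Arrays.asList(
                new Chunk("a7599081..._SeqId_205_", 0, 5100),
                new Chunk("394e6d39..._SeqId_205_", 0, 5142));
        try {
            verifyContiguous(independent);
        } catch (IOException expected) {
            System.out.println("concat rejected: " + expected.getMessage());
        }
    }
}
```

HADOOP-15850 makes the real committer skip this concatenation when the copied files were never split into chunks in the first place.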
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675738#comment-16675738 ] Ted Yu commented on HBASE-21387: Thanks for giving the timeline, Josh. The scenario you described is the race condition I am solving with patch v8. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675711#comment-16675711 ] Josh Elser commented on HBASE-21387: Looking through v8... I started off still struggling to understand the race condition, but I think I see it now. At time T0, we are checking whether F1 is referenced. At time T1, there is a snapshot S1 in progress that references a file F1. refreshCache() is called, but no completed snapshot references F1. At T2, the snapshot S1, which references F1, completes. At T3, we check in-progress snapshots and S1 is not included. Thus, F1 is marked as unreferenced even though S1 references it. Is this the issue you are describing, Ted? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675645#comment-16675645 ] Hadoop QA commented on HBASE-21438: --- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s{color} | {color:blue} Docker mode activated. {color} | | {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 2s{color} | {color:blue} The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange} 0m 0s{color} | {color:orange} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 42s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 4s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 24s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 1s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 9m 49s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 20s{color} | {color:green} hbase-procedure in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 34m 10s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b | | JIRA Issue | HBASE-21438 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946949/21438.v1.txt | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 090dcd9ba515 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | master / 7395ffac44 | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC3 |
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675576#comment-16675576 ] Wei-Chiu Chuang commented on HBASE-21381: - Hadoop 2.7.x and 2.8.x should also work with B&R, I suppose? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21438: --- Attachment: 21438.v1.txt > TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible > -- > > Key: HBASE-21438 > URL: https://issues.apache.org/jira/browse/HBASE-21438 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Attachments: 21438.v1.txt > > > From > https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/ > : > {code} > Mon Nov 05 04:52:13 UTC 2018, > RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, > org.apache.hadoop.hbase.procedure2.BadProcedureException: > org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class > org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and > have an empty constructor > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162) > at > org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
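The "must be accessible and have an empty constructor" requirement in the stack trace comes from reflective instantiation: looking up a no-arg constructor fails for a class that lacks one. A minimal sketch of that style of validation follows; the `NoDefaultCtor` class is a made-up stand-in, not HBase's FailedProcedure:

```java
// Sketch of validateClass-style checking: a class can only be re-instantiated
// reflectively if it exposes an accessible no-argument constructor.
public class ValidateClassDemo {
    // Stand-in for a class with no empty constructor (reflection rejects it).
    public static class NoDefaultCtor {
        NoDefaultCtor(int x) { }
    }

    public static boolean instantiable(Class<?> clazz) {
        try {
            // Throws NoSuchMethodException when there is no no-arg constructor,
            // or IllegalAccessException when the constructor is not accessible.
            clazz.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiable(java.util.ArrayList.class)); // true
        System.out.println(instantiable(NoDefaultCtor.class));       // false
    }
}
```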
[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21438: --- Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675573#comment-16675573 ] Ted Yu commented on HBASE-21438: Ran TestAdmin2 with the patch; it passed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
Ted Yu created HBASE-21438: -- Summary: TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible Key: HBASE-21438 URL: https://issues.apache.org/jira/browse/HBASE-21438 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675530#comment-16675530 ] Ted Yu commented on HBASE-21381: lgtm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675528#comment-16675528 ] Ted Yu edited comment on HBASE-21246 at 11/5/18 6:20 PM: - In patch v24, I dropped the static methods from WALFactory - they are used in test code. I also removed the reference to AbstractFSWALProvider in WALFactory since the Reader creation is done by the provider. was (Author: yuzhih...@gmail.com): In patch v24, I dropped the static methods from WALFactory - they are used in test code. I also removed the reference to AbstractFSWALProvider since the Reader creation is done by the provider. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, > 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, > 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, > 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, > wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, > wal-splitter-writer.jpg > > > We are introducing the WALIdentity interface so that the WAL representation can > be decoupled from the distributed filesystem. > The interface provides a getName method whose return value can represent a > filename in a distributed filesystem environment or the name of the stream > when the WAL is backed by a log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
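Based only on the description above, a WALIdentity abstraction could be sketched as follows. The two implementations are illustrative guesses, not the API from the attached patches:

```java
// Sketch: one interface, two possible backings. Callers depend only on
// getName(), so the WAL can be a filesystem path or a log-stream name.
public class WALIdentityDemo {
    public interface WALIdentity {
        String getName();
    }

    // Filesystem-backed WAL: identity is the file name (hypothetical class).
    public static class FsWALIdentity implements WALIdentity {
        private final String path;
        public FsWALIdentity(String path) { this.path = path; }
        @Override public String getName() { return path; }
    }

    // Stream-backed WAL: identity is the stream name (hypothetical class).
    public static class StreamWALIdentity implements WALIdentity {
        private final String stream;
        public StreamWALIdentity(String stream) { this.stream = stream; }
        @Override public String getName() { return stream; }
    }

    public static void main(String[] args) {
        WALIdentity fs = new FsWALIdentity("hdfs://nn/hbase/WALs/rs1/rs1.1541393533029");
        WALIdentity stream = new StreamWALIdentity("wal-stream-0042");
        // Replication or the WAL splitter would treat both uniformly.
        System.out.println(fs.getName());
        System.out.println(stream.getName());
    }
}
```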
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675528#comment-16675528 ] Ted Yu commented on HBASE-21246: In patch v24, I dropped the static methods from WALFactory - they are used in test code. I also removed the reference to AbstractFSWALProvider since the Reader creation is done by the provider. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: 21246.24.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-19953) Avoid calling post* hook when procedure fails
[ https://issues.apache.org/jira/browse/HBASE-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675476#comment-16675476 ] Josh Elser commented on HBASE-19953: [~allan163], I understand the RPC timeout issue that you bring up, but I think the right solution is to fix how we execute postHooks. If we applied your patch now, it would introduce a regression for the original issue for which I created this change. Specifically, that issue is that when a 2.x client (which has procv2 support) submits an operation that doesn't use a blocking lock, there is a race condition between the execution of the postHook and the procedure for the operation itself. Concretely, the original problem was that an Apache Ranger-owned postDeleteNamespace hook expected that the HBase namespace would not exist when it was invoked. In reality, the namespace still existed because the DeleteNamespaceProcedure had not yet run when we used the non-blocking latch. Your patch would reintroduce this problem directly. So, there are two goals: one (that you raised) is that clients should not have to hold open an RPC for a procedure to complete. The second (that I raised here) is that we must not call a post\* hook before the actual operation is completed. I think the way to solve both of these is to push execution of the post\* hook into the procedure itself. That way, we do not execute the hook too "soon", and the client can just poll the procedure, waiting for completion (as we want). 
> Avoid calling post* hook when procedure fails > - > > Key: HBASE-19953 > URL: https://issues.apache.org/jira/browse/HBASE-19953 > Project: HBase > Issue Type: Bug > Components: master, proc-v2 >Reporter: Ramesh Mani >Assignee: Josh Elser >Priority: Critical > Fix For: 2.0.0-beta-2, 2.0.0 > > Attachments: HBASE-19952.001.branch-2.patch, > HBASE-19953.002.branch-2.patch, HBASE-19953.003.branch-2.patch, > HBASE-19953.branch-2.0.addendum.patch > > > Ramesh pointed out a case where I think we're mishandling some post\* > MasterObserver hooks. Specifically, I'm looking at the deleteNamespace case. > We synchronously execute the DeleteNamespace procedure. When the user > provides a namespace that isn't empty, the procedure does a rollback (which > is just a no-op), but this doesn't propagate an exception up to the > NonceProcedureRunnable in {{HMaster#deleteNamespace}}. It took Ramesh > pointing it out for me to see that the code executes a bit differently > than we actually expect. > I think we need to double-check our post hooks and make sure we aren't > invoking them when the procedure actually failed. cc/ [~Apache9], [~stack]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
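Josh's proposal above — running the post\* hook inside the procedure so it can never fire before the operation — can be sketched as a toy state machine. The class, the state names, and the event strings are all hypothetical, not the real Procedure-v2 types:

```java
import java.util.ArrayList;
import java.util.List;

// Toy state machine: the post* hook is the procedure's own final step, so a
// polling client can never observe the hook firing before the operation has
// completed. Names are illustrative, not the actual HBase Procedure-v2 API.
public class PostHookInProcedure {
  enum State { DELETE_NAMESPACE, CALL_POST_HOOK, DONE }

  // Drives the machine to completion and records what ran, in order.
  static List<String> execute() {
    List<String> events = new ArrayList<>();
    State state = State.DELETE_NAMESPACE;
    while (state != State.DONE) {
      switch (state) {
        case DELETE_NAMESPACE:
          events.add("namespace-deleted");   // the operation itself
          state = State.CALL_POST_HOOK;
          break;
        case CALL_POST_HOOK:
          events.add("postDeleteNamespace"); // hook fires strictly after the op
          state = State.DONE;
          break;
        default:
          break;
      }
    }
    return events;
  }
}
```

Because the hook is ordered inside the procedure's own steps, the race between the RPC handler and the procedure simply cannot occur.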
[jira] [Commented] (HBASE-21255) [acl] Refactor TablePermission into three classes (Global, Namespace, Table)
[ https://issues.apache.org/jira/browse/HBASE-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675484#comment-16675484 ] Sean Busbey commented on HBASE-21255: - the precommit failures were probably the openjdk / surefire conflict that got handled in HBASE-21417 (and it looks like they're clear now). please fix checkstyle. > [acl] Refactor TablePermission into three classes (Global, Namespace, Table) > > > Key: HBASE-21255 > URL: https://issues.apache.org/jira/browse/HBASE-21255 > Project: HBase > Issue Type: Improvement >Reporter: Reid Chan >Assignee: Reid Chan >Priority: Major > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21225.master.001.patch, > HBASE-21225.master.002.patch, HBASE-21255.master.003.patch, > HBASE-21255.master.004.patch, HBASE-21255.master.005.patch, > HBASE-21255.master.006.patch > > > A TODO in {{TablePermission.java}} > {code:java} > //TODO refactor this class > //we need to refacting this into three classes (Global, Table, Namespace) > {code} > Change Notes: > * Divide origin TablePermission into three classes GlobalPermission, > NamespacePermission, TablePermission > * New UserPermission consists of a user name and a permission in one of > [Global, Namespace, Table]Permission. > * Rename TableAuthManager to AuthManager(it is IA.P), and rename some > methods for readability. > * Make PermissionCache thread safe, and the ListMultiMap is changed to Set. > * User cache and group cache in AuthManager is combined together. > * Wire proto is kept, BC should be under guarantee. > * Fix HBASE-21390. > * Resolve a small {{TODO}} global entry should be handled differently in > AccessControlLists -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675458#comment-16675458 ] Ted Yu commented on HBASE-21246: replication-src-creates-wal-reader.jpg shows how the WAL Reader is created for the replication source: ReplicationSource calls walProvider#getWalStream, which returns a WALEntryStream, and AbstractWALEntryStream#createReader calls WALProvider#createReader. wal-splitter-reader.jpg shows how the WAL Reader is created for log splitting: WALSplitter#getReader calls WALProvider#createReader. Below WALProvider, AbstractFSWALProvider and DisabledWALProvider are shown, which implement the WALProvider interface; AsyncFSWALProvider and FSHLogProvider extend AbstractFSWALProvider. wal-splitter-writer.jpg shows how the WAL Writer is created for log splitting. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > replication-src-creates-wal-reader.jpg, wal-factory-providers.png, > wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg > > > We are introducing a WALIdentity interface so that the WAL representation can > be decoupled from the distributed filesystem. > The interface provides a getName method whose return value can represent a > filename in a distributed-filesystem environment, or the name of the stream > when the WAL is backed by a log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
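The reader-creation paths Ted describes can be compressed into a sketch in which both replication and log splitting delegate to one provider factory method. The types below are drastically simplified stand-ins for the real `WALProvider`/`WALEntryStream` interfaces:

```java
// Illustrative only: both consumers reach the reader through the same
// WALProvider#createReader factory, which is what makes the provider the
// natural seam for swapping the WAL backend. Real interfaces carry far
// more methods and state than shown here.
public class WalReaderPaths {
  interface Reader { }

  interface WALProvider {
    Reader createReader(String walName);
  }

  static class SimpleReader implements Reader {
    final String wal;
    SimpleReader(String wal) { this.wal = wal; }
  }

  static class SimpleProvider implements WALProvider {
    @Override public Reader createReader(String walName) {
      return new SimpleReader(walName);
    }
  }

  // Replication-source path: the entry stream asks the provider for a reader.
  static Reader replicationPath(WALProvider p, String wal) {
    return p.createReader(wal);
  }

  // Splitter path: WALSplitter#getReader also delegates to the provider.
  static Reader splitterPath(WALProvider p, String wal) {
    return p.createReader(wal);
  }
}
```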
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: replication-src-creates-wal-reader.jpg > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > replication-src-creates-wal-reader.jpg, wal-factory-providers.png, > wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: wal-splitter-writer.jpg wal-splitter-reader.jpg > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > replication-src-creates-wal-reader.jpg, wal-factory-providers.png, > wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on
[ https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675433#comment-16675433 ] Hudson commented on HBASE-21395: Results for branch branch-2.1 [build #580 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Abort split/merge procedure if there is a table procedure of the same table > going on > > > Key: HBASE-21395 > URL: https://issues.apache.org/jira/browse/HBASE-21395 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.0, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21395.branch-2.0.001.patch, > HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, > HBASE-21395.branch-2.0.004.patch > > > In my ITBLL runs, I often see that if a split/merge procedure and a table > procedure (like ModifyTableProcedure) happen at the same time, race conditions > between these two kinds of procedures can cause serious problems, e.g. the > split/merged parent is brought online by the table procedure, or the > split/merged region makes the whole table procedure roll back. 
> Talked with [~Apache9] offline today; this kind of problem was solved in > branch-2+ since there is a fence so that only one RTSP can run against a single > region at the same time. > To keep out of the mess in branch-2.0 and branch-2.1, I added a simple safety > fence in the split/merge procedure: if there is a table procedure going on > against the same table, then abort the split/merge procedure. Aborting the > split/merge procedure at the beginning of its execution is no big deal > compared with the mess it would otherwise cause... -- This message was sent by Atlassian JIRA (v7.6.3#76005)
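The safety fence described above could look roughly like this; `shouldAbortSplitMerge` and the `Procedure` holder are stand-ins for illustration, not branch-2.0's actual scheduler API:

```java
import java.util.List;

// Illustrative fence: before a split/merge procedure starts executing, scan
// the running procedures and bail out if any table procedure (e.g. a
// ModifyTableProcedure) targets the same table. Aborting up front is cheap
// compared to unwinding a half-done split racing a table procedure.
public class SplitMergeFence {
  static class Procedure {
    final String table;
    final boolean isTableProcedure;
    Procedure(String table, boolean isTableProcedure) {
      this.table = table;
      this.isTableProcedure = isTableProcedure;
    }
  }

  static boolean shouldAbortSplitMerge(String table, List<Procedure> running) {
    for (Procedure p : running) {
      if (p.isTableProcedure && p.table.equals(table)) {
        return true; // same-table table procedure in flight: abort the split/merge
      }
    }
    return false;
  }
}
```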
[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675434#comment-16675434 ] Hudson commented on HBASE-21423: Results for branch branch-2.1 [build #580 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/580//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch > > > We give meta table procedures higher priority, but only at the queue level. > There is a case where the meta table is closed and an AssignProcedure (or RTSP > in branch-2+) is waiting to be executed, but at the same time all the > worker threads are executing procedures that need to write to the meta table; then all > the workers get stuck retrying the meta writes, and no worker will take the > AP for meta. 
> Though we have a mechanism that detects the stuck state and adds more > ''KeepAlive'' workers to the pool to resolve it, by the time it kicks in the > executor has already been stuck for a long time. > This is a real case I encountered in ITBLL. > So, I added one 'urgent worker' to the ProcedureExecutor, which only takes meta > procedures (other workers can take meta procedures too); this resolves > this kind of stuck state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
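The 'urgent worker' idea can be sketched as a pool in which one reserved worker only ever dequeues meta procedures, while ordinary workers take whatever is at the head of the queue. The queue shape and names below are illustrative, not the real ProcedureExecutor internals:

```java
import java.util.Deque;

// Illustrative scheduling rule: the urgent worker skips past non-meta
// procedures, so a meta AssignProcedure cannot be starved even when every
// ordinary worker is blocked trying to write to the (closed) meta table.
public class UrgentWorkerDemo {
  static class Proc {
    final String name;
    final boolean isMeta;
    Proc(String name, boolean isMeta) { this.name = name; this.isMeta = isMeta; }
  }

  // Urgent worker: take the first meta procedure, or stay idle.
  static Proc pollForUrgentWorker(Deque<Proc> queue) {
    for (Proc p : queue) {
      if (p.isMeta) {
        queue.remove(p);
        return p;
      }
    }
    return null; // nothing urgent; idle rather than picking up meta-writing work
  }

  // Ordinary workers just take the head of the queue, meta or not.
  static Proc pollForOrdinaryWorker(Deque<Proc> queue) {
    return queue.poll();
  }
}
```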
[jira] [Commented] (HBASE-19953) Avoid calling post* hook when procedure fails
[ https://issues.apache.org/jira/browse/HBASE-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675435#comment-16675435 ] Josh Elser commented on HBASE-19953: {quote} After a careful think, I think we don't need to revert this (I've already changed the comment above); we only need to turn ModifyTable into an async op (which is the only sync DDL for 2.x clients now). Josh Elser, you can see HMaster.truncateTable. We also use ProcedurePrepareLatch.createLatch(2, 0) to make sure the 2.x client won't sync wait here. Uploaded an addendum to clarify my point. {quote} Thanks, Allan. I appreciate the clarification -- I felt bad lobbing a "-1" onto this as I did. Glad we can talk it out. The addendum is very helpful and I appreciate that too. I need to refresh my head around the original problem and think about the solution you have suggested. I think it makes sense to me, but I need to make sure things still work as they did in 2.0.0 :). I'll do this now before I get pulled in another direction. On that note, since this did go out in 2.0.0, can you make this change under a different Jira issue instead of as an addendum to this one? That would remove potential confusion about 2.0.3 having it but not 2.0.1 and 2.0.2. > Avoid calling post* hook when procedure fails > - > > Key: HBASE-19953 > URL: https://issues.apache.org/jira/browse/HBASE-19953 > Project: HBase > Issue Type: Bug > Components: master, proc-v2 >Reporter: Ramesh Mani >Assignee: Josh Elser >Priority: Critical > Fix For: 2.0.0-beta-2, 2.0.0 > > Attachments: HBASE-19952.001.branch-2.patch, > HBASE-19953.002.branch-2.patch, HBASE-19953.003.branch-2.patch, > HBASE-19953.branch-2.0.addendum.patch > > > Ramesh pointed out a case where I think we're mishandling some post\* > MasterObserver hooks. Specifically, I'm looking at the deleteNamespace case. > We synchronously execute the DeleteNamespace procedure. 
When the user > provides a namespace that isn't empty, the procedure does a rollback (which > is just a no-op), but this doesn't propagate an exception up to the > NonceProcedureRunnable in {{HMaster#deleteNamespace}}. It took Ramesh > pointing it out for me to see that the code executes a bit differently > than we actually expect. > I think we need to double-check our post hooks and make sure we aren't > invoking them when the procedure actually failed. cc/ [~Apache9], [~stack]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
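The `ProcedurePrepareLatch.createLatch(2, 0)` call quoted above gates synchronous waiting on the client's version: older clients get a latch and block, 2.x clients get none and poll the procedure instead. A toy version check capturing that idea (the comparison logic here is an assumption about the mechanism, not the actual implementation):

```java
// Illustrative version gate: a client older than the threshold version must
// block on a latch (it cannot poll procedures); newer clients skip the latch.
// This is a sketch of the idea, not ProcedurePrepareLatch's real code.
public class LatchDemo {
  static boolean needsSyncWait(int clientMajor, int clientMinor,
                               int thresholdMajor, int thresholdMinor) {
    return clientMajor < thresholdMajor
        || (clientMajor == thresholdMajor && clientMinor < thresholdMinor);
  }
}
```

With a threshold of (2, 0), a 1.4 client would wait synchronously while a 2.0 client would not.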
[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675431#comment-16675431 ] Hadoop QA commented on HBASE-21421: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} branch-2.0 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 2s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 46s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 55s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 11s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} branch-2.0 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 10s{color} | {color:red} hbase-server: The patch generated 1 new + 25 unchanged - 0 fixed = 26 total (was 25) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 58s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 8m 14s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green}118m 21s{color} | {color:green} hbase-server in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}152m 30s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 | | JIRA Issue | HBASE-21421 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946919/HBASE-21421.branch-2.0.004.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 21b01692edfa 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh | | git revision | branch-2.0 / d4233f207d | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC3 | | checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/14958/artifact/patchprocess/diff-checkstyle-hbase-server.txt | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14958/testReport/ | | Max. process+thread count | 3957 (vs. ulimit of 1) | | modules | C: hba
[jira] [Updated] (HBASE-21247) Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21247: --- Resolution: Fixed Status: Resolved (was: Patch Available) Thanks for the review, Sean and Josh. > Custom WAL Provider cannot be specified by configuration whose value is > outside the enums in Providers > -- > > Key: HBASE-21247 > URL: https://issues.apache.org/jira/browse/HBASE-21247 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0 > > Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, > 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, > 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt > > > Currently all the WAL Providers acceptable to hbase are specified in > Providers enum of WALFactory. > This restricts the ability for additional WAL Providers to be supplied - by > class name. > This issue fixes the bug by allowing the specification of new WAL Provider > class name using the config "hbase.wal.provider". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
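The fix's behaviour — fall back from the Providers enum to an arbitrary class name — can be sketched as below; the enum values and string-based result are simplifications of what `WALFactory` actually does:

```java
// Illustrative resolution of "hbase.wal.provider": a value matching a known
// enum name selects a built-in provider; anything else is treated as a
// fully-qualified class name of a custom WALProvider. The enum values shown
// are a simplified subset, not the exact WALFactory.Providers list.
public class ProviderLookupDemo {
  enum Providers { defaultProvider, filesystem, asyncfs }

  static String resolve(String configured) {
    for (Providers p : Providers.values()) {
      if (p.name().equals(configured)) {
        return "builtin:" + p.name();
      }
    }
    return "class:" + configured; // custom provider loaded by class name
  }
}
```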
[jira] [Commented] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on
[ https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675398#comment-16675398 ] Hudson commented on HBASE-21395: Results for branch branch-2.0 [build #1060 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > Abort split/merge procedure if there is a table procedure of the same table > going on > > > Key: HBASE-21395 > URL: https://issues.apache.org/jira/browse/HBASE-21395 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.0, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21395.branch-2.0.001.patch, > HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, > HBASE-21395.branch-2.0.004.patch > > > In my ITBLL runs, I often see that if a split/merge procedure and a table > procedure (like ModifyTableProcedure) happen at the same time, race conditions > between these two kinds of procedures can cause serious problems, e.g. the > split/merged parent is brought online by the table procedure, or the > split/merged region makes the whole table procedure roll back. 
> Talked with [~Apache9] offline today; this kind of problem was solved in > branch-2+ since there is a fence so that only one RTSP can run against a single > region at the same time. > To keep out of the mess in branch-2.0 and branch-2.1, I added a simple safety > fence in the split/merge procedure: if there is a table procedure going on > against the same table, then abort the split/merge procedure. Aborting the > split/merge procedure at the beginning of its execution is no big deal > compared with the mess it would otherwise cause... -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675399#comment-16675399 ] Hudson commented on HBASE-21423: Results for branch branch-2.0 [build #1060 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1060//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch > > > We give meta table procedures higher priority, but only at the queue level. > There is a case where the meta table is closed and an AssignProcedure (or RTSP > in branch-2+) is waiting to be executed, but at the same time all the > worker threads are executing procedures that need to write to the meta table; then all > the workers get stuck retrying the meta writes, and no worker will take the > AP for meta. 
> Though we have a mechanism that detects the stuck state and adds more > ''KeepAlive'' workers to the pool to resolve it, by the time it kicks in the > executor has already been stuck for a long time. > This is a real case I encountered in ITBLL. > So, I added one 'urgent worker' to the ProcedureExecutor, which only takes meta > procedures (other workers can take meta procedures too); this resolves > this kind of stuck state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20952) Re-visit the WAL API
[ https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675392#comment-16675392 ] Josh Elser commented on HBASE-20952: {quote} Don't see much movement on the feature branch for this. I'd like to disable the nightly test job for it until development picks up again. Please shout if you'd prefer it stay on. If so, please plan to clean up failures. {quote} "Shout". Will refresh from master. > Re-visit the WAL API > > > Key: HBASE-20952 > URL: https://issues.apache.org/jira/browse/HBASE-20952 > Project: HBase > Issue Type: Improvement > Components: wal >Reporter: Josh Elser >Priority: Major > Attachments: 20952.v1.txt > > > Take a step back from the current WAL implementations and think about what an > HBase WAL API should look like. What are the primitive calls that we require > to guarantee durability of writes with a high degree of performance? > The API needs to take the current implementations into consideration. We > should also have a mind for what is happening in the Ratis LogService (but > the LogService should not dictate what HBase's WAL API looks like RATIS-272). > Other "systems" inside of HBase that use WALs are replication and > backup&restore. Replication has the use-case for "tail"'ing the WAL which we > should provide via our new API. B&R doesn't do anything fancy (IIRC). We > should make sure all consumers are generally going to be OK with the API we > create. > The API may be "OK" (or OK in a part). We need to also consider other methods > which were "bolted" on such as {{AbstractFSWAL}} and > {{WALFileLengthProvider}}. Other corners of "WAL use" (like the > {{WALSplitter}} should also be looked at to use WAL-APIs only). > We also need to make sure that adequate interface audience and stability > annotations are chosen. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21247) Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675358#comment-16675358 ] Sean Busbey commented on HBASE-21247: - +1 nit:
{code}
public Class getProviderClass(String key,
    Class defaultVal) {
  if (conf.get(key) == null) {
    return conf.getClass(key, defaultVal, WALProvider.class);
  }
{code}
I think this extra call to {{getClass}} instead of just returning {{defaultVal}} (since we already know {{key}} doesn't map to a value) is confusing, but I suppose this is _technically_ more correct in case we have some kind of special {{Configuration}} implementation that has its own opinion about how fallback to the passed default class works. > Custom WAL Provider cannot be specified by configuration whose value is > outside the enums in Providers > -- > > Key: HBASE-21247 > URL: https://issues.apache.org/jira/browse/HBASE-21247 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0 > > Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, > 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, > 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt > > > Currently all the WAL Providers acceptable to hbase are specified in > Providers enum of WALFactory. > This restricts the ability for additional WAL Providers to be supplied - by > class name. > This issue fixes the bug by allowing the specification of new WAL Provider > class name using the config "hbase.wal.provider". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
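Sean's observation — that for a plain `Configuration` the extra `getClass` call and a direct `return defaultVal` coincide when the key is unset — can be checked against a toy config. `TinyConf` below is a stand-in for illustration, not Hadoop's `Configuration`:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for Hadoop's Configuration: getClass falls back to the
// default when the key is unset, which is why delegating (the patch's
// choice) and returning defaultVal directly behave identically here. A
// Configuration subclass with its own fallback rules could differ.
public class GetProviderClassDemo {
  static class TinyConf {
    private final Map<String, Class<?>> values = new HashMap<>();

    String get(String key) {
      Class<?> c = values.get(key);
      return c == null ? null : c.getName();
    }

    Class<?> getClass(String key, Class<?> defaultVal) {
      return values.getOrDefault(key, defaultVal);
    }
  }

  // The patch's shape: delegate to getClass even though the key is known unset.
  static Class<?> getProviderClass(TinyConf conf, String key, Class<?> defaultVal) {
    if (conf.get(key) == null) {
      return conf.getClass(key, defaultVal);
    }
    return conf.getClass(key, defaultVal);
  }

  // The simpler alternative Sean mentions: short-circuit to the default.
  static Class<?> getProviderClassDirect(TinyConf conf, String key, Class<?> defaultVal) {
    if (conf.get(key) == null) {
      return defaultVal;
    }
    return conf.getClass(key, defaultVal);
  }
}
```

For this plain implementation the two variants agree; the delegating form only matters for exotic `Configuration` subclasses.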
[jira] [Commented] (HBASE-20952) Re-visit the WAL API
[ https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675317#comment-16675317 ] Sean Busbey commented on HBASE-20952: - Don't see much movement on the feature branch for this. I'd like to disable the nightly test job for it until development picks up again. Please shout if you'd prefer it stay on. If so, please plan to clean up failures. > Re-visit the WAL API > > > Key: HBASE-20952 > URL: https://issues.apache.org/jira/browse/HBASE-20952 > Project: HBase > Issue Type: Improvement > Components: wal >Reporter: Josh Elser >Priority: Major > Attachments: 20952.v1.txt > > > Take a step back from the current WAL implementations and think about what an > HBase WAL API should look like. What are the primitive calls that we require > to guarantee durability of writes with a high degree of performance? > The API needs to take the current implementations into consideration. We > should also have a mind for what is happening in the Ratis LogService (but > the LogService should not dictate what HBase's WAL API looks like RATIS-272). > Other "systems" inside of HBase that use WALs are replication and > backup&restore. Replication has the use-case for "tail"'ing the WAL which we > should provide via our new API. B&R doesn't do anything fancy (IIRC). We > should make sure all consumers are generally going to be OK with the API we > create. > The API may be "OK" (or OK in a part). We need to also consider other methods > which were "bolted" on such as {{AbstractFSWAL}} and > {{WALFileLengthProvider}}. Other corners of "WAL use" (like the > {{WALSplitter}} should also be looked at to use WAL-APIs only). > We also need to make sure that adequate interface audience and stability > annotations are chosen. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675301#comment-16675301 ] Hadoop QA commented on HBASE-21421: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} branch-2.0 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 43s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 43s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 57s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 38s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} branch-2.0 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 52s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 14s{color} | {color:red} hbase-server: The patch generated 1 new + 25 unchanged - 0 fixed = 26 total (was 25) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 14s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 9m 13s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green}119m 23s{color} | {color:green} hbase-server in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}155m 34s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 | | JIRA Issue | HBASE-21421 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946904/HBASE-21421.branch-2.0.003.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 6890e9934077 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | branch-2.0 / d4233f207d | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC3 | | checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/14956/artifact/patchprocess/diff-checkstyle-hbase-server.txt | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14956/testReport/ | | Max. process+thread count | 4353 (vs. ulimit of 1) | | modules | C: hbase-
[jira] [Commented] (HBASE-21430) [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo
[ https://issues.apache.org/jira/browse/HBASE-21430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675265#comment-16675265 ] Sean Busbey commented on HBASE-21430: - just send them to the existing {{issues@}} list. I do think its own repo would give us more options around how we maintain compatibility with various spark release lines. I'm up for whatever folks want to do if they're doing the work though. :) > [hbase-connectors] Move hbase-spark* modules to hbase-connectors repo > - > > Key: HBASE-21430 > URL: https://issues.apache.org/jira/browse/HBASE-21430 > Project: HBase > Issue Type: Bug > Components: hbase-connectors, spark >Reporter: stack >Assignee: stack >Priority: Major > > Exploring moving the spark modules out of core hbase and into > hbase-connectors. Perhaps spark is deserving of its own repo (I think > [~busbey] was on about this) but meantime, experimenting w/ having it out in > hbase-connectors. > Here is thread on spark integration > https://lists.apache.org/thread.html/fd74ef9b9da77abf794664f06ea19c839fb3d543647fb29115081683@%3Cdev.hbase.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page
[ https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675254#comment-16675254 ] Sean Busbey commented on HBASE-21427: - related / not related I agree that nightly should be run once / day. the flaky job though loses a lot of utility if it isn't running very often. By their definition these are tests that we need multiple runs to see failures and multiple parallel runs for catch-up are a feature rather than a bug. > Make it so flakie's and nightlies build rates are configurable from jenkins > Config page > --- > > Key: HBASE-21427 > URL: https://issues.apache.org/jira/browse/HBASE-21427 > Project: HBase > Issue Type: Improvement > Components: test >Reporter: stack >Priority: Major > > Request is that rather than have to change code whenever we want to change > build rates, would be easier doing change in jenkins Config screen. > My guess is easy enough to do... . > ^^ > [~busbey] Lets chat (or I could bang my head) > See HBASE-21424 for a bit of background. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page
[ https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675250#comment-16675250 ] Sean Busbey commented on HBASE-21427: - IIRC it was not easy to do this because of how multi-branch pipelines work in jenkins. What if we moved this stuff into its own repo, something like {{hbase-ci}}? we could probably remove most of {{dev-support}}. we'd need to refactor things a bit since the job would have to manage what branches are tested itself, but we'd gain the ability to throttle ourselves project wide as opposed to now where it's per-branch. I went the per-branch route originally because it's what the jenkins docs encourage and it means we automagically have an equivalent amount of testing of new feature branches as the main branch. But I don't think folks are aware of the amount of testing done by default when they make a branch. Centralizing things would flip this around and make folks opt-in to it. We'd need to document that as a part of docs about "how to make a feature branch". > Make it so flakie's and nightlies build rates are configurable from jenkins > Config page > --- > > Key: HBASE-21427 > URL: https://issues.apache.org/jira/browse/HBASE-21427 > Project: HBase > Issue Type: Improvement > Components: test >Reporter: stack >Priority: Major > > Request is that rather than have to change code whenever we want to change > build rates, would be easier doing change in jenkins Config screen. > My guess is easy enough to do... . > ^^ > [~busbey] Lets chat (or I could bang my head) > See HBASE-21424 for a bit of background. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675249#comment-16675249 ] Hadoop QA commented on HBASE-21314: --- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 17s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 12s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 25s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 12s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 10m 45s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 19s{color} | {color:green} hbase-procedure in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 36m 12s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b | | JIRA Issue | HBASE-21314 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946913/HBASE-21314-v1.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 19f8bdee45dc 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | master / c8574ba3c5 | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC3 | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14957/testReport/ | | Max. process+thread count | 286 (vs. ulimit of 1) | | modules | C: hbase-procedure U: hbase-procedure | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/14957/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > The i
[jira] [Comment Edited] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT
[ https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675244#comment-16675244 ] Allan Yang edited comment on HBASE-21437 at 11/5/18 2:33 PM: - Maybe we can't resubmit a WAITING_TIMEOUT procedure directly when bypassing, since it should be woken by the timeout. Let the setTimeoutFailure method turn it into RUNNABLE state. was (Author: allan163): Maybe we can't resubmit a WAITING_TIMEOUT procedure when bypassing, since it should be woken by the timeout. Let the setTimeoutFailure method turn it into RUNNABLE state. > Bypassed procedure throw IllegalArgumentException when its state is > WAITING_TIMEOUT > --- > > Key: HBASE-21437 > URL: https://issues.apache.org/jira/browse/HBASE-21437 > Project: HBase > Issue Type: Bug >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Major > > {code} > 2018-11-05,18:25:52,735 WARN > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating > UNNATURALLY null > java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, > state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, > bypass=true; TransitRegionStateProcedure table=test_fail > over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN > at > org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948) > 2018-11-05,18:25:52,736 TRACE > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated. 
> {code} > Since when we bypass a WAITING_TIMEOUT procedure and resubmit it, its state > is still WAITING_TIMEOUT; when the executor runs this procedure, it will > throw an exception and terminate the worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT
[ https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675244#comment-16675244 ] Allan Yang commented on HBASE-21437: Maybe we can't resubmit a WAITING_TIMEOUT procedure when bypassing, since it should be woken by the timeout. Let the setTimeoutFailure method turn it into RUNNABLE state. > Bypassed procedure throw IllegalArgumentException when its state is > WAITING_TIMEOUT > --- > > Key: HBASE-21437 > URL: https://issues.apache.org/jira/browse/HBASE-21437 > Project: HBase > Issue Type: Bug >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Major > > {code} > 2018-11-05,18:25:52,735 WARN > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating > UNNATURALLY null > java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, > state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, > bypass=true; TransitRegionStateProcedure table=test_fail > over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN > at > org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948) > 2018-11-05,18:25:52,736 TRACE > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated. > {code} > Since when we bypass a WAITING_TIMEOUT procedure and resubmit it, its state > is still WAITING_TIMEOUT; when the executor runs this procedure, it will > throw an exception and terminate the worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
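The fix direction suggested in the comment above (let {{setTimeoutFailure}} flip the bypassed procedure back to RUNNABLE instead of resubmitting it in WAITING_TIMEOUT) can be sketched with a toy state machine. {{ProcState}} and {{ToyProcedure}} are illustrative names, not the real procedure-v2 classes:

```java
// Toy model of the bypass bug; illustrative names only, not the real
// org.apache.hadoop.hbase.procedure2 classes.
enum ProcState { RUNNABLE, WAITING_TIMEOUT }

public class ToyProcedure {
  ProcState state = ProcState.RUNNABLE;
  boolean bypass = false;

  void suspend() {
    state = ProcState.WAITING_TIMEOUT;
  }

  // Bypassing alone leaves the state untouched, which is the bug: a
  // worker that resubmits this procedure trips the NOT RUNNABLE check.
  void markBypassed() {
    bypass = true;
  }

  // Suggested fix direction: when the timeout path fires, move the
  // procedure back to RUNNABLE before any worker picks it up.
  void setTimeoutFailure() {
    if (state == ProcState.WAITING_TIMEOUT) {
      state = ProcState.RUNNABLE;
    }
  }

  // Mirrors the Preconditions.checkArgument guard in execProcedure.
  void exec() {
    if (state != ProcState.RUNNABLE) {
      throw new IllegalArgumentException("NOT RUNNABLE! state=" + state);
    }
  }
}
```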
[jira] [Updated] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page
[ https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-21427: Issue Type: Improvement (was: Bug) > Make it so flakie's and nightlies build rates are configurable from jenkins > Config page > --- > > Key: HBASE-21427 > URL: https://issues.apache.org/jira/browse/HBASE-21427 > Project: HBase > Issue Type: Improvement > Components: test >Reporter: stack >Priority: Major > > Request is that rather than have to change code whenever we want to change > build rates, would be easier doing change in jenkins Config screen. > My guess is easy enough to do... . > ^^ > [~busbey] Lets chat (or I could bang my head) > See HBASE-21424 for a bit of background. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21427) Make it so flakie's and nightlies build rates are configurable from jenkins Config page
[ https://issues.apache.org/jira/browse/HBASE-21427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-21427: Component/s: test > Make it so flakie's and nightlies build rates are configurable from jenkins > Config page > --- > > Key: HBASE-21427 > URL: https://issues.apache.org/jira/browse/HBASE-21427 > Project: HBase > Issue Type: Improvement > Components: test >Reporter: stack >Priority: Major > > Request is that rather than have to change code whenever we want to change > build rates, would be easier doing change in jenkins Config screen. > My guess is easy enough to do... . > ^^ > [~busbey] Lets chat (or I could bang my head) > See HBASE-21424 for a bit of background. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang updated HBASE-21421: --- Attachment: HBASE-21421.branch-2.0.004.patch > Do not kill RS if reportOnlineRegions fails > --- > > Key: HBASE-21421 > URL: https://issues.apache.org/jira/browse/HBASE-21421 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Attachments: HBASE-21421.branch-2.0.001.patch, > HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch, > HBASE-21421.branch-2.0.004.patch > > > In the periodic regionServerReport from RS to master, we call > master.getAssignmentManager().reportOnlineRegions() to make sure the RS has > the same state as the Master. If the RS holds a region which the master > thinks should be on another RS, the Master will kill the RS. > But the regionServerReport could be lagging (due to the network or something > else), so it may not represent the current state of the RegionServer. > Besides, when onlining a region we call reportRegionStateTransition and > retry forever until it is successfully reported to the master, so we can > count on reportRegionStateTransition calls. > I have encountered cases where regions are closed on the RS and > reportRegionStateTransition reaches the master successfully, but later a > lagging regionServerReport tells the master the region is online on the RS > (which it is not at that moment; the call may have been generated some time > ago and delayed by the network). The master then thinks the region should be > on another RS and kills the RS, which it should not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang updated HBASE-21314: --- Comment: was deleted (was: Got it, +1 for the V3 patch. ) > The implementation of BitSetNode is not efficient > - > > Key: HBASE-21314 > URL: https://issues.apache.org/jira/browse/HBASE-21314 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, > HBASE-21314.patch > > > As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we > could only have one word(long) for each BitSetNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675200#comment-16675200 ] Allan Yang commented on HBASE-21314: Got it, +1 for the V3 patch. > The implementation of BitSetNode is not efficient > - > > Key: HBASE-21314 > URL: https://issues.apache.org/jira/browse/HBASE-21314 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, > HBASE-21314.patch > > > As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we > could only have one word(long) for each BitSetNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT
Jingyun Tian created HBASE-21437: Summary: Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT Key: HBASE-21437 URL: https://issues.apache.org/jira/browse/HBASE-21437 Project: HBase Issue Type: Bug Reporter: Jingyun Tian Assignee: Jingyun Tian {code} 2018-11-05,18:25:52,735 WARN org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating UNNATURALLY null java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, bypass=true; TransitRegionStateProcedure table=test_fail over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948) 2018-11-05,18:25:52,736 TRACE org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated. {code} Since when we bypass a WAITING_TIMEOUT procedure and resubmit it, its state is still WAITING_TIMEOUT; when the executor runs this procedure, it will throw an exception and terminate the worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675194#comment-16675194 ] Allan Yang commented on HBASE-21314: Got it, +1 for the V3 patch. > The implementation of BitSetNode is not efficient > - > > Key: HBASE-21314 > URL: https://issues.apache.org/jira/browse/HBASE-21314 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, > HBASE-21314.patch > > > As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we > could only have one word(long) for each BitSetNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675176#comment-16675176 ] Duo Zhang commented on HBASE-21314: --- Add comments in isDeleted and isModified. And for the Math.abs, replied on RB: {quote} Because the BitSetNode can grow in both directions. If growing left you should check against the end, while growing right you should check against the start. So it is not a simple abs. {quote} Thanks. > The implementation of BitSetNode is not efficient > - > > Key: HBASE-21314 > URL: https://issues.apache.org/jira/browse/HBASE-21314 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, > HBASE-21314.patch > > > As MAX_NODE_SIZE is the same as BITS_PER_WORD, we can only have one > word (long) for each BitSetNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
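The point in the quoted reply above, that a plain Math.abs cannot express bidirectional growth, can be illustrated with a small range model. {{ProcRange}} and its fields are hypothetical names, not the actual BitSetNode implementation: the size after growth depends on which end the new proc id falls beyond.

```java
// Toy range model illustrating growth in both directions; names are
// hypothetical, not the actual BitSetNode fields.
public final class ProcRange {
  final long start; // inclusive
  final long end;   // inclusive

  ProcRange(long start, long end) {
    this.start = start;
    this.end = end;
  }

  // Size of the range after growing to cover procId. Growing left must
  // be measured against the current end, growing right against the
  // current start, so a single Math.abs(procId - x) cannot cover both.
  long sizeAfterGrow(long procId) {
    if (procId < start) {
      return end - procId + 1;   // grow left: span measured from the end
    }
    if (procId > end) {
      return procId - start + 1; // grow right: span measured from the start
    }
    return end - start + 1;      // already covered
  }

  // Whether the node may absorb procId without exceeding the cap
  // (maxNodeSize standing in for MAX_NODE_SIZE).
  boolean canGrow(long procId, long maxNodeSize) {
    return sizeAfterGrow(procId) <= maxNodeSize;
  }
}
```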
[jira] [Updated] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21314: -- Attachment: HBASE-21314-v1.patch > The implementation of BitSetNode is not efficient > - > > Key: HBASE-21314 > URL: https://issues.apache.org/jira/browse/HBASE-21314 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21314-v1.patch, HBASE-21314.patch, > HBASE-21314.patch > > > As the MAX_NODE_SIZE is the same with BITS_PER_WORD, which means that we > could only have one word(long) for each BitSetNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21420) Use procedure event to wake up the SyncReplicationReplayWALProcedures which wait for worker
[ https://issues.apache.org/jira/browse/HBASE-21420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21420: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to master. Thanks [~zghaobac] for reviewing. > Use procedure event to wake up the SyncReplicationReplayWALProcedures which > wait for worker > --- > > Key: HBASE-21420 > URL: https://issues.apache.org/jira/browse/HBASE-21420 > Project: HBase > Issue Type: Sub-task > Components: Replication >Reporter: Guanghao Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0 > > Attachments: HBASE-21420-v1.patch, HBASE-21420-v2.patch, > HBASE-21420.patch > > > Now if a SyncReplicationReplayWALProcedure fails to get a worker, it will > sleep with backoff and retry. So when a finished > SyncReplicationReplayWALProcedure releases a worker, it can take a long > time for a waiting procedure to wake up and get the worker to run. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675156#comment-16675156 ] Duo Zhang commented on HBASE-21421: --- {code} LOG.warn("Failed to checkOnlineRegionsReport, maybe due to network log, " {code} 'log' to 'lag'. And please remove the empty '@throw Exception', it will cause a checkstyle warning I think. No other problem. +1 after you fix these issues. > Do not kill RS if reportOnlineRegions fails > --- > > Key: HBASE-21421 > URL: https://issues.apache.org/jira/browse/HBASE-21421 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Attachments: HBASE-21421.branch-2.0.001.patch, > HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch > > > In the periodic regionServerReport from RS to master, we call > master.getAssignmentManager().reportOnlineRegions() to make sure the RS has > the same state as the Master. If the RS holds a region which the master > thinks should be on another RS, the Master will kill the RS. > But the regionServerReport could be lagging (due to the network or something > else), so it may not represent the current state of the RegionServer. > Besides, when onlining a region we call reportRegionStateTransition and > retry forever until it is successfully reported to the master, so we can > count on reportRegionStateTransition calls. > I have encountered cases where regions are closed on the RS and > reportRegionStateTransition reaches the master successfully, but later a > lagging regionServerReport tells the master the region is online on the RS > (which it is not at that moment; the call may have been generated some time > ago and delayed by the network). The master then thinks the region should be > on another RS and kills the RS, which it should not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
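The behavior change under review in this issue can be sketched as follows. The types and names below are simplified stand-ins for the AssignmentManager code, showing the direction of the patch rather than its literal code: a mismatch seen in the periodic report is logged as potentially stale instead of being used to expire the server, since reportRegionStateTransition retries until acknowledged and remains the authoritative channel.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the online-regions report check; illustrative
// stand-ins for the AssignmentManager types, not the real API.
public class OnlineRegionCheck {
  // Master's view: region -> server the master believes holds it.
  final Map<String, String> masterView = new HashMap<>();
  // Servers the old behavior would have expired.
  final List<String> killed = new ArrayList<>();
  final List<String> warnings = new ArrayList<>();

  // Old behavior: a mismatch expires the reporting server.
  void checkAndKill(String region, String reportingServer) {
    String expected = masterView.get(region);
    if (expected != null && !expected.equals(reportingServer)) {
      killed.add(reportingServer);
    }
  }

  // Patched direction: the periodic report may be lagging (delayed by
  // the network), so only warn; reportRegionStateTransition, which is
  // retried until the master acknowledges it, stays authoritative.
  void checkAndWarn(String region, String reportingServer) {
    String expected = masterView.get(region);
    if (expected != null && !expected.equals(reportingServer)) {
      warnings.add("Failed to checkOnlineRegionsReport, maybe due to network lag: "
          + region + " reported by " + reportingServer);
    }
  }
}
```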
[jira] [Commented] (HBASE-21420) Use procedure event to wake up the SyncReplicationReplayWALProcedures which wait for worker
[ https://issues.apache.org/jira/browse/HBASE-21420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675120#comment-16675120 ] Hadoop QA commented on HBASE-21420: --- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 28s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 28s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 42s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 56s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 31s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 21s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 19s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 11m 37s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. 
{color} | | {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 2m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 38s{color} | {color:green} hbase-protocol-shaded in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s{color} | {color:green} hbase-replication in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 40s{color} | {color:green} hbase-procedure in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green}162m 1s{color} | {color:green} hbase-server in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 1m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}229m 33s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b | | JIRA Issue | HBASE-21420 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946885/HBASE-21420-v2
[jira] [Commented] (HBASE-21314) The implementation of BitSetNode is not efficient
[ https://issues.apache.org/jira/browse/HBASE-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675100#comment-16675100 ] Allan Yang commented on HBASE-21314: {code} -return Math.abs(procId - start) < MAX_NODE_SIZE; {code} Why not use abs? Can you add a comment in isDeleted and isModified here: {code} modified[wordIndex] & (1L << bitmapIndex) {code} We were all confused about the left shift before (which can be by more than 64 positions). The patch looks good to me. > The implementation of BitSetNode is not efficient > - > > Key: HBASE-21314 > URL: https://issues.apache.org/jira/browse/HBASE-21314 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21314.patch, HBASE-21314.patch > > > Since MAX_NODE_SIZE is the same as BITS_PER_WORD, we can only have one word (long) > for each BitSetNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
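For context on the shift confusion mentioned in the comment above: in Java, the shift count of a long shift is masked to its low 6 bits (JLS §15.19), so a "left shift of more than 64" silently wraps around rather than zeroing the value. A minimal demonstration, with the BitSetNode-style indexing shown using illustrative names:

```java
public class ShiftMasking {
    public static void main(String[] args) {
        // JLS §15.19: for a long left operand, only the low 6 bits of the shift
        // count are used, so the count is effectively taken modulo 64.
        System.out.println(1L << 64);               // 1 (same as 1L << 0)
        System.out.println((1L << 70) == (1L << 6)); // true

        // BitSetNode-style indexing (illustrative): the word index picks the long,
        // and the raw offset works as the bit index precisely because of masking.
        long start = 0, procId = 200;
        int wordIndex = (int) ((procId - start) >>> 6);  // 200 / 64 = 3
        long[] modified = new long[4];
        modified[wordIndex] |= 1L << (procId - start);   // sets bit 200 & 63 = 8
        boolean isModified = (modified[wordIndex] & (1L << (procId - start))) != 0;
        System.out.println(isModified);  // true
    }
}
```

This is exactly why a comment in isDeleted/isModified helps: the code looks like it can shift past 64 bits, but the masking makes it correct.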
[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675087#comment-16675087 ] Allan Yang commented on HBASE-21421: [~Apache9], uploaded a V2 to address your advice, thanks. > Do not kill RS if reportOnlineRegions fails > --- > > Key: HBASE-21421 > URL: https://issues.apache.org/jira/browse/HBASE-21421 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Attachments: HBASE-21421.branch-2.0.001.patch, > HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang updated HBASE-21421: --- Attachment: HBASE-21421.branch-2.0.003.patch > Do not kill RS if reportOnlineRegions fails > --- > > Key: HBASE-21421 > URL: https://issues.apache.org/jira/browse/HBASE-21421 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Attachments: HBASE-21421.branch-2.0.001.patch, > HBASE-21421.branch-2.0.002.patch, HBASE-21421.branch-2.0.003.patch -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675082#comment-16675082 ] Allan Yang commented on HBASE-21423: Pushed to branch-2.1 and branch-2.0. Thanks for reviewing, [~Apache9]. > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch > > > We have a higher priority for meta table procedures, but only at the queue level. > There is a case where the meta table is closed and an AssignProcedure (or RTSP > in branch-2+) is waiting to be executed, but at the same time all the worker > threads are executing procedures that need to write to the meta table; then all > the workers get stuck retrying the meta writes, and no worker will take the > AP for meta. > Though we have a mechanism that detects the stall and adds more > ''KeepAlive'' workers to the pool to resolve it, by then it has already been > stuck for a long time. > This is a real case I encountered in ITBLL. > So I added one 'urgent worker' to the ProcedureExecutor, which only takes meta > procedures (other workers can take meta procedures too), which resolves > this kind of stall. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
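The "urgent worker" idea behind HBASE-21423 can be sketched like this; all names are illustrative, not the actual ProcedureExecutor scheduler code. Normal workers drain both queues, while the urgent worker only ever takes meta procedures, so it can never be tied up by ordinary table work:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; the real ProcedureExecutor scheduling is more involved.
public class UrgentWorkerDemo {
    record Proc(String name, boolean isMeta) {}

    // Normal workers prefer meta work, then take ordinary work; the urgent worker
    // takes meta procedures only and otherwise stays idle.
    static Proc poll(Deque<Proc> metaQ, Deque<Proc> normalQ, boolean urgentWorker) {
        Proc p = metaQ.poll();
        if (p != null) return p;
        return urgentWorker ? null : normalQ.poll();
    }

    public static void main(String[] args) {
        Deque<Proc> metaQ = new ArrayDeque<>();
        Deque<Proc> normalQ = new ArrayDeque<>();
        normalQ.add(new Proc("ModifyTable", false));  // work that may block on meta writes
        metaQ.add(new Proc("AssignMeta", true));      // the stuck meta assignment
        System.out.println(poll(metaQ, normalQ, true).name());   // AssignMeta
        System.out.println(poll(metaQ, normalQ, true));          // null: urgent worker idles
        System.out.println(poll(metaQ, normalQ, false).name());  // ModifyTable
    }
}
```

Because the urgent worker never picks up ordinary procedures, a meta AssignProcedure always has at least one free worker, even when every normal worker is blocked retrying a meta write.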
[jira] [Updated] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang updated HBASE-21423: --- Resolution: Fixed Fix Version/s: 2.1.2 2.0.3 Status: Resolved (was: Patch Available) > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21255) [acl] Refactor TablePermission into three classes (Global, Namespace, Table)
[ https://issues.apache.org/jira/browse/HBASE-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675051#comment-16675051 ] Hadoop QA commented on HBASE-21255: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 8 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 28s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 11s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 24s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 20s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 15s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 7s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 24s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} The patch passed checkstyle in hbase-common {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 33s{color} | {color:red} hbase-client: The patch generated 6 new + 79 unchanged - 30 fixed = 85 total (was 109) {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 12s{color} | {color:green} hbase-server: The patch generated 0 new + 101 unchanged - 53 fixed = 101 total (was 154) {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s{color} | {color:green} The patch passed checkstyle in hbase-rsgroup {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 14s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 11m 4s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 31s{color} | {color:green} hbase-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 8s{color} | {color:green} hbase-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green}150m 16s{color} | {color:green} hbase-server in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 26s{color} | {color:green} hbase-rsgroup in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 1m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color}
[jira] [Commented] (HBASE-21421) Do not kill RS if reportOnlineRegions fails
[ https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675050#comment-16675050 ] Duo Zhang commented on HBASE-21421: --- Please log the full stack trace instead of e.getMessage() (not your fault), and let's remove the code instead of commenting it out? > Do not kill RS if reportOnlineRegions fails > --- > > Key: HBASE-21421 > URL: https://issues.apache.org/jira/browse/HBASE-21421 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.1, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Attachments: HBASE-21421.branch-2.0.001.patch, > HBASE-21421.branch-2.0.002.patch -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on
[ https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675047#comment-16675047 ] Allan Yang commented on HBASE-21395: Pushed to branch-2.0 and branch-2.1, thanks for reviewing, [~stack], [~xucang]. > Abort split/merge procedure if there is a table procedure of the same table > going on > > > Key: HBASE-21395 > URL: https://issues.apache.org/jira/browse/HBASE-21395 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.0, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21395.branch-2.0.001.patch, > HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, > HBASE-21395.branch-2.0.004.patch > > > In my ITBLL runs, I often see that if a split/merge procedure and a table > procedure (like ModifyTableProcedure) happen at the same time, race conditions > between these two kinds of procedures cause serious problems, e.g. the > split/merge parent is brought online by the table procedure, or the > split/merged region makes the whole table procedure roll back. > Talked with [~Apache9] offline today; this kind of problem is solved in > branch-2+ since there is a fence such that only one RTSP can run against a > single region at the same time. > To stay out of the mess in branch-2.0 and branch-2.1, I added a simple safety > fence to the split/merge procedure: if there is a table procedure going on > against the same table, abort the split/merge procedure. Aborting the > split/merge procedure at the beginning of its execution is no big deal > compared with the mess it would otherwise cause... -- This message was sent by Atlassian JIRA (v7.6.3#76005)
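The fence for HBASE-21395 can be sketched as a simple pre-flight check; Procedure and shouldAbortSplit below are hypothetical names, not the actual HBase procedure classes. Before the split/merge procedure starts, it aborts if any running table procedure targets the same table:

```java
import java.util.List;

// Hypothetical sketch of the HBASE-21395 safety fence (illustrative names only).
public class SplitFenceDemo {
    record Procedure(String table, boolean isTableProcedure) {}

    // Abort a split/merge up front if any running table procedure targets the
    // same table; this is cheap to do before any region state has changed.
    static boolean shouldAbortSplit(String table, List<Procedure> running) {
        return running.stream()
            .anyMatch(p -> p.isTableProcedure() && p.table().equals(table));
    }

    public static void main(String[] args) {
        List<Procedure> running = List.of(new Procedure("t1", true));
        System.out.println(shouldAbortSplit("t1", running));  // true: abort the split
        System.out.println(shouldAbortSplit("t2", running));  // false: safe to proceed
    }
}
```

Failing fast at the start is the whole design: once a split/merge has begun mutating region state, a concurrent ModifyTableProcedure can no longer be safely tolerated, so the check must come first.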
[jira] [Updated] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on
[ https://issues.apache.org/jira/browse/HBASE-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang updated HBASE-21395: --- Resolution: Fixed Status: Resolved (was: Patch Available) > Abort split/merge procedure if there is a table procedure of the same table > going on > > > Key: HBASE-21395 > URL: https://issues.apache.org/jira/browse/HBASE-21395 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.0, 2.0.2 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21395.branch-2.0.001.patch, > HBASE-21395.branch-2.0.002.patch, HBASE-21395.branch-2.0.003.patch, > HBASE-21395.branch-2.0.004.patch -- This message was sent by Atlassian JIRA (v7.6.3#76005)