[jira] [Commented] (HBASE-21568) Disable use of BlockCache for LoadIncrementalHFiles
[ https://issues.apache.org/jira/browse/HBASE-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714351#comment-16714351 ] Guanghao Zhang commented on HBASE-21568: {quote}bq. Are you eventually planning a backport of that change to 2.x? {quote} Yes, I plan to commit it to all branch-2.0+. But it may need more time... So feel free to commit this :) > Disable use of BlockCache for LoadIncrementalHFiles > --- > > Key: HBASE-21568 > URL: https://issues.apache.org/jira/browse/HBASE-21568 > Project: HBase > Issue Type: Bug > Components: Client >Reporter: Josh Elser >Assignee: Josh Elser >Priority: Major > Fix For: 2.2.0, 2.1.2, 2.0.4 > > Attachments: HBASE-21568.001.branch-2.0.patch > > > [~vrodionov] added some API to {{CacheConfig}} via HBASE-17151 to allow > callers to specify that they do not want to use a block cache when reading an > HFile. > If the BucketCache is set up to use the FileSystem, we can have a situation > where the client tries to instantiate the BucketCache and is disallowed due > to filesystem permissions: > {code:java} > 2018-12-03 16:22:03,032 ERROR [LoadIncrementalHFiles-0] bucket.FileIOEngine: > Failed allocating cache on /mnt/hbase/cache.data > java.io.FileNotFoundException: /mnt/hbase/cache.data (Permission denied) > at java.io.RandomAccessFile.open0(Native Method) > at java.io.RandomAccessFile.open(RandomAccessFile.java:316) > at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243) > at java.io.RandomAccessFile.<init>(RandomAccessFile.java:124) > at > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.<init>(FileIOEngine.java:81) > at > org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getIOEngineFromName(BucketCache.java:382) > at > org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.<init>(BucketCache.java:262) > at > org.apache.hadoop.hbase.io.hfile.CacheConfig.getBucketCache(CacheConfig.java:633) > at > org.apache.hadoop.hbase.io.hfile.CacheConfig.instantiateBlockCache(CacheConfig.java:663) > at 
org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:250) > at > org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:713) > at > org.apache.hadoop.hbase.tool.LoadIncrementalHFiles$3.call(LoadIncrementalHFiles.java:621) > at > org.apache.hadoop.hbase.tool.LoadIncrementalHFiles$3.call(LoadIncrementalHFiles.java:617) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > LoadIncrementalHFiles should provide the {{CacheConfig.DISABLE}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
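The fix described here — handing the client-side tool an inert cache configuration instead of letting it instantiate the BucketCache — can be sketched as follows. All type and method names below are hypothetical, illustrating only the pattern, not HBase's actual {{CacheConfig}} API:

```java
// Illustrative sketch of the "disabled cache config" pattern; the
// names here are hypothetical, not HBase's actual CacheConfig API.
interface BlockCacheConfig {
    boolean shouldCacheOnRead();
}

// A shared, inert instance: a reader handed this config never tries
// to instantiate a block cache, so a client process that lacks access
// to the cache file (the "Permission denied" above) never touches it.
final class DisabledCacheConfig implements BlockCacheConfig {
    static final DisabledCacheConfig INSTANCE = new DisabledCacheConfig();
    private DisabledCacheConfig() {}
    public boolean shouldCacheOnRead() { return false; }
}
```

A bulk-load tool would then pass the disabled instance to its HFile readers rather than building a cache configuration from the server settings.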
[jira] [Commented] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server
[ https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714343#comment-16714343 ] Jingyun Tian commented on HBASE-21565: -- [~Apache9] I ported the patch from HBASE-20976 and set holdLock to true. Can you help check this out? > Delete dead server from dead server list too early leads to concurrent Server > Crash Procedures(SCP) for a same server > - > > Key: HBASE-21565 > URL: https://issues.apache.org/jira/browse/HBASE-21565 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Critical > Attachments: HBASE-21565.master.001.patch, > HBASE-21565.master.002.patch > > > Two kinds of SCP can be scheduled for the same server during cluster > restart: one triggered by a ZK session timeout, the other because a new server > reporting in causes failover of the stale one. The only barrier between these > two kinds of SCP is the check whether the server is in the dead server list. > {code} > if (this.deadservers.isDeadServer(serverName)) { > LOG.warn("Expiration called on {} but crash processing already in > progress", serverName); > return false; > } > {code} > But the problem is that when the master finishes initialization, it deletes all > stale servers from the dead server list. Thus when the SCP for the ZK session > timeout comes in, the barrier is already removed. > Here are the logs showing how this problem occurs. 
> {code} > 2018-12-07,11:42:37,589 INFO > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=9, > state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure > server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false > 2018-12-07,11:42:58,007 INFO > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=444, > state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure > server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false > {code} > Now we can see two SCPs are scheduled for the same server, > and the first procedure finishes after the second SCP starts. > {code} > 2018-12-07,11:43:08,038 INFO > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=9, > state=SUCCESS, hasLock=false; ServerCrashProcedure > server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false > in 30.5340sec > {code} > This leads to the problem that regions are assigned twice. > {code} > 2018-12-07,12:16:33,039 WARN > org.apache.hadoop.hbase.master.assignment.AssignmentManager: rit=OPEN, > location=c4-hadoop-tst-st28.bj,29100,1544154149607, table=test_failover, > region=459b3130b40caf3b8f3e1421766f4089 reported OPEN on > server=c4-hadoop-tst-st29.bj,29100,1544154149615 but state has otherwise > {code} > And here we can see the server is removed from the dead server list before the > second SCP starts. > {code} > 2018-12-07,11:42:44,938 DEBUG org.apache.hadoop.hbase.master.DeadServer: > Removed c4-hadoop-tst-st27.bj,29100,1544153846859 ; numProcessing=3 > {code} > Thus we should not delete a dead server from the dead server list immediately. > A patch to fix this problem will be uploaded later. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
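The race described above can be modeled in a few lines. This is an illustrative sketch — not the actual DeadServer/ServerCrashProcedure code, and the class and method names are made up:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of the dead-server barrier; not HBase's actual
// DeadServer class — names here are hypothetical.
class DeadServerBarrier {
    private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

    // The first expiration of a server schedules crash processing; a
    // second expiration for the same server is rejected while the
    // server stays on the dead server list.
    boolean expireServer(String serverName) {
        // add() is atomic: false means the server is already being processed.
        return deadServers.add(serverName);
    }

    // The bug: removing the entry too early re-opens the barrier, so a
    // late ZK-session-timeout expiration schedules a second SCP.
    void removeDeadServer(String serverName) {
        deadServers.remove(serverName);
    }
}
```

Holding the entry (or the procedure lock, as holdLock=true does) until crash processing fully completes keeps the second expiration blocked.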
[jira] [Updated] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server
[ https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jingyun Tian updated HBASE-21565: - Affects Version/s: 3.0.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server
[ https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714324#comment-16714324 ] Hadoop QA commented on HBASE-21565: --- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 0s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 45s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 9s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 48s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 56s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 43s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 8m 15s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green}126m 13s{color} | {color:green} hbase-server in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}161m 57s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b | | JIRA Issue | HBASE-21565 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12951150/HBASE-21565.master.002.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 6abc41e3b4b5 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | master / 79d90c87b5 | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC3 | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15229/testReport/ | | Max. process+thread count | 5031 (vs. ulimit of 1) | | modules | C: hbase-server U: hbase-server | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/15229/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Delete dead server
[jira] [Updated] (HBASE-21572) The "progress" object in "Compactor" is not thread-safe, this may cause the misleading progress information on the web UI.
[ https://issues.apache.org/jira/browse/HBASE-21572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lixiaobao updated HBASE-21572: -- Attachment: HBASE-21572.patch Status: Patch Available (was: Open) > The "progress" object in "Compactor" is not thread-safe, this may cause the > misleading progress information on the web UI. > --- > > Key: HBASE-21572 > URL: https://issues.apache.org/jira/browse/HBASE-21572 > Project: HBase > Issue Type: Bug > Components: Compaction >Affects Versions: 2.0.0, 2.1.0, 1.4.0, 1.3.0, 1.2.0, 3.0.0 >Reporter: lixiaobao >Assignee: lixiaobao >Priority: Major > Fix For: 3.0.0 > > Attachments: HBASE-21572.patch > > > When the compaction thread number is set to more than 1, multiple threads on > the region server may use the "compactor" of the same "store" to execute > compactions. However, the "progress" object in "Compactor" is not thread-safe, > which may cause misleading progress information on the web UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21572) The "progress" object in "Compactor" is not thread-safe, this may cause the misleading progress information on the web UI.
lixiaobao created HBASE-21572: - Summary: The "progress" object in "Compactor" is not thread-safe, this may cause the misleading progress information on the web UI. Key: HBASE-21572 URL: https://issues.apache.org/jira/browse/HBASE-21572 Project: HBase Issue Type: Bug Components: Compaction Affects Versions: 2.0.0, 2.1.0, 1.4.0, 1.3.0, 1.2.0, 3.0.0 Reporter: lixiaobao Assignee: lixiaobao Fix For: 3.0.0 When the compaction thread number is set to more than 1, multiple threads on the region server may use the "compactor" of the same "store" to execute compactions. However, the "progress" object in "Compactor" is not thread-safe, which may cause misleading progress information on the web UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
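A minimal sketch of the thread-safe alternative (this is not HBase's actual progress class — the names and structure are assumed for illustration): increments on a shared `java.util.concurrent.atomic.AtomicLong` are never lost, whereas `++` on a plain long field can drop updates when several compaction threads run concurrently.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch, not HBase's actual progress object: a progress
// counter shared by several compaction threads.
class SharedProgress {
    // Atomic increments are never lost, unlike ++ on a plain long field.
    private final AtomicLong processedKvs = new AtomicLong();

    // Simulate `threads` compaction threads each reporting `perThread` cells.
    long run(int threads, int perThread) {
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    processedKvs.incrementAndGet();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return processedKvs.get();
    }
}
```

With a plain long the final count would often come up short under contention, which is exactly the kind of misleading number the web UI would then display.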
[jira] [Commented] (HBASE-21568) Disable use of BlockCache for LoadIncrementalHFiles
[ https://issues.apache.org/jira/browse/HBASE-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714257#comment-16714257 ] Josh Elser commented on HBASE-21568: Thanks for the ping, [~zghaobac]! Only looking at the description of 21514 – I would imagine so :). Are you eventually planning a backport of that change to 2.x? If not, I can treat this as a stop-gap until your better fix. It is silly that a client operation would even think about instantiating a block cache ;) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server
[ https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714258#comment-16714258 ] Jingyun Tian commented on HBASE-21565: -- [~allan163] Thanks for your comment. I ported your patch into this one. I also think [~Apache9] Duo's opinion is reasonable: we can set holdLock to true to prevent multiple SCPs for the same server from running concurrently. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server
[ https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jingyun Tian updated HBASE-21565: - Attachment: HBASE-21565.master.002.patch -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714223#comment-16714223 ] Reid Chan commented on HBASE-21246: --- Thanks for the ping [~an...@apache.org]. I need some time; I will be back. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, > 21246.37.txt, 21246.39.txt, 21246.41.txt, 21246.43.txt, > 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > HBASE-21246.master.001.patch, replication-src-creates-wal-reader.jpg, > wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, > wal-splitter-writer.jpg > > > We are introducing the WALIdentity interface so that the WAL representation can > be decoupled from the distributed filesystem. > The interface provides a getName method whose return value can represent a > filename in a distributed filesystem environment, or the name of the stream > when the WAL is backed by a log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
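The decoupling described in the issue can be sketched roughly as below. The interface and implementation names are assumed from the description alone — the actual interface on the HBASE-20952 feature branch may differ:

```java
// Hypothetical sketch of the WALIdentity idea from the description;
// the real interface on the feature branch may differ.
interface WALIdentity {
    // A filename in a distributed filesystem environment, or the
    // stream name when the WAL is backed by a log stream.
    String getName();
}

// Filesystem-backed identity (illustrative implementation).
final class FsWALIdentity implements WALIdentity {
    private final String path;
    FsWALIdentity(String path) { this.path = path; }
    public String getName() { return path; }
}

// Stream-backed identity (illustrative implementation).
final class StreamWALIdentity implements WALIdentity {
    private final String streamName;
    StreamWALIdentity(String streamName) { this.streamName = streamName; }
    public String getName() { return streamName; }
}
```

Code that today passes filesystem Paths around could then work against WALIdentity and stay oblivious to whether the WAL lives on a filesystem or a log stream.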
[jira] [Updated] (HBASE-21571) The TestTableSnapshotInputFormat is flaky
[ https://issues.apache.org/jira/browse/HBASE-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Hu updated HBASE-21571: - Attachment: jenkins.tar.gz > The TestTableSnapshotInputFormat is flaky > - > > Key: HBASE-21571 > URL: https://issues.apache.org/jira/browse/HBASE-21571 > Project: HBase > Issue Type: Bug >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Attachments: jenkins.tar.gz > > > see: > https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.0/lastSuccessfulBuild/artifact/dashboard.html > RS aborted because : > {code} > 2018-12-09 00:34:18,635 WARN > [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=34270] > master.MasterRpcServices(514): asf905.gq1.ygridcore.net,32908,1544315637411 > reported a fatal error: > * ABORTING region server asf905.gq1.ygridcore.net,32908,1544315637411: > Unrecoverable exception while closing region > testWithMapReduce,,1544315644043.97dff0ec285658ab3f73d5ca42a97b6e., still > finishing close * > Cause: > java.io.IOException: The new max sequence id 6 is less than the old max > sequence id 7 > at > org.apache.hadoop.hbase.wal.WALSplitter.writeRegionSequenceIdFile(WALSplitter.java:684) > at > org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1134) > at > org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1662) > at > org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1479) > at > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21571) The TestTableSnapshotInputFormat is flaky
Zheng Hu created HBASE-21571: Summary: The TestTableSnapshotInputFormat is flaky Key: HBASE-21571 URL: https://issues.apache.org/jira/browse/HBASE-21571 Project: HBase Issue Type: Bug Reporter: Zheng Hu Assignee: Zheng Hu see: https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.0/lastSuccessfulBuild/artifact/dashboard.html RS aborted because : {code} 2018-12-09 00:34:18,635 WARN [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=34270] master.MasterRpcServices(514): asf905.gq1.ygridcore.net,32908,1544315637411 reported a fatal error: * ABORTING region server asf905.gq1.ygridcore.net,32908,1544315637411: Unrecoverable exception while closing region testWithMapReduce,,1544315644043.97dff0ec285658ab3f73d5ca42a97b6e., still finishing close * Cause: java.io.IOException: The new max sequence id 6 is less than the old max sequence id 7 at org.apache.hadoop.hbase.wal.WALSplitter.writeRegionSequenceIdFile(WALSplitter.java:684) at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1134) at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1662) at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1479) at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
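The abort above stems from a sanity check that a region's recorded max sequence id must never move backwards. A minimal model of that invariant (not the actual WALSplitter.writeRegionSequenceIdFile code — just the check it enforces):

```java
// Illustrative model of the monotonic sequence-id invariant; not the
// actual WALSplitter code.
class SequenceIdCheck {
    // Writing a smaller max sequence id than the one already recorded
    // is an error: ids must only stay equal or grow for a region.
    static boolean canWrite(long recordedMaxSeqId, long newMaxSeqId) {
        return newMaxSeqId >= recordedMaxSeqId;
    }
}
```

In the failing test, a close raced so that 6 was written after 7 had been recorded, which this invariant rejects and the region server treats as unrecoverable.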
[jira] [Updated] (HBASE-20755) quickstart note about Web UI port changes in ref guide is rendered incorrectly
[ https://issues.apache.org/jira/browse/HBASE-20755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Somogyi updated HBASE-20755: -- Labels: beginner (was: ) > quickstart note about Web UI port changes in ref guide is rendered incorrectly > -- > > Key: HBASE-20755 > URL: https://issues.apache.org/jira/browse/HBASE-20755 > Project: HBase > Issue Type: Bug > Components: documentation >Reporter: Sean Busbey >Priority: Minor > Labels: beginner > Attachments: Untitled.png > > > The note in the quickstart guide about how the web ui ports changed only > renders the title as a note. the text is just a normal paragraph afterwards. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-20756) reference guide examples still contain references to EOM versions
[ https://issues.apache.org/jira/browse/HBASE-20756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Somogyi updated HBASE-20756:
----------------------------------
    Labels: beginner  (was: )

> reference guide examples still contain references to EOM versions
> -----------------------------------------------------------------
>
>                 Key: HBASE-20756
>                 URL: https://issues.apache.org/jira/browse/HBASE-20756
>             Project: HBase
>          Issue Type: Bug
>          Components: community, documentation
>            Reporter: Sean Busbey
>            Priority: Minor
>              Labels: beginner
>
> The reference guide still has examples that refer to EOM versions, e.g. this
> shell output that has 0.98 in it:
> {code}
> $ echo "describe 'test1'" | ./hbase shell -n
> Version 0.98.3-hadoop2, rd5e65a9144e315bb0a964e7730871af32f5018d5, Sat May 31 19:56:09 PDT 2014
>
> describe 'test1'
>
> DESCRIPTION                                          ENABLED
>  'test1', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NON true
>  E', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
>  VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIO
>  NS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =>
>  'false', BLOCKSIZE => '65536', IN_MEMORY => 'false'
>  , BLOCKCACHE => 'true'}
> 1 row(s) in 3.2410 seconds
> {code}
> These should be redone with a current release: ideally a version in the minor
> release line the docs are for, but even just updating to the stable pointer
> would be a big improvement.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HBASE-20754) quickstart guide should instruct folks to set JAVA_HOME to a JDK installation.
[ https://issues.apache.org/jira/browse/HBASE-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Somogyi updated HBASE-20754:
----------------------------------
    Labels: beginner  (was: )

> quickstart guide should instruct folks to set JAVA_HOME to a JDK installation.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-20754
>                 URL: https://issues.apache.org/jira/browse/HBASE-20754
>             Project: HBase
>          Issue Type: Bug
>          Components: documentation
>            Reporter: Sean Busbey
>            Priority: Major
>              Labels: beginner
>
> The quickstart guide currently instructs folks to set JAVA_HOME, but to the
> wrong place:
> {code}
> The JAVA_HOME variable should be set to a directory which contains the
> executable file bin/java. Most modern Linux operating systems provide a
> mechanism, such as /usr/bin/alternatives on RHEL or CentOS, for transparently
> switching between versions of executables such as Java. In this case, you can
> set JAVA_HOME to the directory containing the symbolic link to bin/java,
> which is usually /usr.
> JAVA_HOME=/usr
> {code}
> Instead, it should tell folks to point JAVA_HOME at a JDK installation and
> help them find one.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HBASE-21512) Introduce an AsyncClusterConnection and replace the usage of ClusterConnection
[ https://issues.apache.org/jira/browse/HBASE-21512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713981#comment-16713981 ]

Hudson commented on HBASE-21512:
--------------------------------
Results for branch HBASE-21512
	[build #11 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}

> Introduce an AsyncClusterConnection and replace the usage of ClusterConnection
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-21512
>                 URL: https://issues.apache.org/jira/browse/HBASE-21512
>             Project: HBase
>          Issue Type: Umbrella
>            Reporter: Duo Zhang
>            Priority: Major
>             Fix For: 3.0.0
>
> At least for the RSProcedureDispatcher, with CompletableFuture we no longer
> need to set a delay and use a thread pool, which could reduce both resource
> usage and latency.
>
> Once this is done, I think we can remove ClusterConnection completely and
> start rewriting the old sync client on top of the async client, which could
> shrink the client code base considerably.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HBASE-21487) Concurrent modify table ops can lead to unexpected results
[ https://issues.apache.org/jira/browse/HBASE-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713979#comment-16713979 ]

Allan Yang commented on HBASE-21487:
------------------------------------
{quote}
May be we can pass the old_table_descriptor also in ModifyTableProcedure. In MODIFY_TABLE_PREPARE step, we can compare the old_table_descriptor with current_table_descriptor, if they are not same then we can throw exception ...
{quote}
I think you are right; it is better to compare table descriptors directly.

> Concurrent modify table ops can lead to unexpected results
> ----------------------------------------------------------
>
>                 Key: HBASE-21487
>                 URL: https://issues.apache.org/jira/browse/HBASE-21487
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.0.0
>            Reporter: Syeda Arshiya Tabreen
>            Priority: Major
>
> Concurrent modifyTable or add/delete/modify columnFamily operations lead to
> an incorrect result. After HBASE-18893, the behavior of add/delete/modify
> column family under concurrent operations changed compared to branch-1. When
> one client is adding cf2 and another cf3, the final result in branch-1 is
> cf1,cf2,cf3, but now the outcome is either cf1,cf2 or cf1,cf3, depending on
> which ModifyTableProcedure executes last. This is because the new table
> descriptor is constructed before submitting the ModifyTableProcedure in the
> HMaster class, and that construction is not guarded by any lock.
>
> *Steps to reproduce*
> 1. Create table 't' with column family 'f1'.
> 2. Client-1 and Client-2 request to add column family 'f2' and 'f3' on table 't' concurrently.
>
> *Expected Result*
> The table should have three column families (f1, f2, f3).
>
> *Actual Result*
> Table 't' will have column families either (f1, f2) or (f1, f3).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
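The fix discussed in the comment above — submit the descriptor the change was computed from, and have the MODIFY_TABLE_PREPARE step reject the procedure when the table's descriptor has changed in the meantime — is essentially an optimistic compare-then-apply. A hypothetical, self-contained Java sketch of the idea (NOT the actual HBase code: String stands in for TableDescriptor, and ModifyTablePrepare and its members are made up for illustration):

```java
import java.io.IOException;
import java.util.Objects;

// Hypothetical sketch of the proposed check: the client submits the
// descriptor its change was based on, and the prepare step rejects the
// modification if the table has changed underneath it.
public class ModifyTablePrepare {
    // "cf1" models a table with one column family; the real type is TableDescriptor.
    static String currentDescriptor = "cf1";

    static synchronized void modifyTable(String expectedOld, String newDescriptor)
            throws IOException {
        // The MODIFY_TABLE_PREPARE-style comparison proposed in the comment above.
        if (!Objects.equals(expectedOld, currentDescriptor)) {
            throw new IOException("Table descriptor changed from '" + expectedOld
                + "' to '" + currentDescriptor + "'; recompute and retry");
        }
        currentDescriptor = newDescriptor;
    }

    public static void main(String[] args) throws IOException {
        // Both clients read "cf1" and compute their new descriptor from it.
        modifyTable("cf1", "cf1,cf2");          // client-1 adds f2: succeeds
        try {
            modifyTable("cf1", "cf1,cf3");      // client-2's base is stale: rejected
        } catch (IOException e) {
            System.out.println(e.getMessage()); // instead of silently dropping f2
        }
    }
}
```

With this check, the second client's stale modification fails loudly and can be retried against the current descriptor, rather than winning the race and silently discarding the other client's column family.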
[jira] [Commented] (HBASE-21487) Concurrent modify table ops can lead to unexpected results
[ https://issues.apache.org/jira/browse/HBASE-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713947#comment-16713947 ]

Syeda Arshiya Tabreen commented on HBASE-21487:
-----------------------------------------------
[~allan163], any thoughts on the issue?

> Concurrent modify table ops can lead to unexpected results
> ----------------------------------------------------------
>
>                 Key: HBASE-21487
>                 URL: https://issues.apache.org/jira/browse/HBASE-21487
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.0.0
>            Reporter: Syeda Arshiya Tabreen
>            Priority: Major
>
> Concurrent modifyTable or add/delete/modify columnFamily operations lead to
> an incorrect result. After HBASE-18893, the behavior of add/delete/modify
> column family under concurrent operations changed compared to branch-1. When
> one client is adding cf2 and another cf3, the final result in branch-1 is
> cf1,cf2,cf3, but now the outcome is either cf1,cf2 or cf1,cf3, depending on
> which ModifyTableProcedure executes last. This is because the new table
> descriptor is constructed before submitting the ModifyTableProcedure in the
> HMaster class, and that construction is not guarded by any lock.
>
> *Steps to reproduce*
> 1. Create table 't' with column family 'f1'.
> 2. Client-1 and Client-2 request to add column family 'f2' and 'f3' on table 't' concurrently.
>
> *Expected Result*
> The table should have three column families (f1, f2, f3).
>
> *Actual Result*
> Table 't' will have column families either (f1, f2) or (f1, f3).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (HBASE-21570) Add write buffer periodic flush support for AsyncBufferedMutator
Duo Zhang created HBASE-21570:
---------------------------------
             Summary: Add write buffer periodic flush support for AsyncBufferedMutator
                 Key: HBASE-21570
                 URL: https://issues.apache.org/jira/browse/HBASE-21570
             Project: HBase
          Issue Type: Sub-task
          Components: asyncclient, Client
            Reporter: Duo Zhang

Align with the BufferedMutator interface.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
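For context, "write buffer periodic flush" means flushing buffered mutations on a timer even when the buffer has not reached its size threshold, so sparse writers do not leave entries sitting in the buffer indefinitely. A generic, hypothetical Java sketch of the pattern (not the HBase API — PeriodicBuffer and all its members are made up; the real timer would be something like a ScheduledExecutorService driving periodicTick):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: a write buffer that flushes when full OR when a periodic tick fires.
public class PeriodicBuffer {
    private final int capacity;
    private final List<String> buffer = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>();

    PeriodicBuffer(int capacity) { this.capacity = capacity; }

    // Buffer a mutation; flush immediately if the size threshold is reached.
    synchronized void mutate(String m) {
        buffer.add(m);
        if (buffer.size() >= capacity) flush();
    }

    // Invoked by a timer in the real pattern, so small batches don't linger forever.
    synchronized void periodicTick() {
        if (!buffer.isEmpty()) flush();
    }

    private void flush() {
        flushed.addAll(buffer);
        buffer.clear();
    }

    synchronized int flushedCount() { return flushed.size(); }
}
```

The size-based path alone (what the description says needs aligning) would never flush a buffer holding fewer than `capacity` mutations; the periodic tick closes that gap.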
[jira] [Commented] (HBASE-21567) Allow overriding configs starting up the shell
[ https://issues.apache.org/jira/browse/HBASE-21567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713916#comment-16713916 ]

Peter Somogyi commented on HBASE-21567:
---------------------------------------
OK, it looks like rubocop is complaining about almost everything. Let's keep this patch as it is. +1

> Allow overriding configs starting up the shell
> ----------------------------------------------
>
>                 Key: HBASE-21567
>                 URL: https://issues.apache.org/jira/browse/HBASE-21567
>             Project: HBase
>          Issue Type: Improvement
>          Components: shell
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 3.0.0, 2.2.0, 2.1.3
>
>         Attachments: HBASE-21567.master.001.patch, HBASE-21567.master.002.patch, HBASE-21567.master.003.patch
>
> Needed to be able to point a local install at a remote cluster. I wanted to
> be able to do this:
> {code}
> ${HBASE_HOME}/bin/hbase shell -Dhbase.zookeeper.quorum=ZK0.remote.cluster.example.org,ZK1.remote.cluster.example.org,ZK2.remote.cluster.example.org
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)