[jira] [Commented] (HBASE-21568) Disable use of BlockCache for LoadIncrementalHFiles

2018-12-09 Thread Guanghao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714351#comment-16714351
 ] 

Guanghao Zhang commented on HBASE-21568:


{quote}bq. Are you eventually planning a backport of that change to 2.x?
{quote}
Yes, I plan to commit it to all branches from branch-2.0 onward, but it may 
take more time, so feel free to commit this :)

> Disable use of BlockCache for LoadIncrementalHFiles
> ---
>
> Key: HBASE-21568
> URL: https://issues.apache.org/jira/browse/HBASE-21568
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Fix For: 2.2.0, 2.1.2, 2.0.4
>
> Attachments: HBASE-21568.001.branch-2.0.patch
>
>
> [~vrodionov] added some API to {{CacheConfig}} via HBASE-17151 to allow 
> callers to specify that they do not want to use a block cache when reading an 
> HFile.
> If the BucketCache is set up to use the FileSystem, we can have a situation 
> where the client tries to instantiate the BucketCache and is disallowed due 
> to filesystem permissions:
> {code:java}
> 2018-12-03 16:22:03,032 ERROR [LoadIncrementalHFiles-0] bucket.FileIOEngine: Failed allocating cache on /mnt/hbase/cache.data
> java.io.FileNotFoundException: /mnt/hbase/cache.data (Permission denied)
>   at java.io.RandomAccessFile.open0(Native Method)
>   at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
>   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
>   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:124)
>   at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.<init>(FileIOEngine.java:81)
>   at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getIOEngineFromName(BucketCache.java:382)
>   at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.<init>(BucketCache.java:262)
>   at org.apache.hadoop.hbase.io.hfile.CacheConfig.getBucketCache(CacheConfig.java:633)
>   at org.apache.hadoop.hbase.io.hfile.CacheConfig.instantiateBlockCache(CacheConfig.java:663)
>   at org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:250)
>   at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:713)
>   at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles$3.call(LoadIncrementalHFiles.java:621)
>   at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles$3.call(LoadIncrementalHFiles.java:617)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> LoadIncrementalHFiles should pass {{CacheConfig.DISABLED}}.
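
As a rough illustration of the idea in the description, here is a minimal JDK-only sketch. The names CacheSettings and BulkLoadCacheSketch are hypothetical stand-ins, not HBase classes. It only shows why handing a shared "disabled" cache configuration to a client-side reader avoids the failure in the trace: the file-backed cache is never opened on the client.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical stand-in for a cache configuration; NOT the HBase CacheConfig class.
final class CacheSettings {
    // Shared instance meaning "do not create or use a block cache".
    static final CacheSettings DISABLED = new CacheSettings(null);

    private final String cacheFilePath; // null means caching is disabled

    CacheSettings(String cacheFilePath) {
        this.cacheFilePath = cacheFilePath;
    }

    boolean isEnabled() {
        return cacheFilePath != null;
    }

    // Opens the backing file only when caching is enabled. With an enabled
    // configuration this is the step that can fail with "Permission denied".
    Optional<RandomAccessFile> openBackingFile() {
        if (!isEnabled()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new RandomAccessFile(cacheFilePath, "rw"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class BulkLoadCacheSketch {
    // Models a client-side file inspection that never touches a cache file.
    static boolean inspectWithoutCache() {
        return !CacheSettings.DISABLED.openBackingFile().isPresent();
    }

    public static void main(String[] args) {
        System.out.println("cache skipped: " + inspectWithoutCache());
    }
}
```

The design point is that the caller states its intent explicitly instead of relying on server-side cache settings that happen to be present in the client's configuration.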



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server

2018-12-09 Thread Jingyun Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714343#comment-16714343
 ] 

Jingyun Tian commented on HBASE-21565:
--

[~Apache9] I ported the patch from HBASE-20976 and set holdLock to true. Could 
you help check this out?

> Delete dead server from dead server list too early leads to concurrent Server 
> Crash Procedures(SCP) for a same server
> -
>
> Key: HBASE-21565
> URL: https://issues.apache.org/jira/browse/HBASE-21565
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jingyun Tian
>Assignee: Jingyun Tian
>Priority: Critical
> Attachments: HBASE-21565.master.001.patch, 
> HBASE-21565.master.002.patch
>
>
> Two kinds of SCP can be scheduled for the same server during a cluster 
> restart: one is triggered by the ZK session timeout, and the other is 
> triggered when a new server reports in and causes the stale one to fail over. 
> The only barrier between these two kinds of SCP is a check of whether the 
> server is in the dead server list.
> {code}
> if (this.deadservers.isDeadServer(serverName)) {
>   LOG.warn("Expiration called on {} but crash processing already in progress", serverName);
>   return false;
> }
> {code}
> The problem is that when the master finishes initialization, it deletes all 
> stale servers from the dead server list. So by the time the SCP for the ZK 
> session timeout comes in, the barrier has already been removed.
> The following logs show how the problem occurs.
> {code}
> 2018-12-07,11:42:37,589 INFO 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=9, 
> state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure 
> server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
> 2018-12-07,11:42:58,007 INFO 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=444, 
> state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure 
> server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
> {code}
> Now we can see that two SCPs are scheduled for the same server.
> The first procedure finishes only after the second SCP has started.
> {code}
> 2018-12-07,11:43:08,038 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=9, 
> state=SUCCESS, hasLock=false; ServerCrashProcedure 
> server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false 
> in 30.5340sec
> {code}
> This leads to regions being assigned twice.
> {code}
> 2018-12-07,12:16:33,039 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: rit=OPEN, 
> location=c4-hadoop-tst-st28.bj,29100,1544154149607, table=test_failover, 
> region=459b3130b40caf3b8f3e1421766f4089 reported OPEN on 
> server=c4-hadoop-tst-st29.bj,29100,1544154149615 but state has otherwise
> {code}
> And here we can see that the server is removed from the dead server list 
> before the second SCP starts.
> {code}
> 2018-12-07,11:42:44,938 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed c4-hadoop-tst-st27.bj,29100,1544153846859 ; numProcessing=3
> {code}
> Thus we should not delete a dead server from the dead server list immediately.
> A patch to fix this problem will be uploaded later.
>  
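
The timeline in the description can be replayed with a small single-threaded model. This is a sketch under assumptions: DeadServerBarrierSketch and its methods are hypothetical and only mimic the shape of the quoted isDeadServer check; they are not the master's real code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the dead-server barrier; not HBase code.
final class DeadServerBarrierSketch {
    private final Set<String> deadServers = new HashSet<>();
    private final List<String> scheduledScps = new ArrayList<>();

    // Same shape as the quoted check: refuse to schedule a second SCP
    // while the server is still in the dead server list.
    boolean expireServer(String serverName, String reason) {
        if (deadServers.contains(serverName)) {
            return false; // crash processing already in progress
        }
        deadServers.add(serverName);
        scheduledScps.add(reason);
        return true;
    }

    // Models the master dropping stale entries once initialization finishes.
    void removeAfterMasterInit(String serverName) {
        deadServers.remove(serverName);
    }

    int scheduledCount() {
        return scheduledScps.size();
    }

    // Replays the sequence from the logs and returns how many SCPs were scheduled.
    static int replay(boolean removeEarly) {
        DeadServerBarrierSketch master = new DeadServerBarrierSketch();
        String server = "c4-hadoop-tst-st27.bj,29100,1544153846859";
        master.expireServer(server, "new server reported in");
        if (removeEarly) {
            master.removeAfterMasterInit(server); // barrier disappears here
        }
        master.expireServer(server, "ZK session timeout");
        return master.scheduledCount();
    }

    public static void main(String[] args) {
        System.out.println("early removal: " + replay(true) + " SCP(s)");
        System.out.println("entry kept:    " + replay(false) + " SCP(s)");
    }
}
```

With early removal the second expiration passes the check and two SCPs are scheduled; when the entry is kept, the second call is rejected.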





[jira] [Updated] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server

2018-12-09 Thread Jingyun Tian (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingyun Tian updated HBASE-21565:
-
Affects Version/s: 3.0.0



[jira] [Commented] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server

2018-12-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714324#comment-16714324
 ] 

Hadoop QA commented on HBASE-21565:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m  0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m  9s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 48s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 28s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m  7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m  4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 43s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  8m 15s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}126m 13s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}161m 57s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21565 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12951150/HBASE-21565.master.002.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 6abc41e3b4b5 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 79d90c87b5 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15229/testReport/ |
| Max. process+thread count | 5031 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/15229/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Updated] (HBASE-21572) The "progress" object in "Compactor" is not thread-safe, this may cause the misleading progress information on the web UI.

2018-12-09 Thread lixiaobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lixiaobao updated HBASE-21572:
--
Attachment: HBASE-21572.patch
Status: Patch Available  (was: Open)

>  The "progress" object in "Compactor" is not thread-safe, this may cause the 
> misleading progress information on the web UI.
> ---
>
> Key: HBASE-21572
> URL: https://issues.apache.org/jira/browse/HBASE-21572
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 2.0.0, 2.1.0, 1.4.0, 1.3.0, 1.2.0, 3.0.0
>Reporter: lixiaobao
>Assignee: lixiaobao
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21572.patch
>
>
> When the number of compaction threads is set to more than 1, multiple threads 
> on the region server may use the "compactor" of the same "store" to execute 
> compactions. However, the "progress" object in "Compactor" is not 
> thread-safe, which may cause misleading progress information on the web UI.
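
One way to avoid the sharing problem, sketched with the JDK only: give each running compaction its own progress object and back the counter with an AtomicLong. The class and method names below are hypothetical; the attached patch is not shown in this thread and may take a different approach.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of per-compaction progress tracking; not HBase code.
final class CompactionProgressSketch {

    // One instance per running compaction instead of one shared field.
    static final class Progress {
        private final long totalCells;
        private final AtomicLong compactedCells = new AtomicLong();

        Progress(long totalCells) {
            this.totalCells = totalCells;
        }

        void cellDone() {
            compactedCells.incrementAndGet();
        }

        long done() {
            return compactedCells.get();
        }

        double fraction() {
            return totalCells == 0 ? 1.0 : (double) done() / totalCells;
        }
    }

    // Runs several simulated compactions concurrently, each with its own
    // Progress, and returns the final per-compaction cell counts.
    static long[] runCompactions(int threads, int cellsEach) {
        Progress[] progress = new Progress[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            Progress p = new Progress(cellsEach);
            progress[i] = p;
            workers[i] = new Thread(() -> {
                for (int c = 0; c < cellsEach; c++) {
                    p.cellDone();
                }
            });
            workers[i].start();
        }
        long[] result = new long[threads];
        for (int i = 0; i < threads; i++) {
            try {
                workers[i].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            result[i] = progress[i].done();
        }
        return result;
    }

    public static void main(String[] args) {
        long[] counts = runCompactions(2, 1000);
        System.out.println(counts[0] + " / " + counts[1]);
    }
}
```

Because no Progress instance is shared between compactions, a reader such as a status page would see one consistent counter per compaction rather than a value overwritten by whichever thread wrote last.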





[jira] [Created] (HBASE-21572) The "progress" object in "Compactor" is not thread-safe, this may cause the misleading progress information on the web UI.

2018-12-09 Thread lixiaobao (JIRA)
lixiaobao created HBASE-21572:
-

 Summary:  The "progress" object in "Compactor" is not thread-safe, 
this may cause the misleading progress information on the web UI.
 Key: HBASE-21572
 URL: https://issues.apache.org/jira/browse/HBASE-21572
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 2.0.0, 2.1.0, 1.4.0, 1.3.0, 1.2.0, 3.0.0
Reporter: lixiaobao
Assignee: lixiaobao
 Fix For: 3.0.0


When the number of compaction threads is set to more than 1, multiple threads 
on the region server may use the "compactor" of the same "store" to execute 
compactions. However, the "progress" object in "Compactor" is not thread-safe, 
which may cause misleading progress information on the web UI.





[jira] [Commented] (HBASE-21568) Disable use of BlockCache for LoadIncrementalHFiles

2018-12-09 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714257#comment-16714257
 ] 

Josh Elser commented on HBASE-21568:


Thanks for the ping, [~zghaobac]!

Only looking at the description of 21514 – I would imagine so :). Are you 
eventually planning a backport of that change to 2.x? If not, I can just think 
of this as a stop-gap until your better fix lands. It is silly that a client 
operation would even think about instantiating a block cache ;)



[jira] [Commented] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server

2018-12-09 Thread Jingyun Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714258#comment-16714258
 ] 

Jingyun Tian commented on HBASE-21565:
--

[~allan163] Thanks for your comment. I ported your patch to this one. I also 
think [~Apache9] Duo's opinion is reasonable: we can set holdLock to true to 
prevent multiple SCPs for the same server from running concurrently.
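
To make the effect of holding the lock concrete, here is a small deterministic model. It is a sketch under assumptions: HoldLockSketch is hypothetical and does not reproduce the procedure executor. Two three-step procedures target the same server on a round-robin scheduler; when the lock is held for the whole procedure, their steps no longer interleave.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "hold the lock for the whole procedure"; not HBase code.
final class HoldLockSketch {

    // Runs two 3-step procedures for the same server and returns the order
    // in which their steps executed.
    static List<String> run(boolean holdLock) {
        final int steps = 3;
        List<String> trace = new ArrayList<>();
        int[] nextStep = {0, 0};
        int owner = -1; // index of the procedure holding the server lock, -1 if free

        while (nextStep[0] < steps || nextStep[1] < steps) {
            for (int p = 0; p < 2; p++) {
                if (nextStep[p] >= steps) {
                    continue; // this procedure has finished
                }
                if (owner != -1 && owner != p) {
                    continue; // lock is held by the other procedure
                }
                owner = p; // acquire (or keep) the lock
                trace.add("p" + p + ":step" + nextStep[p]);
                nextStep[p]++;
                boolean finished = nextStep[p] == steps;
                if (finished || !holdLock) {
                    owner = -1; // release after each step unless holdLock is set
                }
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println("holdLock=false: " + run(false));
        System.out.println("holdLock=true:  " + run(true));
    }
}
```

Without holdLock the trace alternates between the two procedures; with holdLock the first procedure runs all of its steps before the second one starts.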



[jira] [Updated] (HBASE-21565) Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server

2018-12-09 Thread Jingyun Tian (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingyun Tian updated HBASE-21565:
-
Attachment: HBASE-21565.master.002.patch



[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-12-09 Thread Reid Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714223#comment-16714223
 ] 

Reid Chan commented on HBASE-21246:
---

Thanks for the ping [~an...@apache.org]. I need some time, will be back.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.41.txt, 21246.43.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> HBASE-21246.master.001.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent 
> either the filename in a distributed filesystem environment or the name of 
> the stream when the WAL is backed by a log stream.
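
A minimal sketch of what such an abstraction could look like. Only the interface name and the getName method come from the description; the String return type and the two implementations (FileWalIdentity, StreamWalIdentity) are assumptions made for illustration.

```java
import java.util.Objects;

// Interface name and getName() are taken from the issue description;
// the String return type is an assumption.
interface WALIdentity {
    String getName();
}

// Hypothetical identity for a WAL stored as a file on a distributed filesystem.
final class FileWalIdentity implements WALIdentity {
    private final String path;

    FileWalIdentity(String path) {
        this.path = Objects.requireNonNull(path);
    }

    @Override
    public String getName() {
        // Use the last path component as the name.
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }
}

// Hypothetical identity for a WAL backed by a log stream.
final class StreamWalIdentity implements WALIdentity {
    private final String streamName;

    StreamWalIdentity(String streamName) {
        this.streamName = Objects.requireNonNull(streamName);
    }

    @Override
    public String getName() {
        return streamName;
    }
}

final class WalIdentityDemo {
    // Callers depend only on the interface, not on the storage backend.
    static String describe(WALIdentity id) {
        return "wal=" + id.getName();
    }

    public static void main(String[] args) {
        System.out.println(describe(new FileWalIdentity("/hbase/WALs/rs1/wal.1544315637411")));
        System.out.println(describe(new StreamWalIdentity("rs1-log-stream")));
    }
}
```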





[jira] [Updated] (HBASE-21571) The TestTableSnapshotInputFormat is flaky

2018-12-09 Thread Zheng Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Hu updated HBASE-21571:
-
Attachment: jenkins.tar.gz

> The TestTableSnapshotInputFormat is flaky
> -
>
> Key: HBASE-21571
> URL: https://issues.apache.org/jira/browse/HBASE-21571
> Project: HBase
>  Issue Type: Bug
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Attachments: jenkins.tar.gz
>
>
> see: 
> https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.0/lastSuccessfulBuild/artifact/dashboard.html
> The RS aborted because:
> {code}
> 2018-12-09 00:34:18,635 WARN  [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=34270] master.MasterRpcServices(514): asf905.gq1.ygridcore.net,32908,1544315637411 reported a fatal error:
> * ABORTING region server asf905.gq1.ygridcore.net,32908,1544315637411: Unrecoverable exception while closing region testWithMapReduce,,1544315644043.97dff0ec285658ab3f73d5ca42a97b6e., still finishing close *
> Cause:
> java.io.IOException: The new max sequence id 6 is less than the old max sequence id 7
> at org.apache.hadoop.hbase.wal.WALSplitter.writeRegionSequenceIdFile(WALSplitter.java:684)
> at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1134)
> at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1662)
> at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1479)
> at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
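
The exception message suggests a monotonicity check on the region's max sequence id. The sketch below shows that kind of guard in isolation. It is an assumption about the shape of the check, not the actual WALSplitter code, and it throws an unchecked exception instead of IOException to keep the example short.

```java
// Hypothetical guard illustrating the failing check; not the HBase implementation.
final class SequenceIdCheckSketch {

    // Accepts the new max sequence id only if it does not go backwards.
    static long checkedNewMaxSeqId(long oldMaxSeqId, long newMaxSeqId) {
        if (newMaxSeqId < oldMaxSeqId) {
            throw new IllegalStateException("The new max sequence id " + newMaxSeqId
                + " is less than the old max sequence id " + oldMaxSeqId);
        }
        return newMaxSeqId;
    }

    public static void main(String[] args) {
        System.out.println(checkedNewMaxSeqId(7, 8)); // accepted
        try {
            checkedNewMaxSeqId(7, 6); // the situation in the log above
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the flaky test the close path apparently reached such a check with a smaller id than the one already recorded, which is what aborted the region server.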





[jira] [Created] (HBASE-21571) The TestTableSnapshotInputFormat is flaky

2018-12-09 Thread Zheng Hu (JIRA)
Zheng Hu created HBASE-21571:


 Summary: The TestTableSnapshotInputFormat is flaky
 Key: HBASE-21571
 URL: https://issues.apache.org/jira/browse/HBASE-21571
 Project: HBase
  Issue Type: Bug
Reporter: Zheng Hu
Assignee: Zheng Hu


see: 
https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.0/lastSuccessfulBuild/artifact/dashboard.html

The RS aborted because:
{code}
2018-12-09 00:34:18,635 WARN  [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=34270] master.MasterRpcServices(514): asf905.gq1.ygridcore.net,32908,1544315637411 reported a fatal error:
* ABORTING region server asf905.gq1.ygridcore.net,32908,1544315637411: Unrecoverable exception while closing region testWithMapReduce,,1544315644043.97dff0ec285658ab3f73d5ca42a97b6e., still finishing close *
Cause:
java.io.IOException: The new max sequence id 6 is less than the old max sequence id 7
at org.apache.hadoop.hbase.wal.WALSplitter.writeRegionSequenceIdFile(WALSplitter.java:684)
at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1134)
at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1662)
at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1479)
at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}





[jira] [Updated] (HBASE-20755) quickstart note about Web UI port changes in ref guide is rendered incorrectly

2018-12-09 Thread Peter Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Somogyi updated HBASE-20755:
--
Labels: beginner  (was: )

> quickstart note about Web UI port changes in ref guide is rendered incorrectly
> --
>
> Key: HBASE-20755
> URL: https://issues.apache.org/jira/browse/HBASE-20755
> Project: HBase
>  Issue Type: Bug
>  Components: documentation
>Reporter: Sean Busbey
>Priority: Minor
>  Labels: beginner
> Attachments: Untitled.png
>
>
> In the quickstart guide, the note about the Web UI port changes renders only 
> its title as a note; the text that follows is rendered as a normal paragraph.





[jira] [Updated] (HBASE-20756) reference guide examples still contain references to EOM versions

2018-12-09 Thread Peter Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Somogyi updated HBASE-20756:
--
Labels: beginner  (was: )

> reference guide examples still contain references to EOM versions
> -
>
> Key: HBASE-20756
> URL: https://issues.apache.org/jira/browse/HBASE-20756
> Project: HBase
>  Issue Type: Bug
>  Components: community, documentation
>Reporter: Sean Busbey
>Priority: Minor
>  Labels: beginner
>
> The reference guide still has examples that refer to EOM versions, e.g. this 
> shell output that has 0.98 in it:
> {code}
> $ echo "describe 'test1'" | ./hbase shell -n
> Version 0.98.3-hadoop2, rd5e65a9144e315bb0a964e7730871af32f5018d5, Sat May 31 
> 19:56:09 PDT 2014
> describe 'test1'
> DESCRIPTION  ENABLED
>  'test1', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NON true
>  E', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
>   VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIO
>  NS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =>
>  'false', BLOCKSIZE => '65536', IN_MEMORY => 'false'
>  , BLOCKCACHE => 'true'}
> 1 row(s) in 3.2410 seconds
> {code}
> These should be redone with a current release, ideally a version in the minor 
> release line the docs are for, but even just updating to the stable pointer 
> would be a big improvement.





[jira] [Updated] (HBASE-20754) quickstart guide should instruct folks to set JAVA_HOME to a JDK installation.

2018-12-09 Thread Peter Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Somogyi updated HBASE-20754:
--
Labels: beginner  (was: )

> quickstart guide should instruct folks to set JAVA_HOME to a JDK installation.
> --
>
> Key: HBASE-20754
> URL: https://issues.apache.org/jira/browse/HBASE-20754
> Project: HBase
>  Issue Type: Bug
>  Components: documentation
>Reporter: Sean Busbey
>Priority: Major
>  Labels: beginner
>
> The quickstart guide currently instructs folks to set JAVA_HOME, but to the 
> wrong place:
> {code}
> The JAVA_HOME variable should be set to a directory which contains the 
> executable file bin/java. Most modern Linux operating systems provide a 
> mechanism, such as /usr/bin/alternatives on RHEL or CentOS, for transparently 
> switching between versions of executables such as Java. In this case, you can 
> set JAVA_HOME to the directory containing the symbolic link to bin/java, 
> which is usually /usr.
> JAVA_HOME=/usr
> {code}
> Instead, it should tell folks to point it to a JDK installation and help them 
> find one.
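
One possible way to find a JDK installation is to resolve the symbolic link behind the `java` executable and then check for the compiler next to it. The sketch below is only an illustration under stated assumptions: it assumes the layout `<home>/bin/java`, it assumes a JDK ships `bin/javac` (or `javac.exe`) alongside `bin/java`, and the class name `JavaHomeCheck` and its methods are hypothetical and not part of HBase.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative helper (not part of HBase): locate and sanity-check a JDK home. */
class JavaHomeCheck {

  /** Follows symlinks from a java executable and returns the directory two levels up. */
  static Path resolveHome(Path javaExecutable) throws IOException {
    // Assumes the real file lives at <home>/bin/java.
    Path real = javaExecutable.toRealPath();
    return real.getParent().getParent();
  }

  /** Returns true if dir contains both bin/java and bin/javac (or the .exe variants). */
  static boolean looksLikeJdk(Path dir) {
    Path bin = dir.resolve("bin");
    return hasExecutable(bin, "java") && hasExecutable(bin, "javac");
  }

  private static boolean hasExecutable(Path bin, String name) {
    return Files.isRegularFile(bin.resolve(name))
        || Files.isRegularFile(bin.resolve(name + ".exe"));
  }

  public static void main(String[] args) throws IOException {
    // The default path is an assumption; pass the path to your java executable instead.
    Path home = resolveHome(Paths.get(args.length > 0 ? args[0] : "/usr/bin/java"));
    System.out.println(home
        + (looksLikeJdk(home) ? " looks like a JDK" : " does not look like a JDK"));
  }
}
```

This is a heuristic. On some JDK 8 layouts the link may resolve into a nested `jre/bin` directory, in which case the resolved directory is the JRE and the check reports that it does not look like a JDK.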





[jira] [Commented] (HBASE-21512) Introduce an AsyncClusterConnection and replace the usage of ClusterConnection

2018-12-09 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713981#comment-16713981
 ] 

Hudson commented on HBASE-21512:


Results for branch HBASE-21512
[build #11 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/11//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Introduce an AsyncClusterConnection and replace the usage of ClusterConnection
> --
>
> Key: HBASE-21512
> URL: https://issues.apache.org/jira/browse/HBASE-21512
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
>
> At least for the RSProcedureDispatcher, with CompletableFuture we do not need 
> to set a delay and use a thread pool any more, which could reduce the 
> resource usage and also the latency.
> Once this is done, I think we can remove the ClusterConnection completely, 
> and start to rewrite the old sync client based on the async client, which 
> could reduce the code base a lot for our client.
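
As a generic illustration of the CompletableFuture point above (not the RSProcedureDispatcher code itself), follow-up work can be attached to a future as a callback, so nothing has to wait on a delay or block a pooled thread while the remote call is in flight. `AsyncDispatchSketch` and `dispatch` are hypothetical names, and the `String` response type is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical illustration: react to a remote call's completion via a callback. */
class AsyncDispatchSketch {
  // Not thread-safe; enough for a single-threaded demonstration.
  final List<String> log = new ArrayList<>();

  /** Registers follow-up work on the future; no thread blocks while the call is pending. */
  void dispatch(String server, CompletableFuture<String> remoteCall) {
    remoteCall.whenComplete((response, error) -> {
      if (error != null) {
        log.add(server + " failed: " + error.getMessage());
      } else {
        log.add(server + " ok: " + response);
      }
    });
  }

  public static void main(String[] args) {
    AsyncDispatchSketch sketch = new AsyncDispatchSketch();
    CompletableFuture<String> call = new CompletableFuture<>();
    sketch.dispatch("rs1", call);
    // The callback runs as soon as the future is completed.
    call.complete("done");
    System.out.println(sketch.log);
  }
}
```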





[jira] [Commented] (HBASE-21487) Concurrent modify table ops can lead to unexpected results

2018-12-09 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713979#comment-16713979
 ] 

Allan Yang commented on HBASE-21487:


{quote}
Maybe we can also pass the old_table_descriptor in ModifyTableProcedure. In the 
MODIFY_TABLE_PREPARE step, we can compare the old_table_descriptor with the 
current_table_descriptor; if they are not the same, we can throw an exception ...
{quote}
I think you are right, it is better to compare table descriptors directly.
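
The comparison proposed in the quote amounts to a compare-and-set on the table descriptor. The sketch below is a simplified model and not HBase code: a `Set<String>` of column family names stands in for the table descriptor, and `TableModel`, `snapshot`, and `modify` are hypothetical names chosen for illustration.

```java
import java.util.ConcurrentModificationException;
import java.util.Set;
import java.util.TreeSet;

/** Simplified model (not HBase code): a set of family names stands in for a descriptor. */
class TableModel {
  private Set<String> current = new TreeSet<>();

  TableModel(Set<String> families) {
    current.addAll(families);
  }

  /** Returns a copy of the current state, as a client would read it before modifying. */
  synchronized Set<String> snapshot() {
    return new TreeSet<>(current);
  }

  /** Applies updated only if expectedOld still matches the current state. */
  synchronized void modify(Set<String> expectedOld, Set<String> updated) {
    if (!current.equals(expectedOld)) {
      throw new ConcurrentModificationException(
          "Table descriptor changed since the modification was built");
    }
    current = new TreeSet<>(updated);
  }

  public static void main(String[] args) {
    Set<String> initial = new TreeSet<>();
    initial.add("f1");
    TableModel table = new TableModel(initial);

    // Both clients build their new descriptor from the same old state.
    Set<String> old1 = table.snapshot();
    Set<String> old2 = table.snapshot();
    Set<String> new1 = new TreeSet<>(old1);
    new1.add("f2");
    Set<String> new2 = new TreeSet<>(old2);
    new2.add("f3");

    table.modify(old1, new1);
    try {
      table.modify(old2, new2); // stale: rejected instead of silently dropping f2
    } catch (ConcurrentModificationException e) {
      Set<String> fresh = table.snapshot();
      Set<String> retry = new TreeSet<>(fresh);
      retry.add("f3");
      table.modify(fresh, retry);
    }
    System.out.println(table.snapshot());
  }
}
```

In this model the second, stale request fails instead of overwriting the first, and the client that retries from a fresh snapshot ends up with all three families.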

> Concurrent modify table ops can lead to unexpected results
> --
>
> Key: HBASE-21487
> URL: https://issues.apache.org/jira/browse/HBASE-21487
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.0.0
>Reporter: Syeda Arshiya Tabreen
>Priority: Major
>
> Concurrent modifyTable or add/delete/modify columnFamily operations lead to 
> incorrect results. After HBASE-18893, the behavior of add/delete/modify column 
> family during concurrent operations changed compared to branch-1. When one 
> client is adding cf2 and another is adding cf3, the final result in branch-1 
> is cf1,cf2,cf3, but now the outcome is either cf1,cf2 or cf1,cf3, depending 
> on which ModifyTableProcedure executes last. This is because the new table 
> descriptor is constructed in the HMaster class before the ModifyTableProcedure 
> is submitted, and that construction is not guarded by any lock.
> *Steps to reproduce*
> 1. Create table 't' with column family 'f1'.
> 2. Client-1 and Client-2 concurrently request to add column families 'f2' and 
> 'f3', respectively, to table 't'.
> *Expected Result*
> Table 't' should have three column families (f1,f2,f3).
> *Actual Result*
> Table 't' will have either column families (f1,f2) or (f1,f3).





[jira] [Commented] (HBASE-21487) Concurrent modify table ops can lead to unexpected results

2018-12-09 Thread Syeda Arshiya Tabreen (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713947#comment-16713947
 ] 

Syeda Arshiya Tabreen commented on HBASE-21487:
---

[~allan163], any thoughts on the issue?






[jira] [Created] (HBASE-21570) Add write buffer periodic flush support for AsyncBufferedMutator

2018-12-09 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-21570:
-

 Summary: Add write buffer periodic flush support for 
AsyncBufferedMutator
 Key: HBASE-21570
 URL: https://issues.apache.org/jira/browse/HBASE-21570
 Project: HBase
  Issue Type: Sub-task
  Components: asyncclient, Client
Reporter: Duo Zhang


Align with the BufferedMutator interface.
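
For readers unfamiliar with the idea, a write buffer with periodic flush sends its contents either when it fills up or when a timer fires, whichever comes first. The sketch below is a generic illustration only; `PeriodicFlushBuffer`, its constructor parameters, and the `Consumer` sink are hypothetical and do not reflect the BufferedMutator or AsyncBufferedMutator APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical illustration of a write buffer with size-based and periodic flushing. */
class PeriodicFlushBuffer<T> implements AutoCloseable {
  private final List<T> buffer = new ArrayList<>();
  private final int maxBuffered;
  private final Consumer<List<T>> sink;
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  PeriodicFlushBuffer(int maxBuffered, long periodMs, Consumer<List<T>> sink) {
    this.maxBuffered = maxBuffered;
    this.sink = sink;
    // Flush whatever has accumulated every periodMs, even if the buffer is not full.
    // Note: if the sink throws, the executor suppresses later periodic runs.
    timer.scheduleAtFixedRate(this::flush, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  /** Buffers an item and flushes immediately once maxBuffered items are pending. */
  synchronized void add(T item) {
    buffer.add(item);
    if (buffer.size() >= maxBuffered) {
      flush();
    }
  }

  /** Hands the pending items to the sink as one batch; does nothing if empty. */
  synchronized void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    List<T> batch = new ArrayList<>(buffer);
    buffer.clear();
    sink.accept(batch);
  }

  /** Stops the timer and flushes any remaining items. */
  @Override
  public void close() {
    timer.shutdown();
    flush();
  }
}
```

The buffer must be closed when no longer needed; otherwise the timer thread keeps the JVM alive.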





[jira] [Commented] (HBASE-21567) Allow overriding configs starting up the shell

2018-12-09 Thread Peter Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713916#comment-16713916
 ] 

Peter Somogyi commented on HBASE-21567:
---

OK, it looks like rubocop is complaining about almost everything. Let's take 
this patch as it is.

+1

> Allow overriding configs starting up the shell
> --
>
> Key: HBASE-21567
> URL: https://issues.apache.org/jira/browse/HBASE-21567
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3
>
> Attachments: HBASE-21567.master.001.patch, 
> HBASE-21567.master.002.patch, HBASE-21567.master.003.patch
>
>
> Needed to be able to point a local install at a remote cluster. I wanted to 
> be able to do this:
> ${HBASE_HOME}/bin/hbase shell 
> -Dhbase.zookeeper.quorum=ZK0.remote.cluster.example.org,ZK1.remote.cluster.example.org,ZK2.remote.cluster.example.org


