[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-08-23 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139293#comment-16139293
 ] 

Enis Soztutar commented on HBASE-18432:
---

I don't understand the concerns here. You are right that the bits for the LT 
component bound the maximum number of events that can happen during the largest 
clock skew, or the largest clock correction (leap second, etc.). That is why the 
PT+LT bit split was chosen as a balance between the max time representable in PT 
and the max logical events representable in LT. 
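
The PT/LT trade-off can be illustrated with a small sketch. It assumes a 64-bit timestamp with PT (in milliseconds) in the high bits and LT in the low 20 bits. The 20-bit LT width comes from the issue description; the resulting 44-bit PT width, the class name and the method names are assumptions made for illustration, not the HBase implementation.

```java
/** Sketch only: packs a hybrid timestamp as [PT millis | LT] into one long. */
final class HlcLayout {
    static final int LT_BITS = 20;                    // from the issue text: ~1M logical events
    static final int PT_BITS = 64 - LT_BITS;          // assumption: the remaining 44 bits hold PT
    static final long LT_MASK = (1L << LT_BITS) - 1;  // low 20 bits set

    /** Combine physical time (ms) and a logical counter into one 64-bit value. */
    static long pack(long ptMillis, long lt) {
        if (lt < 0 || lt > LT_MASK) {
            throw new IllegalArgumentException("logical time overflow: " + lt);
        }
        return (ptMillis << LT_BITS) | lt;
    }

    /** Extract the physical component. */
    static long physical(long ts) {
        return ts >>> LT_BITS;
    }

    /** Extract the logical component. */
    static long logical(long ts) {
        return ts & LT_MASK;
    }

    public static void main(String[] args) {
        long ts = pack(1_503_446_400_000L, 42);
        System.out.println(physical(ts) + " / " + logical(ts));
        System.out.println("logical slots per PT value: " + (LT_MASK + 1));
    }
}
```

Every bit given to LT is taken from PT, and for non-negative values the packed longs compare in (PT, LT) order, which is why the split has to be a compromise.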

We are already running with max_skew=30 by default, I believe, and the default 
action should be to kick the server out of the cluster. However, running with 
more than 10 sec of clock skew has been shown to cause problems anyway, 
regardless of whether there is HLC or not. The answer to the concerns here is to 
always run with NTP, which is the recommended setting for HBase clusters anyway. 

> Prevent clock from getting stuck after update()
> ---
>
> Key: HBASE-18432
> URL: https://issues.apache.org/jira/browse/HBASE-18432
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Appy
>Assignee: Appy
> Attachments: HBASE-18432.HBASE-14070.HLC.001.patch, 
> HBASE-18432.HBASE-14070.HLC.002.patch
>
>
> There were a [bunch of 
> problems|https://issues.apache.org/jira/browse/HBASE-14070?focusedCommentId=16094013=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16094013]
>  (also copied below) with the clock getting stuck after a call to update() 
> until its own system time caught up.
> 
> PT = physical time, LT = logical time, ST = system time, X = don't care terms
> 
> Core issue:
> - Note that in the current implementation, we pass the master clock to the RS 
> in open/close region requests and the RS clock to the master in the responses. 
> Both update their own time on receiving these requests/responses.
> - On receiving a clock ahead of its own, a node updates its own clock to the 
> received PT+LT, and keeps increasing LT until its own ST catches up with that PT.
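
The behaviour described in the two bullets above can be reproduced with a minimal sketch. The class below is an illustration of the described rules (adopt a remote clock that is ahead, then bump LT until ST catches up), not the actual HBase clock; the class name, the injectable time source and the method names are all hypothetical.

```java
import java.util.function.LongSupplier;

/** Sketch of the "stuck PT" behaviour described above; not the HBase code. */
final class StuckHlc {
    private final LongSupplier systemTime; // ST source, injectable for the demo
    private long pt;                       // physical component
    private long lt;                       // logical component

    StuckHlc(LongSupplier systemTime) {
        this.systemTime = systemTime;
        this.pt = systemTime.getAsLong();
    }

    /** Timestamp for a local event: PT only advances once ST has passed it. */
    long[] now() {
        long st = systemTime.getAsLong();
        if (st > pt) {
            pt = st;
            lt = 0;
        } else {
            lt++; // PT is frozen, so only the logical counter moves
        }
        return new long[] {pt, lt};
    }

    /** Merge a remote clock: adopt it if it is ahead of ours. */
    void update(long remotePt, long remoteLt) {
        if (remotePt > pt || (remotePt == pt && remoteLt > lt)) {
            pt = remotePt;
            lt = remoteLt;
        }
    }

    public static void main(String[] args) {
        long[] st = {10};                  // RS system time, in seconds
        StuckHlc rs = new StuckHlc(() -> st[0]);
        rs.update(20, 0);                  // master is 10 sec ahead
        for (int i = 0; i < 3; i++) {
            long[] t = rs.now();
            System.out.println("(" + t[0] + ", " + t[1] + ")");
            st[0] += 1;                    // ST advances, PT does not
        }
    }
}
```

In the demo the PT stays at 20 while only LT grows, which is the "stuck" state the issue title refers to.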
> 
> Proposed solution:
> Keep track of the clock skew and, instead of storing physical time, always 
> compute it as system time plus skew.
> On update(), recalculate the skew and check whether it exceeds max_skew.
> On toTimestamp(), calculate PT = ST+skew.
> -
> -
> Issues with current approach:
> 
> Problem 1: Logical time window too small.
> RS clock (10, X)
> Master clock (20, X)
> Master --request-> RS
> RS clock (20, X)
> While the RS's physical Java clock (which backs the physical component of the 
> HLC clock) still needs 10 sec to catch up, we keep incrementing the logical 
> component. That means that, in the worst case, our logical clock window must be 
> big enough to hold all the events that can happen within the max skew time.
> The problem is that this doesn't seem to be the case. Our logical window is ~1M 
> events (20 bits) and the max skew time is 30 sec, which allows only ~33k writes 
> per second. That is quite low: we can easily see 150k updates per second on a 
> beefy server with 1k values.
> Even 22 bits won't be enough. We'd need a minimum of 23 bits and a 20 sec max 
> skew time to support ~420k events per second in the worst-case clock skew.
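
The arithmetic behind these figures can be checked with a few lines of Java. The helper name is illustrative; the model simply assumes PT stays frozen for the whole skew period, so every event in that period consumes one logical slot. The issue text rounds 2^20 down to 1M, which is why it quotes ~33k rather than ~35k.

```java
/** Back-of-the-envelope check of the logical-window figures quoted above. */
final class LogicalWindowBudget {
    /** Events per second sustainable if PT is frozen for skewSeconds. */
    static long maxEventsPerSecond(int logicalBits, long skewSeconds) {
        return (1L << logicalBits) / skewSeconds;
    }

    public static void main(String[] args) {
        System.out.println(maxEventsPerSecond(20, 30)); // 34952
        System.out.println(maxEventsPerSecond(22, 30)); // 139810, still below 150k
        System.out.println(maxEventsPerSecond(23, 20)); // 419430, the ~420k figure
    }
}
```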
> 
> Problem 2: Cascading logical time increment.
> Consider more RSs, say 3 RSs and 1 master, with a max skew of 30 
> sec.
> HLC clocks (physical time, logical time); X = don't care
> RS1: (50, 100k)
> Master: (40, X)
> RS2: (30, X)
> RS3: (20, X) 
> [RS3's ST is behind RS1's by 30 sec.]
> RS1 replies to master and sends its clock (50, X).
> Master's clock becomes (50, X). It'll be another 10 sec before its own physical 
> clock reaches 50, so the HLC's PT will remain 50 for the next 10 sec.
> Master --> RS2
> RS2's clock = (50, X).
> RS2 keeps incrementing LT on writes (since its own physical clock is behind) for 
> a few seconds before it replies to master with (50, X + few 100k).
> Master's clock = (50, X + few 100k) [since master's physical clock, which was 
> 10 seconds behind, hasn't caught up yet, PT remains 50].
> Master --> RS3
> RS3's clock = (50, X + few 100k) 
> But RS3's ST is behind RS1's ST by 30 sec, which means it'll keep 
> incrementing LT for the next 30 sec (unless it gets a newer clock from master).
> The problem is that RS3 now has a much smaller LT window than the full 1M!
> —
> Problem 3: Single bad RS clock crashing the cluster:
> If a single RS's clock is bad and runs a bit fast, it'll race ahead and keep 
> pulling master's PT with it. If 'real time' is, say, 20, max skew time is 10, 
> and the bad RS is at time 29.9, it'll pull master to 29.9 (via its next 
> response), and then any RS below 19.9, i.e. just 0.1 sec behind real time, will 
> die because its skew exceeds the max.
> This can bring whole clusters down!
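
The numbers in Problem 3 can be walked through with a short sketch. It assumes a node aborts when a received PT is ahead of its own time by more than max skew and otherwise adopts the larger value. Times are in milliseconds to avoid floating-point rounding, and the class and method names are hypothetical.

```java
/** Sketch of the Problem 3 scenario; times are in milliseconds. */
final class SkewCascade {
    /** True if a node at localMs must abort on seeing remoteMs. */
    static boolean exceedsMaxSkew(long localMs, long remoteMs, long maxSkewMs) {
        return remoteMs - localMs > maxSkewMs;
    }

    /** PT after merging a remote clock, or an exception if skew is too high. */
    static long adopt(long localMs, long remoteMs, long maxSkewMs) {
        if (exceedsMaxSkew(localMs, remoteMs, maxSkewMs)) {
            throw new IllegalStateException("skew above max: " + (remoteMs - localMs));
        }
        return Math.max(localMs, remoteMs);
    }

    public static void main(String[] args) {
        long maxSkew = 10_000;
        long master = 20_000;                       // master at 'real time' 20
        master = adopt(master, 29_900, maxSkew);    // bad RS pulls master to 29.9
        System.out.println("master PT: " + master);
        // A healthy RS only 0.2 sec behind real time now exceeds max skew.
        System.out.println(exceedsMaxSkew(19_800, master, maxSkew));
    }
}
```

The bad RS itself stays within max skew of the master (9.9 sec), so it is never ejected, while healthy servers slightly behind real time are.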
> —
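> Problem 4: Time jumps (not a bug, but more of a nuisance)
> Say an RS is behind master by 20 sec. On each communication from master, the RS 
> will update its own PT to master's PT, and it'll stay there until the RS's ST 
> catches up. If communication from master is frequent, ST might never 
> catch up, and the RS's PT will look like discrete time jumps rather than 
> continuous time.
> For example, if master communicated with the RS at times 30, 40, 50 (the RS's 
> corresponding times being 10, 20, 30), then all events on the RS between times 
> [10, 50] will be timestamped with either 30, 40 or 50.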

[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-08-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119262#comment-16119262
 ] 

Hadoop QA commented on HBASE-18432:
---

| +1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 4m 32s | HBASE-14070.HLC passed |
| +1 | compile | 0m 22s | HBASE-14070.HLC passed |
| +1 | checkstyle | 0m 27s | HBASE-14070.HLC passed |
| +1 | mvneclipse | 0m 13s | HBASE-14070.HLC passed |
| +1 | findbugs | 0m 51s | HBASE-14070.HLC passed |
| +1 | javadoc | 0m 23s | HBASE-14070.HLC passed |
| +1 | mvninstall | 0m 25s | the patch passed |
| +1 | compile | 0m 21s | the patch passed |
| +1 | javac | 0m 21s | the patch passed |
| +1 | checkstyle | 0m 27s | the patch passed |
| +1 | mvneclipse | 0m 13s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 39m 40s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha4. |
| +1 | findbugs | 0m 53s | the patch passed |
| +1 | javadoc | 0m 19s | the patch passed |
| +1 | unit | 2m 50s | hbase-common in the patch passed. |
| +1 | asflicense | 0m 9s | The patch does not generate ASF License warnings. |
| | | 52m 53s | |

|| Subsystem || Report/Notes ||
| Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:757bf37 |
| JIRA Issue | HBASE-18432 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12880936/HBASE-18432.HBASE-14070.HLC.002.patch |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 9da3de9a3ace 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | HBASE-14070.HLC / d9a9904 |
| Default Java | 1.8.0_144 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/7989/testReport/ |
| modules | C: hbase-common U: hbase-common |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7989/console |
| Powered by | Apache Yetus 0.4.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-08-02 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110879#comment-16110879
 ] 

stack commented on HBASE-18432:
---

bq. how to handle skews...

NTP? If not, let's take on this issue, [~appy]. Thanks.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-08-01 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110127#comment-16110127
 ] 

Appy commented on HBASE-18432:
--

Breaking out the design changes in 001.patch into a separate jira (HBASE-18498) 
so that this one can focus on the real problem: how to handle skews.



[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-07-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099237#comment-16099237
 ] 

Appy commented on HBASE-18432:
--

Quickly went through the design doc attached to the parent jira 
(https://issues.apache.org/jira/secure/attachment/12745165/HybridLogicalClocksforHBaseandPhoenix.pdf), 
but didn't find answers/solutions to the problems raised above. Specifically, 
nothing under the headings 'Representing HLC in 64 bits', 'Dealing with LC 
overflow', 'Clock drift protection', or the implementation actions addresses 
them.
Since that doc doesn't go into the exact implementation of the clocks, I think 
it's natural that such nuances and issues weren't covered/seen then. Now that we 
have concrete implementations, these issues are easier to see.

bq. The size of the logical component has been carefully considered (see design 
attached). Do we have to revisit? If not enough ticks to catch-up when skew, 
then the server should not be allowed take edits – reject writes for a while – 
and if it goes on too long, should be ejected from the cluster.

In the current implementation, we throw ClockException on overflow, which 
crashes the server immediately. I think that part is fine.
In the doc, 'Dealing with LC overflow' argues that an LC window of 65k events in 
1 ms is more than sufficient. That's true, but only when the clock keeps moving 
forward.
Note that the problems I mentioned above share a common theme: the physical time 
stops when there's a skew.
It's an implementation issue with the physical clock part of the time. (I realize 
that I forgot to update the problem 1 text after my solution. Removing the bits 
calculation part since that won't be needed if we keep moving PT forward with 
this patch.)


bq. On problem #2, if skew of 30seconds, just eject from cluster. It is lagging 
beyond our configured max.
Problem 2 is not about large skew. Let me update RS3's time slightly so others 
don't get distracted by the choice of time in the example.

bq. What is 'skew'. It is diff between Master time and our time (a RS)?
Yes.
Another way of looking at it is as a 'time correction'.

bq. Skew can be +/-?
In the current implementation, we don't update the time when the received time 
is less than the current time. We can't go backwards, so I kept skew +ve.
This can be discussed more.

bq. This 'catch-up' on skew is always going on given it rare that skew == 0?
Yes, but see it as a correction.

bq. Does the 'catch-up' mechanism happen only when skew is > max_skew or is 
max_skew when we take ourselves out of the cluster (kill ourselves?)
On skew > max_skew, the server kills itself.

bq. So we do System.currentTimeMillis + current skew? When does skew get 
changed?
When we get an update() with a timestamp larger than our (current time + 
skew/correction), we update our skew/correction.
Currently skew cannot decrease. I think this needs more thought.

But the larger idea is that we need to keep physical time moving, unlike the 
current implementation.
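
The answers above can be condensed into a minimal sketch of a skew-tracking clock. It follows the stated rules only: PT = ST + skew, skew grows when update() sees a PT ahead of ST + skew, skew never decreases, and skew above max_skew is fatal. The class name, the injectable time source and the use of IllegalStateException (standing in for ClockException) are assumptions for illustration; this is not the code in the attached patch, and it ignores the remote LT for brevity.

```java
import java.util.function.LongSupplier;

/** Sketch of a clock that derives PT from ST + skew; not the HBase patch. */
final class SkewTrackingClock {
    private final LongSupplier systemTime; // ST source, e.g. System::currentTimeMillis
    private final long maxSkew;            // same unit as ST
    private long skew;                     // correction added to ST; only grows
    private long lastPt = Long.MIN_VALUE;  // last PT handed out
    private long lt;                       // logical counter within one PT value

    SkewTrackingClock(LongSupplier systemTime, long maxSkew) {
        this.systemTime = systemTime;
        this.maxSkew = maxSkew;
    }

    /** PT = ST + skew, so PT keeps moving even while ST is behind. */
    synchronized long[] toTimestamp() {
        long pt = systemTime.getAsLong() + skew;
        if (pt > lastPt) {
            lastPt = pt;
            lt = 0;
        } else {
            lt++; // several events within the same PT value
        }
        return new long[] {lastPt, lt};
    }

    /** Grow the skew if the remote PT is ahead of ST + skew. */
    synchronized void update(long remotePt) {
        long candidate = remotePt - systemTime.getAsLong();
        if (candidate <= skew) {
            return; // remote is not ahead of our corrected time
        }
        if (candidate > maxSkew) {
            throw new IllegalStateException("skew " + candidate + " exceeds max_skew " + maxSkew);
        }
        skew = candidate;
    }

    public static void main(String[] args) {
        long[] st = {10};                  // RS system time, in seconds
        SkewTrackingClock rs = new SkewTrackingClock(() -> st[0], 30);
        rs.update(20);                     // master is 10 sec ahead
        for (int i = 0; i < 3; i++) {
            long[] t = rs.toTimestamp();
            System.out.println("(" + t[0] + ", " + t[1] + ")");
            st[0] += 1;
        }
    }
}
```

In the demo the PT advances together with ST after the update instead of freezing at the master's value, so the logical counter does not accumulate during the skew period.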



[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-07-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097559#comment-16097559
 ] 

stack commented on HBASE-18432:
---

On the solution:

bq. Keep track of skew in clock. And instead of keeping track of physical time, 
always compute it by adding system time and skew.

Skew in system clock?

What is 'skew'? Is it the diff between Master time and our time (an RS)?

Can skew be +/-?

Is this 'catch-up' on skew always going on, given it is rare that skew == 0?

Which clocks is this solution for?

bq. On update(), recalculate skew and validate if it's greater than max_skew.

Does the 'catch-up' mechanism happen only when skew is > max_skew, or is 
max_skew the point at which we take ourselves out of the cluster (kill ourselves)?

bq. On toTimestamp(), calculate PT = ST+skew.

So we do System.currentTimeMillis + current skew?

When does skew get changed?


[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-07-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097551#comment-16097551
 ] 

stack commented on HBASE-18432:
---

On problem #1, 'That means, in worst case, our logical clock window should be 
big enough to support all the events that can happen in max skew time.'

The size of the logical component has been carefully considered (see the design 
doc attached). Do we have to revisit it? If there are not enough ticks to catch 
up under skew, then the server should not be allowed to take edits -- it should 
reject writes for a while -- and if that goes on too long, it should be ejected 
from the cluster.

On problem #2, if the skew is 30 seconds, just eject the server from the cluster. 
It is lagging beyond our configured max.

On problem #3, yeah, this is a good problem against which we should have 
protection in place.

Yeah, problem #4 seems to be just a nuisance, not a 'problem'.

Java can't set system time. NTP needs to be in place and working to 'fix' skew. 
If NTP is not running, the cluster can't proceed.

Let me look at the patch. What do you think of it, Amit?

> Prevent clock from getting stuck after update()
> ---
>
> Key: HBASE-18432
> URL: https://issues.apache.org/jira/browse/HBASE-18432
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Appy
>Assignee: Appy
> Attachments: HBASE-18432.HBASE-14070.HLC.001.patch
>
>
> There were a [bunch of 
> problems|https://issues.apache.org/jira/browse/HBASE-14070?focusedCommentId=16094013=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16094013]
>  (also copied below) with clock getting stuck after call to update() until 
> it's own system time caught up.
> 
> PT = physical time, LT = logical time, ST = system time, X = don't care terms
> 
> Core issue:
> - Note that in current implementation, we are passing master clock to RS in 
> open/close region request and RS clock to master in the responses. And they 
> both update their own time on receiving these request/response.
> - On receiving a clock ahead of its own, they update their own clock to its 
> PT+LT, and keep increasing LT till their own ST catches that PT.
> 
> Proposed solution:
> Keep track of skew in clock. And instead of keeping track of physical time, 
> always compute it by adding system time and skew.
> On update(), recalculate skew and validate if it's greater than max_skew.
> On toTimestamp(), calculate PT = ST+skew.
> -
> -
> Issues with current approach:
> 
> Problem 1: Logical time window too small.
> RS clock (10, X)
> Master clock (20, X)
> Master --request-> RS
> RS clock (20, X)
> While RS's physical java clock (which is backing up physical component of hlc 
> clock) will still take 10 sec to catch up, we'll keep incrementing logical 
> component. That means, in worst case, our logical clock window should be big 
> enough to support all the events that can happen in max skew time.
> The problem is, that doesn't seem to be the case. Our logical window is 1M 
> events (20bits) and max skew time is 30 sec, that results in 33k max write 
> qps, which is quite low. We can easily see 150k update qps per beefy server 
> with 1k values.
> Even 22 bits won't be enough. We'll need minimum of 23 bits and 20 sec max 
> skew time to support ~420k max events per second in worst case clock skew.
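
The capacity figures in Problem 1 can be checked with a few lines of arithmetic. The sketch below assumes the worst case described there: PT stays pinned for the whole max skew time, so every event in that window consumes one LT value. The class and method names are hypothetical.

```java
/** Hypothetical helper for the Problem 1 arithmetic. */
final class LogicalWindow {
    /**
     * Largest sustained event rate (events/sec) that fits in the logical
     * window when PT is stuck for maxSkewSec seconds.
     */
    static long maxEventsPerSecond(int logicalBits, long maxSkewSec) {
        return (1L << logicalBits) / maxSkewSec;
    }

    public static void main(String[] args) {
        System.out.println(maxEventsPerSecond(20, 30)); // 34952
        System.out.println(maxEventsPerSecond(22, 30)); // 139810
        System.out.println(maxEventsPerSecond(23, 20)); // 419430
    }
}
```

With exact powers of two, 20 bits and 30 sec give about 35k events per second (the 33k above rounds 2^20 down to 1M), 22 bits and 30 sec give about 140k, still under the 150k qps mentioned, and 23 bits with 20 sec give the ~420k figure.
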
> 
> Problem 2: Cascading logical time increment.
> When more RS are involved say - 3 RS and 1 master. Let's say max skew is 30 
> sec.
> HLC Clocks (physical time, logical time): X = don't care
> RS1: (50, 100k)
> Master: (40, X)
> RS2: (30, X)
> RS3: (20, X) 
> [RS3's ST behind RS1's by 30 sec.]
> RS1 replies to master, sending its clock (50, X).
> Master's clock = (50, X). It'll be another 10 sec before its own physical 
> clock reaches 50, so HLC's PT will remain 50 for the next 10 sec.
> Master --> RS2
> RS2's clock = (50, X).
> RS2 keeps incrementing LT on writes (since its own PT is behind) for a few 
> seconds before it replies to master with (50, X + few 100k).
> Master's clock = (50, X+ few 100k) [Since master's physical clock hasn't 
> caught up yet, note that it was 10 seconds behind, PT remains 50.].
> Master --> RS3
> RS3's clock (50, X+few 100k) 
> But RS3's ST is behind RS1's ST by 30 sec, which means it'll keep 
> incrementing LT for next 30 sec (unless it gets a newer clock from master).
> But the problem is, RS3 has much smaller LT window than actual 1M!!
> —
> Problem 3: Single bad RS clock crashing the cluster:
> If a single RS's clock is bad and a bit faster, it'll catch time and keep 
> pulling master's PT with it. If 'real time' is say 20, max skew time is 10, 
> and bad RS is at time 29.9, it'll pull master to 29.9 (via next response), 
> and then any RS less than 19.9, i.e. just 0.1 sec away from real time will 
> die due to higher than max skew.
> This can bring whole 

[jira] [Commented] (HBASE-18432) Prevent clock from getting stuck after update()

2017-07-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097120#comment-16097120
 ] 

Hadoop QA commented on HBASE-18432:
---

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 20s | Docker mode activated. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| 0 | mvndep | 2m 12s | Maven dependency ordering for branch |
| +1 | mvninstall | 5m 1s | HBASE-14070.HLC passed |
| +1 | compile | 0m 53s | HBASE-14070.HLC passed |
| +1 | checkstyle | 0m 43s | HBASE-14070.HLC passed |
| +1 | mvneclipse | 0m 36s | HBASE-14070.HLC passed |
| -1 | findbugs | 2m 39s | hbase-server in HBASE-14070.HLC has 9 extant Findbugs warnings. |
| +1 | javadoc | 0m 45s | HBASE-14070.HLC passed |
| 0 | mvndep | 0m 16s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 0s | the patch passed |
| +1 | compile | 0m 53s | the patch passed |
| +1 | javac | 0m 53s | the patch passed |
| +1 | checkstyle | 0m 40s | the patch passed |
| +1 | mvneclipse | 0m 25s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 29m 3s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha4. |
| +1 | findbugs | 3m 23s | the patch passed |
| -1 | javadoc | 0m 15s | hbase-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| -1 | unit | 2m 14s | hbase-common in the patch failed. |
| -1 | unit | 71m 14s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 58s | The patch does not generate ASF License warnings. |
| | | 125m 5s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.TestTimestampType |
| Timed out junit tests | org.apache.hadoop.hbase.master.procedure.TestDisableTableProcedure |
|   | org.apache.hadoop.hbase.regionserver.wal.TestSecureWALReplay |
|   | org.apache.hadoop.hbase.master.procedure.TestModifyTableProcedure |
|   | org.apache.hadoop.hbase.master.procedure.TestCreateTableProcedure |
|   | org.apache.hadoop.hbase.master.procedure.TestEnableTableProcedure |
|   | org.apache.hadoop.hbase.master.procedure.TestServerCrashProcedure |
|   | org.apache.hadoop.hbase.master.procedure.TestDeleteTableProcedure |
|   | org.apache.hadoop.hbase.regionserver.TestRowTooBig |
|   | org.apache.hadoop.hbase.regionserver.TestSplitLogWorker |
|   | org.apache.hadoop.hbase.regionserver.wal.TestAsyncWALReplay |
|   | org.apache.hadoop.hbase.client.TestSnapshotCloneIndependence |
|   | org.apache.hadoop.hbase.coprocessor.TestHTableWrapper |
|   |