[jira] [Comment Edited] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2021-06-07 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358964#comment-17358964
 ] 

Andrew Kyle Purtell edited comment on HBASE-24440 at 6/8/21, 1:13 AM:
--

{quote}
IIUC the idea here is to maximize the commit throughput as much as possible in 
a single tick by picking from the available pool of disjoint row keys that can 
share the same tick. Let me take a look at the patch..
{quote}
This is the HBASE-25975 subtask. I still need to add a test that proves it even 
works. :-) You might want to come back to that subtask in a bit, when I post an 
update there. 

The basic idea is this (a rough sketch in code follows the list): 
- First, check if the clock has advanced, atomically. If it has, then clear the 
row set, also atomically with respect to the clock advance check. 
- Get a reference to the row set for the current tick. Run through all of the 
rows involved in the pending mutation. The check adds each row to the set, but 
allows the pending mutation only if no row was already present in the set when 
it was added.
- If the pending mutation is allowed to go forward, return the time captured at 
the atomic clock advance check as the timestamp to substitute into the 
mutation.
- Otherwise, yield the thread and try again, which keeps the handler queued 
until the mutation's rows are finally disjoint from all others in the current 
tick (per the check logic described above).
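
Roughly, in code (a minimal sketch with hypothetical names, not the actual 
patch): 
{code:java}
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PerTickRowCheck {
  private long tick = -1L;
  private final Set<ByteBuffer> tickRows = new HashSet<>();

  /** Returns the timestamp to substitute, or -1 if the caller should yield. */
  synchronized long tryAdmit(List<byte[]> rows) {
    long now = System.currentTimeMillis();
    if (now > tick) {
      // Clock advanced: clear the row set, atomically with the advance check.
      tick = now;
      tickRows.clear();
    }
    boolean admitted = true;
    for (byte[] row : rows) {
      // add() returns false if this row was already claimed within this tick.
      if (!tickRows.add(ByteBuffer.wrap(row))) {
        admitted = false;
      }
    }
    // Flaw discussed below: on a yield decision, rows added before the
    // conflicting one remain in the set and may block other mutations.
    return admitted ? now : -1L;
  }
}
{code}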

One thing I realized I'm missing while typing this up is some kind of rollback: 
if we add rows to the set for a mutation that overlaps elsewhere, and is 
therefore going to yield instead of go forward, those rows may block some other 
pending operation. That's what I was alluding to above... what I have right now 
is too simple. 

Edit: The straightforward thing to do is a synchronized section that does all 
of the existence tests against the set but doesn't mutate the set until we have 
a pass or yield decision, at which point the result is merged into the set or 
not. Hmm...
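
Concretely, the two-phase version of the sketch above (replacing tryAdmit): 
{code:java}
/** Two-phase variant: test all rows first, mutate the set only on a pass. */
synchronized long tryAdmit(List<byte[]> rows) {
  long now = System.currentTimeMillis();
  if (now > tick) {
    tick = now;
    tickRows.clear();
  }
  // Phase 1: existence tests only; nothing is added yet.
  for (byte[] row : rows) {
    if (tickRows.contains(ByteBuffer.wrap(row))) {
      return -1L; // yield decision: the set was not mutated
    }
  }
  // Phase 2: pass decision reached; merge the rows into the set.
  for (byte[] row : rows) {
    tickRows.add(ByteBuffer.wrap(row));
  }
  return now;
}
{code}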

Edit2: I like this phrasing ("picking from the available pool of disjoint row 
keys that can share the same tick") but it's first come, first served. We don't 
pick among alternatives. (That feels like constraint solving... all for one 
tick?) It's just a one-pass filter. 


> Prevent temporal misordering on timescales smaller than one clock tick
> --
>
> Key: HBASE-24440
> URL: https://issues.apache.org/jira/browse/HBASE-24440
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> When mutations are sent to the servers without a timestamp explicitly 
> assigned by the client, the server will substitute the current wall clock 
> time. There are edge cases where it is at least theoretically possible for 
> more than one mutation to be committed to a given row within the same clock 
> tick. When this happens we have to track and preserve the ordering of these 
> mutations in some other way besides the timestamp component of the key. Let 
> me bypass most discussion here by noting that whether we do this or not, we 
> do not pass such ordering information in the cross cluster replication 
> protocol. We also have interesting edge cases regarding key type precedence 
> when mutations arrive "simultaneously": we sort deletes ahead of puts. This, 
> especially in the presence of replication, can lead to visible anomalies for 
> clients able to interact with both source and sink. 
> There is a simple solution that removes the possibility that these edge cases 
> can occur: 
> We can detect, when we are about to commit a mutation to a row, if we have 
> already committed a mutation to this same row in the current clock tick. 
> Occurrences of this condition will be rare. We are already tracking current 
> time. We have to know this in order to assign the timestamp. Where this 
> becomes interesting is how we might track the last commit time per row. 
> Making the detection of this case efficient for the normal code path is the 
> bulk of the challenge. One option is to keep track of the last locked time 
> for row locks. (Todo: how would we track and garbage collect this efficiently 
> and correctly? Not the ideal option.) We might also do this tracking somehow 
> via the memstore. (At least in this case the lifetime and distribution of in 
> memory row state, including the proposed timestamps, would align.) Assuming 
> we can efficiently know if we are about to commit twice to the same row 
> within a single clock tick, we would simply sleep/yield the current thread 
> until the clock ticks over, and then proceed. 
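
As a sketch of that simple solution (illustrative only; lastCommitTime() and 
commit() are hypothetical stand-ins for the per-row tracking and the write 
path): 
{code:java}
// Yield until no commit to this row has happened in the current clock tick,
// then proceed with the substituted timestamp.
long commitOncePerTick(byte[] row, Object mutation) throws InterruptedException {
  long ts = System.currentTimeMillis();
  while (lastCommitTime(row) >= ts) { // already committed this tick?
    Thread.sleep(1);                  // sleep/yield until the clock ticks over
    ts = System.currentTimeMillis();
  }
  commit(mutation, ts);
  return ts;
}
{code}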



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2021-06-07 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358203#comment-17358203
 ] 

Andrew Kyle Purtell edited comment on HBASE-24440 at 6/8/21, 12:39 AM:
---

In HBASE-25913 I considered serialization of commits within a given clock tick 
using the system clock, by extending the interface EnvironmentEdge, which 
already dealt with time. Consider an API for getting a Clock, by name, from a 
concurrent map of such things, and then using this clock to get _advancing 
time_, which is the current system time, like System.currentTimeMillis, but 
with an additional semantic: when you call getCurrentAdvancing() you will 
always get a value that has advanced forward from the last time someone looked 
at this particular clock's time. The particular implementation of Clock is free 
to maintain that invariant with different strategies, and I explored several 
different implementations:
- IncrementAdvancingClock: like a poor man's hybrid clock; if the system time 
hasn't advanced, manually advance it, and ensure other API calls to the same 
clock don't return a different, earlier time
- SpinAdvancingClock: if the system time hasn't advanced, spin until it does
- YieldAdvancingClock: if the system time hasn't advanced, yield the thread 
with a small sleep until it does
- BoundedIncrementSpinAdvancingClock: advance ahead of system time up to some 
bound (like 1 second) but then spin until the system time catches up
- BoundedIncrementYieldAdvancingClock: advance ahead of system time up to some 
bound (like 1 second) but then yield with a small sleep until the system time 
catches up

These implementations were microbenchmarked to identify the superior option, 
which seems to be IncrementAdvancingClock. 
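
To make the invariant concrete, here is a minimal sketch of the 
increment-advancing strategy, together with the "get a Clock, by name, from a 
concurrent map" registry described above. Class and method names follow the 
discussion here and are illustrative, not the actual HBASE-25913 patch:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of an advancing clock: every call returns a time strictly greater
// than any value previously returned by this clock. If the system clock has
// not ticked, advance manually by one millisecond.
class IncrementAdvancingClock {
  private final AtomicLong lastTime = new AtomicLong();

  long getCurrentAdvancing() {
    while (true) {
      long last = lastTime.get();
      long now = System.currentTimeMillis();
      // Take the system time if it has advanced, otherwise increment.
      long next = now > last ? now : last + 1;
      if (lastTime.compareAndSet(last, next)) {
        return next;
      }
      // Raced with another caller; reload and try again.
    }
  }
}

// Sketch of the named-clock registry backed by a concurrent map.
class Clocks {
  private static final ConcurrentMap<String, IncrementAdvancingClock> CLOCKS =
      new ConcurrentHashMap<>();

  static IncrementAdvancingClock get(String name) {
    return CLOCKS.computeIfAbsent(name, n -> new IncrementAdvancingClock());
  }
}
{code}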

The main issue with HBASE-25913 is that serializing commits with the act of 
getting the time requires that everything which might mutate the resource you 
are trying to protect be within that Clock's scope. For a single row update the 
scope would be a row, and highly granular, so performance will not be affected 
that much. For batch mutations, which are typical, the scope is multiple rows, 
and therefore the region (_strictly speaking, it is the union set of the rows 
in the batch mutation_ ... I will come back to this later), and serializing 
commits to a region is expensive. Unfortunately it is always the case that the 
region must be the scope of control. Consider multiple RPC handlers executing 
in parallel. One is a single row mutation for rowA. Another is a batch mutation 
for rowA, rowB, and rowC. If we "optimize" single row updates without respect 
to concurrent batch requests, with two separate clock instances for row(rowA) 
and region(rowA,rowB,rowC), then rowA in this scenario might get two commits in 
the same system clock tick, because two clocks could have different notions of 
when time last advanced. We've done a lot of work to gain no assurance. That is 
an invalid optimization, so the HBASE-25913 change creates a Clock per region, 
and all timekeeping for the region goes through that Clock. 

With BoundedIncrementYieldAdvancingClock, that means the region can take a 
burst of commits within a single clock tick up to the bound, and then there is 
a performance discontinuity while we yield until the system time catches up 
(which can take up to 1 second). Other threads on other regions will make 
progress, but for the client attempting to commit a burst through the yielding 
operation, performance is a concern. With YieldAdvancingClock, the region can 
take only one commit for an entire clock tick, which is way worse. The 
potential performance impacts were known at the outset of the work, so they are 
not surprising; with concrete implementations, the impacts can now be measured.

This should be managed at the row scope, not the region scope, but then what to 
do about batch mutations? I struggled for a while with various forms of 
*hierarchical clocks* where one might get a clock for the region, then use that 
clock to get child clocks for a row, and the hierarchy would make sure that 
time would not run backwards for anyone even if the row clocks were of the 
IncrementXXX types (because spinning or yielding is simply too expensive to do 
outright). This was not successful in the way I would have liked, because these 
designs always ended up serializing _something_ at region scope in a way that I 
thought would impact performance rather negatively. 

So, let's take a step back: What is it we are really tracking? 

We are tracking whether or not a row has already been committed to in the 
current clock tick. Or, _we are tracking per region a set of rows per clock 
tick_. Why not track that explicitly? 

This idea is being explored with HBASE-25975, which does that with a 
region-level data structure based on atomics, because access to it will be 
highly multithreaded. In some ways it is much simpler in retrospect, but may 
become more complex when considering optimizations. 
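
As a rough illustration of what such a structure might look like (a 
simplification for exposition, not the HBASE-25975 patch; real code would key 
on byte[] row keys and avoid the coarse synchronization shown here):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: per-region tracking of the set of rows committed to
// within the current clock tick. Admission succeeds only if the row has not
// yet been seen in this tick.
class PerTickRowSet {
  private long currentTick = Long.MIN_VALUE;
  private final Set<String> rowsThisTick = ConcurrentHashMap.newKeySet();

  synchronized boolean tryAdmit(long tick, String row) {
    if (tick > currentTick) {
      currentTick = tick;      // the clock has advanced
      rowsThisTick.clear();    // rows from earlier ticks no longer conflict
    }
    return rowsThisTick.add(row);
  }
}
{code}

A caller that receives false would yield the handler thread and retry once the 
clock ticks over.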

[jira] [Comment Edited] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2021-05-24 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350711#comment-17350711
 ] 

Andrew Kyle Purtell edited comment on HBASE-24440 at 5/25/21, 1:39 AM:
---

I'm considering something simple:
 * Ensure our discipline in using {{EnvironmentEdge#currentTime}} instead of 
{{System#currentTimeMillis}} (HBASE-25912)
 * Fix current issues (HBASE-25911)
 * Introduce new {{EnvironmentEdge#currentTimeAdvancing}} which ensures that 
when the current time is returned, it is the current time in a different clock 
tick from the last time the {{EnvironmentEdge}} was used to get the current 
time. Use {{EnvironmentEdge#currentTimeAdvancing}} wherever we go to substitute 
a {{Long.MAX_VALUE}} timestamp placeholder with a real timestamp when 
processing mutations. See HBASE-25913 for more details.
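
As a hedged sketch of that semantic (the method body below is illustrative 
only; see HBASE-25913 for the real proposal), {{currentTimeAdvancing}} can be 
expressed as a loop over {{currentTime}}:

{code:java}
// Illustrative sketch of EnvironmentEdge#currentTimeAdvancing: return a time
// guaranteed to fall in a later clock tick than the previous return value,
// sleeping briefly while the clock has not yet ticked over.
abstract class AdvancingEdgeSketch {
  private long lastReturned = -1;

  abstract long currentTime();

  synchronized long currentTimeAdvancing() throws InterruptedException {
    long now = currentTime();
    while (now <= lastReturned) {
      Thread.sleep(1); // wait for the next clock tick
      now = currentTime();
    }
    lastReturned = now;
    return now;
  }
}
{code}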


was (Author: apurtell):
I'm considering something simple:
 * Ensure our discipline in using {{EnvironmentEdge#currentTime}} instead of 
{{System#currentTimeMillis}} (HBASE-25912)
 * Fix current issues (HBASE-25911)
 * Introduce new {{EnvironmentEdge#currentTimeAdvancing}} which ensures that 
when the current time is returned, it is the current time in a different clock 
tick from the last time the {{EnvironmentEdge}} was used to get the current 
time. Use {{EnvironmentEdge#currentTimeAdvancing}} wherever we go to substitute 
a {{Long.MAX_VALUE}} timestamp placeholder with a real placeholder. See 
HBASE-25913 for more details.

> Prevent temporal misordering on timescales smaller than one clock tick
> --
>
> Key: HBASE-24440
> URL: https://issues.apache.org/jira/browse/HBASE-24440
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> When mutations are sent to the servers without a timestamp explicitly 
> assigned by the client the server will substitute the current wall clock 
> time. There are edge cases where it is at least theoretically possible for 
> more than one mutation to be committed to a given row within the same clock 
> tick. When this happens we have to track and preserve the ordering of these 
> mutations in some other way besides the timestamp component of the key. Let 
> me bypass most discussion here by noting that whether we do this or not, we 
> do not pass such ordering information in the cross cluster replication 
> protocol. We also have interesting edge cases regarding key type precedence 
> when mutations arrive "simultaneously": we sort deletes ahead of puts. This, 
> especially in the presence of replication, can lead to visible anomalies for 
> clients able to interact with both source and sink. 
> There is a simple solution that removes the possibility that these edge cases 
> can occur: 
> We can detect, when we are about to commit a mutation to a row, if we have 
> already committed a mutation to this same row in the current clock tick. 
> Occurrences of this condition will be rare. We are already tracking current 
> time. We have to know this in order to assign the timestamp. Where this 
> becomes interesting is how we might track the last commit time per row. 
> Making the detection of this case efficient for the normal code path is the 
> bulk of the challenge. One option is to keep track of the last locked time 
> for row locks. (Todo: how would we track and garbage collect this efficiently 
> and correctly? Not the ideal option.) We might also do this tracking somehow 
> via the memstore. (At least in this case the lifetime and distribution of in 
> memory row state, including the proposed timestamps, would align.) Assuming 
> we can efficiently know if we are about to commit twice to the same row 
> within a single clock tick, we would simply sleep/yield the current thread 
> until the clock ticks over, and then proceed. 
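
To make the sleep/yield idea above concrete, here is a minimal sketch. The 
per-row map, the names, and the String row key are assumptions for 
illustration; the actual tracking mechanism (row locks, the memstore) and the 
garbage collection of entries remain the open questions noted above:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: tracks the last commit tick per row and makes a writer
// wait out the tick if this row was already committed within it. Evicting
// old entries is deliberately left open, as in the description above.
class PerRowTickGuard {
  private final ConcurrentMap<String, Long> lastCommitTick = new ConcurrentHashMap<>();

  long acquireCommitTime(String rowKey) throws InterruptedException {
    while (true) {
      long now = System.currentTimeMillis();
      Long prev = lastCommitTick.get(rowKey);
      if (prev != null && prev >= now) {
        Thread.sleep(1); // same tick as the prior commit; wait for the clock to tick over
        continue;
      }
      boolean won = (prev == null)
          ? lastCommitTick.putIfAbsent(rowKey, now) == null
          : lastCommitTick.replace(rowKey, prev, now);
      if (won) {
        return now; // the commit may proceed with this timestamp
      }
      // Lost a race with another writer on the same row; re-check.
    }
  }
}
{code}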



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2021-05-24 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350711#comment-17350711
 ] 

Andrew Kyle Purtell edited comment on HBASE-24440 at 5/25/21, 1:28 AM:
---

I'm considering something simple:
 * Ensure our discipline in using {{EnvironmentEdge#currentTime}} instead of 
{{System#currentTimeMillis}} (HBASE-25912)
 * Fix current issues (HBASE-25911)
 * Introduce new {{EnvironmentEdge#currentTimeAdvancing}} which ensures that 
when the current time is returned, it is the current time in a different clock 
tick from the last time the {{EnvironmentEdge}} was used to get the current 
time. Use {{EnvironmentEdge#currentTimeAdvancing}} wherever we go to substitute 
a {{Long.MAX_VALUE}} timestamp placeholder with a real timestamp. See 
HBASE-25913 for more details.


was (Author: apurtell):
I'm considering something simple:
 * Ensure our discipline in using {{EnvironmentEdge#currentTime}} instead of 
{{System#currentTimeMillis}} (HBASE-25912)
 * Fix current issues (HBASE-25911)
 * Introduce new {{EnvironmentEdge#currentTimeAdvancing}} which ensures that 
when the current time is returned, it is the current time in a different clock 
tick from the last time the {{EnvironmentEdge}} was used to get the current 
time. 
 * Use {{EnvironmentEdge#currentTimeAdvancing}} wherever we go to substitute a 
{{Long.MAX_VALUE}} timestamp placeholder with a real timestamp. When 
processing a batch mutation we will call {{currentTimeAdvancing}} only once 
(see the sketch below). This means the client cannot bundle cells with 
wildcard timestamps into a batch where those cells must be committed with 
different timestamps; clients simply must not submit mutations that require 
guaranteed distinct timestamps in the same batch. Easy to understand, easy to 
document, and it aligns with our design philosophy that the client knows best. 
It will be fine to continue using {{EnvironmentEdge#currentTime}} everywhere 
else. In this way we only potentially spin-wait where it matters, and won't 
suffer serious overhead during batch processing.
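
A minimal sketch of that once-per-batch discipline, under the same assumptions 
as the earlier sketches ({{currentTimeAdvancing}} is the proposed API and does 
not exist yet; the surrounding loop is illustrative, not the actual batch path):

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

// Illustrative only: one advancing-time call per batch; every wildcard
// timestamp in the batch resolves to that same, guaranteed-fresh tick.
final class BatchCommitSketch {
  static void commitBatch(List<Mutation> batch) throws IOException {
    long commitTime = EnvironmentEdgeManager.currentTimeAdvancing(); // proposed API; called once
    for (Mutation m : batch) {
      TimestampSubstitution.substitute(m, commitTime); // helper sketched above
      // ... apply the mutation to the WAL and memstore (elided) ...
    }
  }
}
{code}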



[jira] [Comment Edited] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2020-06-01 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121293#comment-17121293
 ] 

Andrew Kyle Purtell edited comment on HBASE-24440 at 6/1/20, 8:12 PM:
--

I am aware. If we do this I don’t think we will need it at all, configurable or 
not. But that is out of scope for this issue.

Edit: Some might respond, validly, that this is splitting hairs, because one 
follows from the other: if two identical keys, timestamps included, can never 
be committed to a row, then we don't need a sorting rule by operator 
precedence for a case that, after this proposed change, can never happen. I am 
proposing we do it in steps, with small reversible changes, because this is 
such a critical area for correctness; but if the consensus is to do it 
together, I would not oppose that, for what it's worth.


was (Author: apurtell):
I am aware. If we do this I don’t think we will need it at all, configurable or 
not. But that is out of scope for this issue.

Edit: Some might respond, validly, that this is splitting hairs, because one 
follows the other: If we will never have two exact keys including timestamps 
ever committed to a row, then we don't need a sorting rule by operator 
precedence. I am proposing we do it in steps, with small reversible changes, 
because this is such a critical area for correctness, but if the consensus is 
to do it together, I would not oppose that for what it's worth.



[jira] [Comment Edited] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2020-06-01 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121293#comment-17121293
 ] 

Andrew Kyle Purtell edited comment on HBASE-24440 at 6/1/20, 8:11 PM:
--

I am aware. If we do this I don’t think we will need it at all, configurable or 
not. But that is out of scope for this issue.

Edit: Some might respond, validly, that this is splitting hairs, because one 
follows the other: If we will never have two exact keys including timestamps 
ever committed to a row, then we don't need a sorting rule by operator 
precedence. I am proposing we do it in steps, with small reversible changes, 
because this is such a critical area for correctness, but if the consensus is 
to do it together, I would not oppose that for what it's worth.


was (Author: apurtell):
I am aware. If we do this I don’t think we will need it at all, configurable or 
not. But that is out of scope for this issue. 



[jira] [Comment Edited] (HBASE-24440) Prevent temporal misordering on timescales smaller than one clock tick

2020-05-26 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116934#comment-17116934
 ] 

Andrew Kyle Purtell edited comment on HBASE-24440 at 5/26/20, 5:55 PM:
---

Note this is a trick Apache Phoenix already uses to ensure uniqueness of 
timestamps for indexes. /cc [~gjacoby]


was (Author: apurtell):
Note this is a trick Apache Phoenix uses to ensure uniqueness of timestamps for 
indexes.

> Prevent temporal misordering on timescales smaller than one clock tick
> --
>
> Key: HBASE-24440
> URL: https://issues.apache.org/jira/browse/HBASE-24440
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Andrew Kyle Purtell
>Priority: Major
>
> When mutations are sent to the servers without a timestamp explicitly 
> assigned by the client the server will substitute the current wall clock 
> time. There are edge cases where it is at least theoretically possible for 
> more than one mutation to be committed to a given row within the same clock 
> tick. When this happens we have to track and preserve the ordering of these 
> mutations in some other way besides the timestamp component of the key. Let 
> me bypass most discussion here by noting that whether we do this or not, we 
> do not pass such ordering information in the cross cluster replication 
> protocol. We also have interesting edge cases regarding key type precedence 
> when mutations arrive "simultaneously": we sort deletes ahead of puts. This, 
> especially in the presence of replication, can lead to visible anomalies for 
> clients able to interact with both source and sink. 
> There is a simple solution that removes the possibility that these edge cases 
> can occur: 
> We can detect, when we are about to commit a mutation to a row, if we have 
> already committed a mutation to this same row in the current clock tick. 
> Occurrences of this condition will be rare. We are already tracking current 
> time. We have to know this in order to assign the timestamp. Where this 
> becomes interesting is how we might track the last commit time per row. 
> Making the detection of this case efficient for the normal code path is the 
> bulk of the challenge. We would do this somehow via the memstore. Assuming we 
> can efficiently know if we are about to commit twice to the same row within a 
> single clock tick, we would simply sleep/yield the current thread until the 
> clock ticks over, and then proceed. 


