[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577292#comment-15577292
 ] 

Todd Lipcon commented on KUDU-1618:
---

bq. Todd Lipcon thanks for the quick reply. By 'shouldn't have a replica' in 
the above comment, you meant that the current tablet server where we are trying 
to bring up the replica is no longer part of the raft config for that tablet, 
right? It has other tservers as replicas at this point. That makes sense. I 
believe the tserver keeps trying until some future change_config brings this 
tserver back in as a replica for that tablet.

Right, when you copied the replica it copied the new configuration, and this 
server is not a part of that configuration. So it knows that it shouldn't try 
to get the other nodes to vote for it. It would be reasonable to say that we 
should detect this scenario and mark the tablet as 'failed', but it's actually 
somewhat useful occasionally -- e.g. I've used this before to copy a tablet 
from a running cluster onto my laptop so I could then use tools like 
'dump_tablet' against it locally. Given that I don't think you can get into 
this state without using explicit tablet copy repair tools, I don't think it 
should really be considered a bug.

bq. One follow-up question is: what state should the replica be in after step 
6? I see it in RUNNING state, which was slightly confusing, because this 
replica isn't an active replica at this point.

The tablet's state refers more to the data layer: the tablet is up and running, 
it has replayed its log, it has valid data, etc. So it's RUNNING even though 
it's not actually an active part of any raft configuration. If you run ksck on 
this cluster, does ksck report the "extra" replica anywhere? That might be a 
useful thing to check so we can detect if this ever happens in real life.
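
(For reference, a minimal ksck check with the consolidated 1.x CLI would look 
something like the following; the master address is illustrative:)
{noformat}
$ kudu cluster ksck 127.0.0.1:7051
{noformat}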

> Add local_replica tool to delete a replica
> --
>
> Key: KUDU-1618
> URL: https://issues.apache.org/jira/browse/KUDU-1618
> Project: Kudu
>  Issue Type: Improvement
>  Components: ops-tooling
>Affects Versions: 1.0.0
>Reporter: Todd Lipcon
>Assignee: Dinesh Bhat
>
> Occasionally we've hit cases where a tablet is corrupt in such a way that the 
> tserver fails to start or crashes soon after starting. Typically we'd prefer 
> the tablet just get marked FAILED but in the worst case it causes the whole 
> tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a 
> local tablet. Related, it might be useful to have a 'local_replica archive' 
> which would create a tarball from the data in this tablet for later 
> examination by developers.





[jira] [Updated] (KUDU-1702) Document/Implement read-your-writes for Impala/Spark etc.

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-1702:
--
Description: 
Engines like Impala/Spark use many independent client instances, so we should 
provide a way to have read-your-writes across many independent client 
instances, which translates to providing a way to get linearizable behavior. 

At first this can be done using the APIs that are already available. For 
instance, if the objective is to be sure to have the results of a write in a 
following scan, the following steps can be taken (see the sketch below):
- After a write, the engine should collect the last observed timestamps from 
the kudu clients.
- The engine's coordinator then takes the max of those timestamps, adds 1, and 
uses that as the snapshot scan timestamp.

One important pre-requisite of the behavior above is that scans be done in 
READ_AT_SNAPSHOT mode. Also, the steps above currently don't actually guarantee 
the expected behavior, but they should once the current anomalies are taken 
care of (as part of KUDU-430).

In the immediate future we'll add APIs to the Kudu client so as to make the 
inner workings of getting this behavior transparent to the engine. The steps 
will still be the same, i.e. timestamps or timestamp tokens will still be 
passed around, but the kudu client will encapsulate the choice of the timestamp 
for the scan.

Later we will add a way to obtain this behavior without timestamp propagation: 
either by doing a write-side commit-wait, where clients wait out the clock 
error after/during the last write, thus making sure any future operation will 
have a higher timestamp; or by doing a read-side commit-wait, where we provide 
an API on the kudu client for the engine to perform a similar call before the 
scan call to obtain a scan timestamp.
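
As an editorial illustration (not part of the issue as filed), here is a 
minimal C++ sketch of the manual approach using the existing client API; the 
names MaxObservedTimestamp, ScanAfterWrites, and writer_clients are 
hypothetical:
{noformat}
#include <algorithm>
#include <cstdint>
#include <vector>
#include "kudu/client/client.h"

using kudu::client::KuduClient;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;

// Step 1: after the writes, collect the last observed timestamp from each of
// the engine's independent clients and take the max.
uint64_t MaxObservedTimestamp(
    const std::vector<kudu::client::sp::shared_ptr<KuduClient>>& writer_clients) {
  uint64_t max_ts = 0;
  for (const auto& c : writer_clients) {
    max_ts = std::max(max_ts, c->GetLatestObservedTimestamp());
  }
  return max_ts;
}

// Step 2: the coordinator adds 1 and scans at that snapshot timestamp.
// The scan must use READ_AT_SNAPSHOT mode; error handling omitted.
void ScanAfterWrites(KuduTable* table, uint64_t max_observed_ts) {
  KuduScanner scanner(table);
  scanner.SetReadMode(KuduScanner::READ_AT_SNAPSHOT);
  scanner.SetSnapshotRaw(max_observed_ts + 1);
  scanner.Open();
  // ... iterate with scanner.NextBatch() as usual ...
}
{noformat}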

  was:
Engines like Impala/Spark use many independent client instances, so we should 
provide a way to have read-your-writes across many independent client 
instances, which translates to provide a way to get linearizable behavior.

At first this can be done using the APIs that are already available. For 
instance if the objective is to be sure to have the results of a write in a a 
following scan, the following steps can be taken:
- After a write the engine should collect the last observed timestamps from 
kudu clients
- The engine's coordinator then takes the max of those timestamps, adds 1 and 
uses that as a snapshot scan timestamp.

One important pre-requisite of the behavior above is that scans be done in 
READ_AT_SNAPSHOT mode. Also the steps above currently don't actually guarantee 
the expected behavior, but should as the currently anomalies are taken care of 
(as part of KUDU-430).

In the immediate future we'll add APIs to the Kudu client so as to make the 
inner workings of getting this behavior oblivious to the engine. The steps will 
still be the same, i.e. timestamps or timestamp tokens will still be passed 
around, but the kudu client will encapsulate the choice of the timestamp for 
the scan.

Later we will add a way to obtain this behavior without timestamp propagation, 
either by doing a write-side commit-wait, where clients wait out the clock 
error after/during the last write thus making sure any future operation will 
have a higher timestamp; or by making read-side commit wait, where we provide 
an api on the kudu client for the engine to perform a similar call before the 
scan call to obtain a scan timestamp.


> Document/Implement read-your-writes for Impala/Spark etc.
> -
>
> Key: KUDU-1702
> URL: https://issues.apache.org/jira/browse/KUDU-1702
> Project: Kudu
>  Issue Type: Sub-task
>  Components: client, tablet, tserver
>Affects Versions: 1.1.0
>Reporter: David Alves
>Assignee: David Alves
>
> Engines like Impala/Spark use many independent client instances, so we should 
> provide a way to have read-your-writes across many independent client 
> instances, which translates to providing a way to get linearizable behavior. 
> At first this can be done using the APIs that are already available. For 
> instance, if the objective is to be sure to have the results of a write in a 
> following scan, the following steps can be taken:
> - After a write the engine should collect the last observed timestamps from 
> kudu clients
> - The engine's coordinator then takes the max of those timestamps, adds 1 and 
> uses that as a snapshot scan timestamp.
> One important pre-requisite of the behavior above is that scans be done in 
> READ_AT_SNAPSHOT mode. Also, the steps above currently don't actually 
> guarantee the expected behavior, but they should once the current anomalies 
> are taken care of (as part of KUDU-430).
> In the immediate future we'll add APIs to the Kudu client so as to make the 
> inner workings of getting this 

[jira] [Resolved] (KUDU-1368) Setting snapshot timestamp to last propagated timestamp should include prior writes

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves resolved KUDU-1368.
---
   Resolution: Won't Fix
Fix Version/s: 1.1.0

Setting the snapshot scan's timestamp based on the last observed timestamp + 1 
is a hack we came up with for tests, to make sure we would get RYW while at the 
same time minimizing the server's chance of blocking on the scan (because of 
in-flights or safe time advancement). When the GA consistency work is done, 
this will no longer be necessary.

KUDU-1679 will make sure that, even for the current snapshot scans with no 
provided timestamp ("now" scans), the timestamp taken by the server for the 
read will be higher than the time of the last write (which was not guaranteed 
before).

KUDU-1704 will improve on this further by allowing bounded-staleness scans, 
going a bit beyond the hack since it will allow reading more recent data.
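
For reference, the test hack looks roughly like this with the C++ client (a 
minimal sketch; GetLatestObservedTimestamp stands in for what the report below 
calls GetLastPropagatedTimestamp, and error handling is omitted):
{noformat}
kudu::client::KuduScanner scanner(table.get());
scanner.SetReadMode(kudu::client::KuduScanner::READ_AT_SNAPSHOT);
// +1 so the snapshot covers the client's own last observed write.
scanner.SetSnapshotRaw(client->GetLatestObservedTimestamp() + 1);
scanner.Open();
{noformat}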

> Setting snapshot timestamp to last propagated timestamp should include prior 
> writes
> ---
>
> Key: KUDU-1368
> URL: https://issues.apache.org/jira/browse/KUDU-1368
> Project: Kudu
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: 0.7.0
>Reporter: Todd Lipcon
> Fix For: 1.1.0
>
>
> If I do some writes and then use 
> scanner.SetSnapshotRaw(client->GetLastPropagatedTimestamp()), it seems like 
> the snapshot that gets generated does not include the writes I did. I need to 
> add one to get "read your writes", which seems unintuitive.





[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-14 Thread Dinesh Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576389#comment-15576389
 ] 

Dinesh Bhat commented on KUDU-1618:
---

[~tlipcon] thanks for the quick reply. By 'shouldn't have a replica' in the 
above comment, you meant that the current tablet server where we are trying to 
bring up the replica is no longer part of the raft config for that tablet, 
right? It has other tservers as replicas at this point. That makes sense. I 
believe the tserver keeps trying until some future change_config brings this 
tserver back in as a replica for that tablet.
One follow-up question is: what state should the replica be in after step 6? I 
see it in RUNNING state, which was slightly confusing, because this replica 
isn't an active replica at this point.

> Add local_replica tool to delete a replica
> --
>
> Key: KUDU-1618
> URL: https://issues.apache.org/jira/browse/KUDU-1618
> Project: Kudu
>  Issue Type: Improvement
>  Components: ops-tooling
>Affects Versions: 1.0.0
>Reporter: Todd Lipcon
>Assignee: Dinesh Bhat
>
> Occasionally we've hit cases where a tablet is corrupt in such a way that the 
> tserver fails to start or crashes soon after starting. Typically we'd prefer 
> the tablet just get marked FAILED but in the worst case it causes the whole 
> tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a 
> local tablet. Related, it might be useful to have a 'local_replica archive' 
> which would create a tarball from the data in this tablet for later 
> examination by developers.





[jira] [Updated] (KUDU-430) Consistent Operations

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-430:
-
Description: 
This ticket tracks consistency/isolation work for GA.

Scope Doc: 
https://docs.google.com/document/d/1EaKlJyQdMBz6G-Xn5uktY-d_x0uRmjMCrDGP5rZ7AoI/edit#

The sub-tasks that don't target GA will likely be moved somewhere else, or 
promoted to tasks once this ticket is done, but for now it's handy to have a 
single view of all the remaining work.



  was:
A number of small subtasks remain before we fully support snapshot consistency.

In particular, a few of the issues:
- right now, after compactions, we can lose history for a given row, and then a 
snapshot read in the past wouldn't produce correct results.
- the C++ client doesn't handle timestamp propagation
- we need to evaluate and make sure all of our APIs are in good shape in both 
Java and C++ clients
- need to add some security (hashes) around timestamp propagation to prevent 
malicious clients from mucking with our machinery

Scope Doc: 
https://docs.google.com/document/d/1EaKlJyQdMBz6G-Xn5uktY-d_x0uRmjMCrDGP5rZ7AoI/edit#


> Consistent Operations
> -
>
> Key: KUDU-430
> URL: https://issues.apache.org/jira/browse/KUDU-430
> Project: Kudu
>  Issue Type: New Feature
>  Components: client, tablet, tserver
>Affects Versions: M4
>Reporter: Todd Lipcon
>Assignee: David Alves
>  Labels: kudu-roadmap
>
> This ticket tracks consistency/isolation work for GA.
> Scope Doc: 
> https://docs.google.com/document/d/1EaKlJyQdMBz6G-Xn5uktY-d_x0uRmjMCrDGP5rZ7AoI/edit#
> The sub-tasks that don't target GA will likely be moved somewhere else, or 
> promoted to tasks once this ticket is done, but for now it's handy to have a 
> single view of all the remaining work.





[jira] [Commented] (KUDU-237) Support for encoding REINSERT

2016-10-14 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576318#comment-15576318
 ] 

David Alves commented on KUDU-237:
--

I'm tentatively targeting this for GA. An old patch is available at: 
https://gerrit.cloudera.org/#/c/4627/

> Support for encoding REINSERT
> -
>
> Key: KUDU-237
> URL: https://issues.apache.org/jira/browse/KUDU-237
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Affects Versions: M3
>Reporter: David Alves
>Assignee: David Alves
>
> REINSERTs make us lose all previous row history. In order for this not to 
> happen we need to store them somehow. The concern is that if stored as 
> regular mutations REINSERTs are row-wise and not column-wise, which could 
> represent a serious perf hit.





[jira] [Updated] (KUDU-237) Support for encoding REINSERT

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-237:
-
Target Version/s: GA

> Support for encoding REINSERT
> -
>
> Key: KUDU-237
> URL: https://issues.apache.org/jira/browse/KUDU-237
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Affects Versions: M3
>Reporter: David Alves
>Assignee: David Alves
>
> REINSERTs make us lose all previous row history. In order for this not to 
> happen we need to store them somehow. The concern is that if stored as 
> regular mutations REINSERTs are row-wise and not column-wise, which could 
> represent a serious perf hit.





[jira] [Commented] (KUDU-258) Create an integration test that performs writes with multiple consistency modes

2016-10-14 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576306#comment-15576306
 ] 

David Alves commented on KUDU-258:
--

The gerrit link above is from the dark ages.

Here's the transcript of the non-bot part of the convo that happened there:
↩
Patch Set 1:
hm, when does the clock give you the same timestamp twice? couldn't we make it 
part of the clock contract that successive calls to Now don't return the same 
value?
Todd Lipcon
Apr 28, 2014
↩
Patch Set 1:
also does this have any impact on how we do snapshots for flush? if we're 
assigning "future" timestamps to writes for commit wait, and then we use MVCC 
for flush snapshots, can we miss those writes?
David Alves
Apr 28, 2014
↩
Patch Set 1:
both Now() and NowLatest() are monotonically increasing, but not against each 
other; i.e., say a call to NowLatest() returns 15 (10 + 5 error); later on 
there might be a call to Now() that also returns 15. If both are in-flight at 
the same time, in release mode, we get a CHECK error when trying to commit the 
second one.
David Alves
Apr 28, 2014
↩
Patch Set 1:
I was worried about that, but after thinking about it for a while I think it is 
not a problem. Say we do a commit-wait write at NowLatest() = 15 and we then 
take the flush snap at Now() = 10: the flush will ignore the commit-wait write, 
but the commit-wait write is still in the in-flights, and the second flush snap 
will include it as an in-flight.
Not sure that was clear; if you want we can discuss this through a hangout or 
something.
Todd Lipcon
Apr 28, 2014
↩
Patch Set 1:
What's the guarantee that the second flush would contain it in the snapshot? 
Couldn't the write still be in the future?
David Alves
Apr 28, 2014
↩
Patch Set 1:
when we prepared it (assigned the commit-wait timestamp) it was added to the 
in-flights, right? So this case just makes the in-flight interval larger. I.e. 
if we have a bunch of no_consistency txns and a commit_wait txn we might get a 
snapshot like: 10,11,12, 
↩
Patch Set 8: Code-Review+1
Can we write a test for this case? It would either blow up in 
StartTransactionAtLatest() or at commit time without this patch.
David Alves
May 8, 2014
↩
Patch Set 8:
thought about it when I submitted this, but we can't do it without KUDU-156 
(mock clock) and AFAIK that is not very high priority right now. Will add 
a note to this effect on KUDU-156 though.
Michael Percy
May 8, 2014
↩
Patch Set 8:
Is this fix high priority right now? Why don't we postpone this fix until we do 
the mock clock. I don't see why this has to block consensus going in either.
It's the best kind of bug... when you do something it doesn't like, it crashes.
David Alves
May 8, 2014
↩
Patch Set 8:
cause it's a bug I've seen in the wild? and that bug gets fixed by this change? 
why wouldn't we fix a bug?
Michael Percy
May 8, 2014
↩
Patch Set 8:
Well, it sucks that there's no unit test to verify the fix, that's my main 
concern. LMK if you want to discuss on IRC
David Alves
May 8, 2014
↩
Patch Set 8:
I get that unit tests are important and I try not to add anything without them, 
but it seems like there's no good reason not to fix a bug I've seen happening 
(one that is very rare and only appears when really hammering a multi-machine 
cluster) and that got solved by this otherwise inconsequential 9-line patch.

reproducing some IRC conversation about this:
15:05 < todd> I'm thinking it's OK because the commit that has a "future" 
timestamp will commit-wait
15:05 < todd> so therefore it will be in-flight
15:05 < dralves> right
15:05 < todd> and once it's committed, then the MvccSnapshot will be after it
15:05 < dralves> exactly
15:06 < dralves> all that the mix of consistency levels adds to the snapshots 
stuff is that it makes the interval of in-flight transactions larger
15:07 < dralves> cause we're adding in-flights in the present and in the future
15:07 < dralves> but we never commit in the future, if that makes sense
15:09 < dralves> from another perspective we can think of commit-wait 
transactions as transactions that take a really long time
15:12 < todd> yup
15:17 < todd> dralves: I bet we're going to have some issues with 
component_lock fairness introducing latency
15:17 < todd> unrelated to your patch
15:17 < todd> but if you have a commit-wait txn, let's say it's sleeping 
50ms... then the flush code tries to take the w-lock
15:17 < todd> then any non-commit-wait txns are still blocked from taking the 
lock
15:18 < todd> I think we should eventually fix this by not (ab)using 
component_lock to quiesce txns
15:18 < todd> but instead we just need to do a txn epoch rollover type thing
15:18 < dralves> todd: agreed
15:18 < dralves> that would be true for long running transactions anyway
15:18 < todd> or a "transaction fence"
15:18 < todd> yea
15:18 < dralves> or make the lock a bit more unfair
15:18 < dralves> 
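
To make the Now()/NowLatest() collision discussed above concrete, here is an 
editorial illustration with hypothetical clock values (not code from the 
patch):
{noformat}
// Physical time = 10, max clock error = 5.
Timestamp a = clock->NowLatest();  // returns 10 + 5 = 15 (commit-wait write)
// ... physical time advances to 15 while that txn is still in flight ...
Timestamp b = clock->Now();        // also returns 15
// Two in-flight txns now share timestamp 15, and committing the second one
// trips the CHECK error described above.
{noformat}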

[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576300#comment-15576300
 ] 

Todd Lipcon commented on KUDU-1618:
---

This seems like expected behavior to me. You created a replica on a node that 
was removed from the raft config, so when it starts up, it's confused because 
the metadata says it shouldn't have a replica.

> Add local_replica tool to delete a replica
> --
>
> Key: KUDU-1618
> URL: https://issues.apache.org/jira/browse/KUDU-1618
> Project: Kudu
>  Issue Type: Improvement
>  Components: ops-tooling
>Affects Versions: 1.0.0
>Reporter: Todd Lipcon
>Assignee: Dinesh Bhat
>
> Occasionally we've hit cases where a tablet is corrupt in such a way that the 
> tserver fails to start or crashes soon after starting. Typically we'd prefer 
> the tablet just get marked FAILED but in the worst case it causes the whole 
> tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a 
> local tablet. Related, it might be useful to have a 'local_replica archive' 
> which would create a tarball from the data in this tablet for later 
> examination by developers.





[jira] [Updated] (KUDU-258) Create an integration test that performs writes with multiple consistency modes

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-258:
-
Target Version/s: GA

> Create an integration test that performs writes with multiple consistency 
> modes
> ---
>
> Key: KUDU-258
> URL: https://issues.apache.org/jira/browse/KUDU-258
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: M3
>Reporter: David Alves
>Assignee: David Alves
>
> Right now we test consistency modes independently, but they will eventually 
> coexist, and that can cause trouble (e.g. KUDU-242). We should have an 
> integration test that runs writes with multiple consistency modes at the same 
> time.
> Plus we should have YCSB run with multiple consistency modes at the same 
> time (need to revive/clean up what I did for the HT paper).





[jira] [Resolved] (KUDU-398) Snapshot scans should only refuse scans with timestamps whose value is > now+error

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves resolved KUDU-398.
--
   Resolution: Fixed
Fix Version/s: Public beta

> Snapshot scans should only refuse scans with timestamps whose value is > 
> now+error
> --
>
> Key: KUDU-398
> URL: https://issues.apache.org/jira/browse/KUDU-398
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: M4
>Reporter: David Alves
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: Public beta
>
>
> We currently reject a snapshot scan timestamp if its value is beyond 
> clock->Now(). We should only reject it if its value is beyond clock->Now() + 
> error, since all values < clock->Now() + error can still be generated by 
> perfectly valid servers.
> We should wait for the timestamp to be safe in all cases.
> Marking this as best effort, as this does not make kudu return wrong values; 
> it just makes it a little less tolerant to skew than it could be.





[jira] [Commented] (KUDU-398) Snapshot scans should only refuse scans with timestamps whose value is > now+error

2016-10-14 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576295#comment-15576295
 ] 

David Alves commented on KUDU-398:
--

Oh, this was merged. Created KUDU-1703 to track handling the arbitrary 
post-clock-update waiting that you mentioned in gerrit.

> Snapshot scans should only refuse scans with timestamps whose value is > 
> now+error
> --
>
> Key: KUDU-398
> URL: https://issues.apache.org/jira/browse/KUDU-398
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: M4
>Reporter: David Alves
>Assignee: Todd Lipcon
>Priority: Minor
>
> We currently reject a snapshot scan timestamp if its value is beyond 
> clock->Now(). We should only reject it if its value is beyond clock->Now() + 
> error, since all values < clock->Now() + error can still be generated by 
> perfectly valid servers.
> We should wait for the timestamp to be safe in all cases.
> Marking this as best effort, as this does not make kudu return wrong values; 
> it just makes it a little less tolerant to skew than it could be.





[jira] [Updated] (KUDU-1703) Handle snapshot reads that might block indefinitely

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-1703:
--
Description: 
When we fix safe time advancement, replicas will start to block on snapshot 
scans for arbitrary amounts of time, waiting to have a consistent view of the 
world at that timestamp before serving the scan. This will be a serious problem 
for lagging replicas, which might be several seconds or even minutes behind.

Moreover in the absence of writes, the same will happen even for non-lagging 
replicas, which will have their safe times updated only when the leader 
heartbeats.

We need to at least make sure that:
- Blocked scanner threads are not starving other work.
- If the replica's safe time is lagging by a lot, the replica refuses to do the 
scan.

We might also consider other optimizations (like pinging the leader).


  was:
When we fix safe time advancement, replicas will start to block on snapshot 
scans for arbitrary amounts of time, waiting to have a consistent view of the 
world at that timestamp before serving the scan.

This will be a serious problem for lagging replicas, which might be several 
seconds or even minutes behind. Moreover in the absence of writes, the same 
will happen even for non-lagging replicas, which will have their safe times 
updated only when the leader heartbeats.

We need to at least make sure that:
- Blocked scanner threads are not starving other work.
- If the replica's safe time is lagging by a lot, the replica refuses to do the 
scan.

We might also consider other optimizations (like pinging the leader).


Summary: Handle snapshot reads that might block indefinitely  (was: 
Handle lagging replicas for snapshot reads)

> Handle snapshot reads that might block indefinitely
> ---
>
> Key: KUDU-1703
> URL: https://issues.apache.org/jira/browse/KUDU-1703
> Project: Kudu
>  Issue Type: Sub-task
>Affects Versions: 1.1.0
>Reporter: David Alves
>Assignee: David Alves
>
> When we fix safe time advancement, replicas will start to block on snapshot 
> scans for arbitrary amounts of time, waiting to have a consistent view of the 
> world at that timestamp before serving the scan. This will be a serious 
> problem for lagging replicas, which might be several seconds or even minutes 
> behind.
> Moreover in the absence of writes, the same will happen even for non-lagging 
> replicas, which will have their safe times updated only when the leader 
> heartbeats.
> We need to at least make sure that:
> - Blocked scanner threads are not starving other work.
> - If the replica's safe time is lagging by a lot, the replica refuses to do 
> the scan.
> We might also consider other optimizations (like pinging the leader).





[jira] [Updated] (KUDU-1188) For snapshot read correctness, enforce simple form of leader leases

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-1188:
--
Component/s: consensus

> For snapshot read correctness, enforce simple form of leader leases
> ---
>
> Key: KUDU-1188
> URL: https://issues.apache.org/jira/browse/KUDU-1188
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, tserver
>Affects Versions: Public beta
>Reporter: David Alves
>Assignee: David Alves
>
> Since raft doesn't allow holes in the log, a new leader is guaranteed to have 
> all the writes that preceded its election and to have them in flight when 
> elected (meaning mvcc will have those transactions in flight, meaning a 
> snapshot read will wait for them to complete). So, for writes, leases aren't 
> really necessary. This is contrary to paxos in spanner where there is no 
> timestamp propagation and the log might have holes and leases are required to 
> enforce write correctness.
> However some form of lease is necessary to enforce read consistency. In 
> particular in the following case:
> Leader A accepts a write at time 10, which commits and has no following 
> writes; it then serves a snapshot read at 15, and crashes.
> Leader B is elected but has a slow clock, which reads 11 when it's ready to 
> serve writes. It then accepts a write at time 13.
> The snapshot read at 15 is now broken.
> A simple way to avoid this is to have each replica promise, on each ack, 
> that if ever elected leader it won't accept writes or serve snapshot reads 
> until a certain period, say 2 seconds, has passed since that ack. On the 
> leader side, the leader is only allowed to serve snapshot reads up to 2 
> seconds after _a majority_ of replicas has ack'd, which in practice usually 
> means 1 replica.
> With such a mechanism in place, if the lease is 5, then leader B wouldn't 
> accept the write at time 13 and would instead wait until 15 had passed, not 
> breaking the snapshot read.





[jira] [Updated] (KUDU-420) Implement HT timestamp propagation for the c++ client

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-420:
-
Target Version/s: GA

> Implement HT timestamp propagation for the c++ client
> -
>
> Key: KUDU-420
> URL: https://issues.apache.org/jira/browse/KUDU-420
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: M4
>Reporter: David Alves
>Assignee: David Alves
>
> We're missing hybrid time timestamp propagation for the c++ client.





[jira] [Updated] (KUDU-1189) On reads at a snapshot that touch multiple tablets, without the user setting a timestamp, use the timestamp from the first server for following scans

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-1189:
--
Target Version/s: GA  (was: Backlog)

> On reads at a snapshot that touch multiple tablets, without the user setting 
> a timestamp, use the timestamp from the first server for following scans
> -
>
> Key: KUDU-1189
> URL: https://issues.apache.org/jira/browse/KUDU-1189
> Project: Kudu
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: Public beta
>Reporter: David Alves
>Assignee: David Alves
>Priority: Critical
>
> When performing a READ_AT_SNAPSHOT, we allow the user not to set a timestamp, 
> meaning the server will pick a time. If the scan touches multiple tablets, 
> however, we don't set the timestamp assigned to the first scan on the other 
> scans, meaning each scan will have its own timestamp, which is wrong.





[jira] [Assigned] (KUDU-931) Address implicit/explicit casts around the slot ref

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves reassigned KUDU-931:


Assignee: (was: David Alves)

> Address implicit/explicit casts around the slot ref
> ---
>
> Key: KUDU-931
> URL: https://issues.apache.org/jira/browse/KUDU-931
> Project: Kudu
>  Issue Type: Improvement
>  Components: impala
>Affects Versions: Feature Complete
>Reporter: David Alves
>
> We should look into what casts we can handle around the slot ref when 
> pushing predicates.





[jira] [Assigned] (KUDU-1059) Make Kudu's wire format be compatible with Impala's tuple/row layout

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves reassigned KUDU-1059:
-

Assignee: (was: David Alves)

> Make Kudu's wire format be compatible with Impala's tuple/row layout
> 
>
> Key: KUDU-1059
> URL: https://issues.apache.org/jira/browse/KUDU-1059
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, tserver
>Affects Versions: Feature Complete
>Reporter: David Alves
>
> Kudu's wire format is actually very close to Impala's, and we should probably 
> take it the rest of the way before we release and start to impact "released" 
> clients.
> The potential performance upside for the kudu-impala integration is pretty 
> big: we can copy whole rows instead of doing tuple-by-tuple transformations, 
> and eventually we can make Impala just adopt the data as it arrives from Kudu 
> and do no copying or transformations at all.
> Here is the list of things that need addressing:
> - The bitmaps are in opposite sides of the row (Kudu's are at the end and 
> Impala's are at the beginning).
> - Kudu's bitmaps are proportional to the whole column set and contain garbage 
> for non-nullable columns, Impala's bitmaps only refer to the nullable columns 
> (and thus do not contain garbage).
> - Impala's row layout does padding (8-byte alignment). We should mimic that, 
> though it should be optional since it seems like it can be costly space-wise.
> - Impala's timestamps have a different size and format from Kudu's. We should 
> create rowwise row blocks with space for Impala to do the transformation in 
> place, versus having to memcpy the whole thing.





[jira] [Comment Edited] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-14 Thread Dinesh Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576176#comment-15576176
 ] 

Dinesh Bhat edited comment on KUDU-1618 at 10/14/16 7:29 PM:
-

I was trying to repro an issue where I was not able to do a remote tablet copy 
onto a local_replica if the tablet was DELETE_TOMBSTONED (but had its metadata 
file present). However, along with the issue reproduction, I saw one state of 
the replica which was confusing. Here are the steps I executed:
1. Bring up a cluster with 1 master and 3 tablet servers hosting 3 tablets, 
each tablet with 3 replicas.
2. There was a standby tserver which was added later.
3. Kill one tserver; after 5 mins, all replicas on that tserver fail over to 
the new standby with a change_config.
{noformat}
I1013 16:31:48.183486 26604 raft_consensus_state.cc:533] T 
048c7d202da3469eb1b1973df9510007 P b11d2af1457b4542808407b4d4d1bd29 [term 5 
FOLLOWER]: Committing config change with OpId 5.5: config changed from index 4 
to 5, VOTER 19acc272821d425582d3dfb9ed2ab7cd (127.61.33.8) added. New config: { 
opid_index: 5 OBSOLETE_local: false peers { permanent_uuid: 
"9acfc108d9b446c1be783b6d6e7b49ef" member_type: VOTER last_known_addr { host: 
"127.95.58.0" port: 33932 } } peers { permanent_uuid: 
"b11d2af1457b4542808407b4d4d1bd29" member_type: VOTER last_known_addr { host: 
"127.95.58.2" port: 40670 } } peers { permanent_uuid: 
"19acc272821d425582d3dfb9ed2ab7cd" member_type: VOTER last_known_addr { host: 
"127.61.33.8" port: 63532 } } }
I1013 16:31:48.184077 26143 catalog_manager.cc:2800] AddServer ChangeConfig RPC 
for tablet 048c7d202da3469eb1b1973df9510007 on TS 
9acfc108d9b446c1be783b6d6e7b49ef (127.95.58.0:33932) with cas_config_opid_index 
4: Change config succeeded
{noformat}
4. Use 'local_replica copy_from_remote' to copy one tablet replica before 
bringing the tserver back up; the command fails:
{noformat}
I1013 16:43:41.523896 30948 tablet_copy_service.cc:124] Beginning new tablet 
copy session on tablet 048c7d202da3469eb1b1973df9510007 from peer 
bb2517bc5f2b4980bb07c06019b5a8e9 at {real_user=dinesh, eff_user=} at 
127.61.33.8:40240: session id = 
bb2517bc5f2b4980bb07c06019b5a8e9-048c7d202da3469eb1b1973df9510007
I1013 16:43:41.524291 30948 tablet_copy_session.cc:142] T 
048c7d202da3469eb1b1973df9510007 P 19acc272821d425582d3dfb9ed2ab7cd: Tablet 
Copy: opened 0 blocks and 1 log segments
Already present: Tablet already exists: 048c7d202da3469eb1b1973df9510007
{noformat}
5. Remove the metadata file and WAL log for that tablet, and the 
copy_from_remote succeeds at this point (expected; see the invocation sketch 
after step 7).
6. Bring up the killed tserver; now all replicas on it are tombstoned except 
the one tablet for which we did a copy_from_remote in step 5. The master, which 
was incessantly trying to TOMBSTONE the evicted replicas on the tserver that 
was down earlier, throws some interesting logs:
{noformat}
[dinesh@ve0518 debug]$ I1013 16:55:54.551717 26141 catalog_manager.cc:2591] 
Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
048c7d202da3469eb1b1973df9510007 on bb2517bc5f2b4980bb07c06019b5a8e9 
(127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new 
config with opid_index 4)
W1013 16:55:54.552803 26141 catalog_manager.cc:2552] TS 
bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): delete failed for tablet 
048c7d202da3469eb1b1973df9510007 due to a CAS failure. No further retry: 
Illegal state: Request specified cas_config_opid_index_less_or_equal of -1 but 
the committed config has opid_index of 5
I1013 16:55:54.884133 26141 catalog_manager.cc:2591] Sending 
DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
e9481b695d34483488af07dfb94a8557 on bb2517bc5f2b4980bb07c06019b5a8e9 
(127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new 
config with opid_index 3)
I1013 16:55:54.885964 26141 catalog_manager.cc:2567] TS 
bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet 
e9481b695d34483488af07dfb94a8557 (table test-table 
[id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
I1013 16:55:54.915202 26141 catalog_manager.cc:2591] Sending 
DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
e3ff6a1529cf46c5b9787fe322a749e6 on bb2517bc5f2b4980bb07c06019b5a8e9 
(127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new 
config with opid_index 3)
I1013 16:55:54.916774 26141 catalog_manager.cc:2567] TS 
bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet 
e3ff6a1529cf46c5b9787fe322a749e6 (table test-table 
[id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted
{noformat}
7. The tserver continuously spews log messages like this now:
{noformat}
[dinesh@ve0518 debug]$ W1013 16:55:36.608486  6519 raft_consensus.cc:461] T 
048c7d202da3469eb1b1973df9510007 P bb2517bc5f2b4980bb07c06019b5a8e9 [term 5 
NON_PARTICIPANT]: Failed to trigger leader election: Illegal state: Not 
starting election: Node is currently a non-participant in the raft config: 
opid_index: 5 
{noformat}
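
For reference, the tablet copy in steps 4-5 used the 'local_replica 
copy_from_remote' tool; a minimal sketch of the invocation follows (the 
placeholders and fs flags are illustrative assumptions, not taken from the 
original run):
{noformat}
# Copy a tablet replica from a remote tserver into the local fs layout.
# <tablet_id> and <source_tserver_addr> are placeholders.
$ kudu local_replica copy_from_remote <tablet_id> <source_tserver_addr> \
    -fs_wal_dir=<wal_dir> -fs_data_dirs=<data_dirs>
{noformat}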

[jira] [Updated] (KUDU-1704) Add a new read mode to perform bounded staleness snapshot reads

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-1704:
--
Issue Type: Sub-task  (was: Improvement)
Parent: KUDU-430

> Add a new read mode to perform bounded staleness snapshot reads
> ---
>
> Key: KUDU-1704
> URL: https://issues.apache.org/jira/browse/KUDU-1704
> Project: Kudu
>  Issue Type: Sub-task
>Affects Versions: 1.1.0
>Reporter: David Alves
>Assignee: David Alves
>
> It would be useful to be able to perform snapshot reads at a timestamp that 
> is higher than a client-provided timestamp, thus improving recency, but lower 
> than the server's oldest in-flight transaction, thus minimizing the scan's 
> chance of blocking.
> Such a mode would not guarantee linearizability, but would still allow for 
> client-local read-your-writes, which seems to be one of the properties users 
> care about the most.
> This should likely be the new default read mode for scanners.





[jira] [Updated] (KUDU-420) Implement HT timestamp propagation for the c++ client

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-420:
-
Issue Type: Sub-task  (was: Task)
Parent: KUDU-430

> Implement HT timestamp propagation for the c++ client
> -
>
> Key: KUDU-420
> URL: https://issues.apache.org/jira/browse/KUDU-420
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: M4
>Reporter: David Alves
>Assignee: David Alves
>
> We're missing hybrid time timestamp propagation for the c++ client.





[jira] [Updated] (KUDU-1368) Setting snapshot timestamp to last propagated timestamp should include prior writes

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves updated KUDU-1368:
--
Issue Type: Sub-task  (was: Bug)
Parent: KUDU-430

> Setting snapshot timestamp to last propagated timestamp should include prior 
> writes
> ---
>
> Key: KUDU-1368
> URL: https://issues.apache.org/jira/browse/KUDU-1368
> Project: Kudu
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: 0.7.0
>Reporter: Todd Lipcon
>
> If I do some writes and then use 
> scanner.SetSnapshotRaw(client->GetLastPropagatedTimestamp()), it seems like 
> the snapshot that gets generated does not include the writes I did. I need to 
> add one to get "read your writes", which seems unintuitive.





[jira] [Commented] (KUDU-398) Snapshot scans should only refuse scans with timestamps whose value is > now+error

2016-10-14 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576200#comment-15576200
 ] 

David Alves commented on KUDU-398:
--

This is marked as in-progress. [~tlipcon], did you start working on this, or 
should I take it?

> Snapshot scans should only refuse scans with timestamps whose value is > 
> now+error
> --
>
> Key: KUDU-398
> URL: https://issues.apache.org/jira/browse/KUDU-398
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: M4
>Reporter: David Alves
>Assignee: Todd Lipcon
>Priority: Minor
>
> We currently reject a snapshot scan timestamp if its value is beyond 
> clock->Now(). We should only reject it if its value is beyond clock->Now() + 
> error, since all values < clock->Now() + error can still be generated by 
> perfectly valid servers.
> We should wait for the timestamp to be safe in all cases.
> Marking this as best effort, as this does not make kudu return wrong values; 
> it just makes it a little less tolerant to skew than it could be.





[jira] [Assigned] (KUDU-1703) Handle lagging replicas for snapshot reads

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves reassigned KUDU-1703:
-

Assignee: David Alves

> Handle lagging replicas for snapshot reads
> --
>
> Key: KUDU-1703
> URL: https://issues.apache.org/jira/browse/KUDU-1703
> Project: Kudu
>  Issue Type: Sub-task
>Affects Versions: 1.1.0
>Reporter: David Alves
>Assignee: David Alves
>
> When we fix safe time advancement, replicas will start to block on snapshot 
> scans for arbitrary amounts of time, waiting to have a consistent view of the 
> world at that timestamp before serving the scan.
> This will be a serious problem for lagging replicas, which might be several 
> seconds or even minutes behind. Moreover in the absence of writes, the same 
> will happen even for non-lagging replicas, which will have their safe times 
> updated only when the leader heartbeats.
> We need to at least make sure that:
> - Blocked scanner threads are not starving other work.
> - If the replica's safe time is lagging by a lot, the replica refuses to do 
> the scan.
> We might also consider other optimizations (like pinging the leader).





[jira] [Assigned] (KUDU-1679) Propagate timestamps for scans

2016-10-14 Thread David Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Alves reassigned KUDU-1679:
-

Assignee: David Alves

> Propagate timestamps for scans
> --
>
> Key: KUDU-1679
> URL: https://issues.apache.org/jira/browse/KUDU-1679
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: 1.0.1
>Reporter: David Alves
>Assignee: David Alves
>
> We only propagate timestamps from writes to reads, not between two reads. 
> This leaves the door open to unrepeatable-read anomalies:
> Consider reads T1 and T2 from the same client, where T2 starts after the 
> response from T1 is received and neither is assigned a timestamp by the 
> client. T2's observed value may actually precede T1's value in the row 
> history if T1 and T2 are performed on different servers, as T2 can be 
> assigned a timestamp that is lower than T1's.


