[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica
[ https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577292#comment-15577292 ] Todd Lipcon commented on KUDU-1618: --- bq. Todd Lipcon thanks for a quick reply, by 'shouldn't have a replica' in above comment, you meant current tablet server where we are trying to bring up the replica, is not part of raft config for that tablet anymore right ? It has other tservers as replicas at this point. That makes sense. I believe tserver keeps trying until there may be another change_config in future which brings in this tserver as replica for that tablet. Right, when you copied the replica it copied the new configuration, and it's not a part of that configuration. So, it knows that it shouldn't try to get the other nodes to vote for it. It would be reasonable to say that we should detect this scenario and mark the tablet as 'failed', but it's actually somewhat useful occasionally -- eg I've used this before to copy a tablet from a running cluster onto my laptop so I could then use tools like 'dump_tablet' against it locally. Given that I don't think you can get into this state without using explicit tablet copy repair tools, I don't think it should really be considered a bug. bq. One follow up Qn is: What state should the replica be in after step 6 ? I see it in RUNNING state, which was slightly confusing, because this replica isn't an active replica at this point. The tablet's state is referring more to the data layer. It's up and running, it has replayed its log, it has valid data, etc. So it's RUNNING even though it's not actually an active part of any raft configuration. If you run ksck on this cluster does ksck report the "extra" replica anywhere? That might be a useful thing to do so we can detect if this ever happens in real life. 
> Add local_replica tool to delete a replica > -- > > Key: KUDU-1618 > URL: https://issues.apache.org/jira/browse/KUDU-1618 > Project: Kudu > Issue Type: Improvement > Components: ops-tooling >Affects Versions: 1.0.0 >Reporter: Todd Lipcon >Assignee: Dinesh Bhat > > Occasionally we've hit cases where a tablet is corrupt in such a way that the > tserver fails to start or crashes soon after starting. Typically we'd prefer > the tablet just get marked FAILED but in the worst case it causes the whole > tserver to fail. > For these cases we should add a 'local_replica' subtool to fully remove a > local tablet. Related, it might be useful to have a 'local_replica archive' > which would create a tarball from the data in this tablet for later > examination by developers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KUDU-1702) Document/Implement read-your-writes for Impala/Spark etc.
[ https://issues.apache.org/jira/browse/KUDU-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-1702: -- Description: Engines like Impala/Spark use many independent client instances, so we should provide a way to have read-your-writes across them, which translates to providing a way to get linearizable behavior. At first this can be done using the APIs that are already available. For instance, if the objective is to be sure to see the results of a write in a following scan, the following steps can be taken: - After a write, the engine should collect the last observed timestamps from the Kudu clients. - The engine's coordinator then takes the max of those timestamps, adds 1, and uses that as a snapshot scan timestamp. One important prerequisite of the behavior above is that scans be done in READ_AT_SNAPSHOT mode. Also, the steps above currently don't actually guarantee the expected behavior, but they should once the current anomalies are taken care of (as part of KUDU-430). In the immediate future we'll add APIs to the Kudu client so as to make the inner workings of getting this behavior oblivious to the engine. The steps will still be the same, i.e. timestamps or timestamp tokens will still be passed around, but the Kudu client will encapsulate the choice of the timestamp for the scan. Later we will add a way to obtain this behavior without timestamp propagation, either by doing a write-side commit-wait, where clients wait out the clock error after/during the last write, thus making sure any future operation will have a higher timestamp; or by doing read-side commit-wait, where we provide an API on the Kudu client for the engine to perform a similar call before the scan call to obtain a scan timestamp.
> Document/Implement read-your-writes for Impala/Spark etc. 
> - > > Key: KUDU-1702 > URL: https://issues.apache.org/jira/browse/KUDU-1702 > Project: Kudu > Issue Type: Sub-task > Components: client, tablet, tserver >Affects Versions: 1.1.0 >Reporter: David Alves >Assignee: David Alves
[jira] [Resolved] (KUDU-1368) Setting snapshot timestamp to last propagated timestamp should include prior writes
[ https://issues.apache.org/jira/browse/KUDU-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves resolved KUDU-1368. --- Resolution: Won't Fix Fix Version/s: 1.1.0 Setting the snapshot scan's timestamp based on the last observed timestamp + 1 is a hack we came up with for tests, to make sure we would get RYW while minimizing the server's chance to block on the scan (because of in-flights or safe time advancement). When the GA consistency work is done, this will no longer be necessary. KUDU-1679 will make sure that, even for the current snapshot scans with no provided timestamp ("now" scans), the timestamp taken by the server for the read will be higher than the time of the last write (which was not guaranteed before). KUDU-1704 will improve on this further by allowing bounded staleness scans, going a bit further than the hack since it will allow reading more recent data. > Setting snapshot timestamp to last propagated timestamp should include prior > writes > --- > > Key: KUDU-1368 > URL: https://issues.apache.org/jira/browse/KUDU-1368 > Project: Kudu > Issue Type: Sub-task > Components: client >Affects Versions: 0.7.0 >Reporter: Todd Lipcon > Fix For: 1.1.0 > > > If I do some writes and then use > scanner.SetSnapshotRaw(client->GetLastPropagatedTimestamp()), it seems like > the snapshot that gets generated does not include the writes I did. I need to > add one to get "read your writes", which seems unintuitive.
[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica
[ https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576389#comment-15576389 ] Dinesh Bhat commented on KUDU-1618: --- [~tlipcon] thanks for the quick reply. By 'shouldn't have a replica' in the above comment, you meant that the current tablet server, where we are trying to bring up the replica, is not part of the raft config for that tablet anymore, right? It has other tservers as replicas at this point. That makes sense. I believe the tserver keeps trying until there may be another change_config in the future which brings in this tserver as a replica for that tablet. One follow-up question: what state should the replica be in after step 6? I see it in RUNNING state, which was slightly confusing, because this replica isn't an active replica at this point.
[jira] [Updated] (KUDU-430) Consistent Operations
[ https://issues.apache.org/jira/browse/KUDU-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-430: - Description: This ticket tracks consistency/isolation work for GA. Scope Doc: https://docs.google.com/document/d/1EaKlJyQdMBz6G-Xn5uktY-d_x0uRmjMCrDGP5rZ7AoI/edit# The sub-tasks that don't target GA will likely be moved somewhere else, or promoted to tasks once this ticket is done, but for now it's handy to have a single view of all the remaining work was: A number of small subtasks remain before we fully support snapshot consistency. In particular, a few of the issues: - right now, after compactions, we can lose history for a given row, and then a snapshot read in the past wouldn't produce correct results. - the C++ client doesn't handle timestamp propagation - we need to evaluate and make sure all of our APIs are in good shape in both Java and C++ clients - need to add some security (hashes) around timestamp propagation to prevent malicious clients from mucking with our machinery Scope Doc: https://docs.google.com/document/d/1EaKlJyQdMBz6G-Xn5uktY-d_x0uRmjMCrDGP5rZ7AoI/edit# > Consistent Operations > - > > Key: KUDU-430 > URL: https://issues.apache.org/jira/browse/KUDU-430 > Project: Kudu > Issue Type: New Feature > Components: client, tablet, tserver >Affects Versions: M4 >Reporter: Todd Lipcon >Assignee: David Alves > Labels: kudu-roadmap
[jira] [Commented] (KUDU-237) Support for encoding REINSERT
[ https://issues.apache.org/jira/browse/KUDU-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576318#comment-15576318 ] David Alves commented on KUDU-237: -- I'm tentatively targeting this for GA. An old patch is available at: https://gerrit.cloudera.org/#/c/4627/ > Support for encoding REINSERT > - > > Key: KUDU-237 > URL: https://issues.apache.org/jira/browse/KUDU-237 > Project: Kudu > Issue Type: Sub-task > Components: tablet >Affects Versions: M3 >Reporter: David Alves >Assignee: David Alves > > REINSERTs make us lose all previous row history. In order for this not to > happen we need to store them somehow. The concern is that if stored as > regular mutations, REINSERTs are rowwise and not columnwise, which could > represent a serious perf hit.
[jira] [Updated] (KUDU-237) Support for encoding REINSERT
[ https://issues.apache.org/jira/browse/KUDU-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-237: - Target Version/s: GA > Support for encoding REINSERT > - > > Key: KUDU-237 > URL: https://issues.apache.org/jira/browse/KUDU-237 > Project: Kudu > Issue Type: Sub-task > Components: tablet >Affects Versions: M3 >Reporter: David Alves >Assignee: David Alves
[jira] [Commented] (KUDU-258) Create an integration test that performs writes with multiple consistency modes
[ https://issues.apache.org/jira/browse/KUDU-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576306#comment-15576306 ] David Alves commented on KUDU-258: -- The gerrit link above is from the dark ages. Here's the transcript of the non-bot part of the convo that happened there: ↩ Patch Set 1: hm, when does the clock give you the same timestamp twice? couldn't we make it part of the clock contract that successive calls to Now don't return the same value? Todd Lipcon Apr 28, 2014 ↩ Patch Set 1: also does this have any impact on how we do snapshots for flush? if we're assigning "future" timestamps to writes for commit wait, and then we use MVCC for flush snapshots, can we miss those writes? David Alves Apr 28, 2014 ↩ Patch Set 1: both Now() and NowLatest() are monotonically increasing, but not against each other, i.e. say a call to NowLatest() returns 15 (10 + 5 error); later on there might be a call to Now() that also returns 15. If both are in-flight at the same time, in release mode, we get a CHECK error when trying to commit the second one. David Alves Apr 28, 2014 ↩ Patch Set 1: I was worried about that but after thinking about it for a while I think that is not a problem. Say we do a commit wait write at NowLatest() = 15, and we then take the flush snap at Now() = 10: the flush will ignore the commit wait write, but the commit wait write is still on the in-flights, so the second flush snap will include it as an in-flight. Not sure that was clear; if you want we can discuss this through a hangout or something. Todd Lipcon Apr 28, 2014 ↩ Patch Set 1: What's the guarantee that the second flush would contain it in the snapshot? Couldn't the write still be in the future? David Alves Apr 28, 2014 ↩ Patch Set 1: when we prepared it (assigned the commit wait timestamp) it was added to the in-flights, right? so this case just makes the in-flight interval larger. I.e. 
if we have a bunch of no_consistency txns and a commit_wait txn we might get a snapshot like: 10,11,12, ↩ Patch Set 8: Code-Review+1 Can we write a test for this case? It would either blow up in StartTransactionAtLatest() or at commit time without this patch. David Alves May 8, 2014 ↩ Patch Set 8: thought about it when I submitted this, but we can't do it without KUDU-156 (mock clock) and AFAIK that is not very high priority right now. Will add a note to KUDU-156 in this regard though. Michael Percy May 8, 2014 ↩ Patch Set 8: Is this fix high priority right now? Why don't we postpone this fix until we do the mock clock? I don't see why this has to block consensus going in either. It's the best kind of bug... when you do something it doesn't like, it crashes. David Alves May 8, 2014 ↩ Patch Set 8: because it's a bug I've seen in the wild? and that bug gets fixed by this change? why wouldn't we fix a bug? Michael Percy May 8, 2014 ↩ Patch Set 8: Well, it sucks that there's no unit test to verify the fix, that's my main concern. LMK if you want to discuss on IRC David Alves May 8, 2014 ↩ Patch Set 8: I get that unit tests are important and I try not to add anything without them, but there seems to be no good reason not to solve a bug I've seen happening (one that is very rare and only appears when really hammering a multi-machine cluster) and that got solved by this otherwise inconsequential 9-line patch. 
reproducing some IRC conversation about this: 15:05 < todd> I'm thinking it's OK because the commit that has a "future" timestamp will commit-wait 15:05 < todd> so therefore it will be in-flight 15:05 < dralves> right 15:05 < todd> and once it's committed, then the MvccSnapshot will be after it 15:05 < dralves> exactly 15:06 < dralves> all that the mix of consistency levels adds to the snapshots stuff is that it makes the interval of in-flight transactions larger 15:07 < dralves> cause we're adding in-flights in the present and in the future 15:07 < dralves> but we never commit in the future, if that makes sense 15:09 < dralves> from another perspective we can think of commit wait transactions as transactions that take a really long time 15:12 < todd> yup 15:17 < todd> dralves: I bet we're going to have some issues with component_lock fairness introducing latency 15:17 < todd> unrelated to your patch 15:17 < todd> but if you have a commit-wait txn, let's say it's sleeping 50ms... then the flush code tries to take the w-lock 15:17 < todd> then any non-commit-wait txns are still blocked from taking the lock 15:18 < todd> I think we should eventually fix this by not (ab)using component_lock to quiesce txns 15:18 < todd> but instead we just need to do a txn epoch rollover type thing 15:18 < dralves> todd: agreed 15:18 < dralves> that would be true for long running transactions anyway 15:18 < todd> or a "transaction fence" 15:18 < todd> yea 15:18 < dralves> or make the lock a bit more unfair 15:18 < dralves>
[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica
[ https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576300#comment-15576300 ] Todd Lipcon commented on KUDU-1618: --- This seems like expected behavior to me. You created a replica on a node that was removed from the raft config, so when it starts up, it's confused because the metadata says it shouldn't have a replica.
[jira] [Updated] (KUDU-258) Create an integration test that performs writes with multiple consistency modes
[ https://issues.apache.org/jira/browse/KUDU-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-258: - Target Version/s: GA > Create an integration test that performs writes with multiple consistency > modes > --- > > Key: KUDU-258 > URL: https://issues.apache.org/jira/browse/KUDU-258 > Project: Kudu > Issue Type: Sub-task > Components: tserver >Affects Versions: M3 >Reporter: David Alves >Assignee: David Alves > > Right now we test consistency modes independently, but they will eventually > coexist and that can spawn trouble (e.g. KUDU-242). We should have an > integration test that runs writes on multiple consistency modes at the same > time. > Plus we should have the YCSB run on multiple consistency modes at the same > time (need to revive/cleanup what I did for the HT paper)
[jira] [Resolved] (KUDU-398) Snapshot scans should only refuse scans with timestamps whose value is > now+error
[ https://issues.apache.org/jira/browse/KUDU-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves resolved KUDU-398. -- Resolution: Fixed Fix Version/s: Public beta > Snapshot scans should only refuse scans with timestamps whose value is > > now+error > -- > > Key: KUDU-398 > URL: https://issues.apache.org/jira/browse/KUDU-398 > Project: Kudu > Issue Type: Sub-task > Components: tserver >Affects Versions: M4 >Reporter: David Alves >Assignee: Todd Lipcon >Priority: Minor > Fix For: Public beta > > > We currently reject a snapshot scan timestamp if its value is beyond > clock->Now(). We should only reject it if its value is beyond clock->Now() + > error, since all values < clock->Now() + error can still be generated by > perfectly valid servers. > We should wait for the timestamp to be safe in all cases. > Marking this as best effort as this does not make kudu return wrong values, > it just makes it a little less tolerant to skew than it could be.
[jira] [Commented] (KUDU-398) Snapshot scans should only refuse scans with timestamps whose value is > now+error
[ https://issues.apache.org/jira/browse/KUDU-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576295#comment-15576295 ] David Alves commented on KUDU-398: -- Oh, this was merged. Created KUDU-1703 to track handling the arbitrary waiting post-clock update that you mentioned in the gerrit.
[jira] [Updated] (KUDU-1703) Handle snapshot reads that might block indefinitely
[ https://issues.apache.org/jira/browse/KUDU-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-1703: -- Description: When we fix safe time advancement, replicas will start to block on snapshot scans for arbitrary amounts of time, waiting to have a consistent view of the world at that timestamp before serving the scan. This will be a serious problem for lagging replicas, which might be several seconds or even minutes behind. Moreover, in the absence of writes, the same will happen even for non-lagging replicas, which will have their safe times updated only when the leader heartbeats. We need to at least make sure that: - Blocked scanner threads are not starving other work. - If the replica's safe time is lagging by a lot, the replica refuses to do the scan. We might also consider other optimizations (like pinging the leader). 
Summary: Handle snapshot reads that might block indefinitely (was: Handle lagging replicas for snapshot reads) > Handle snapshot reads that might block indefinitely > --- > > Key: KUDU-1703 > URL: https://issues.apache.org/jira/browse/KUDU-1703 > Project: Kudu > Issue Type: Sub-task >Affects Versions: 1.1.0 >Reporter: David Alves >Assignee: David Alves
[jira] [Updated] (KUDU-1188) For snapshot read correctness, enforce simple form of leader leases
[ https://issues.apache.org/jira/browse/KUDU-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-1188: -- Component/s: consensus > For snapshot read correctness, enforce simple form of leader leases > --- > > Key: KUDU-1188 > URL: https://issues.apache.org/jira/browse/KUDU-1188 > Project: Kudu > Issue Type: Sub-task > Components: consensus, tserver >Affects Versions: Public beta >Reporter: David Alves >Assignee: David Alves > > Since raft doesn't allow holes in the log, a new leader is guaranteed to have > all the writes that preceded its election and to have them in flight when > elected (meaning mvcc will have those transactions in flight, meaning a > snapshot read will wait for them to complete). So, for writes, leases aren't > really necessary. This is contrary to paxos in spanner, where there is no > timestamp propagation, the log might have holes, and leases are required to > enforce write correctness. > However, some form of lease is necessary to enforce read consistency, in > particular in the following case: > Leader A accepts a write at time 10 which commits and has no following > writes; it then serves a snapshot read at 15, and crashes. > Leader B is elected but has a slow clock which reads 11 when it's ready to > serve writes. It then accepts a write at time 13. > The snapshot read at 15 is now broken. > A simple way to avoid this is to have each replica promise, on each ack, > that if ever elected leader it won't accept writes or serve snapshot reads > until a certain period, say 2 secs, has passed since that ack. On the leader > side, the leader is only allowed to serve snapshot reads up to 2 seconds since > _a majority_ of replicas has ack'd, which in practice usually means 1 replica. > With such a mechanism in place, if the lease is 5, then leader B wouldn't > accept the write at time 13 and would instead wait until 15 had passed, not > breaking the snapshot read. 
[jira] [Updated] (KUDU-420) Implement HT timestamp propagation for the c++ client
[ https://issues.apache.org/jira/browse/KUDU-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-420: - Target Version/s: GA > Implement HT timestamp propagation for the c++ client > - > > Key: KUDU-420 > URL: https://issues.apache.org/jira/browse/KUDU-420 > Project: Kudu > Issue Type: Sub-task > Components: tserver >Affects Versions: M4 >Reporter: David Alves >Assignee: David Alves > > We're missing hybrid time timestamp propagation for the c++ client.
[jira] [Updated] (KUDU-1189) On reads at a snapshot that touch multiple tablets, without the user setting a timestamp, use the timestamp from the first server for following scans
[ https://issues.apache.org/jira/browse/KUDU-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-1189: -- Target Version/s: GA (was: Backlog) > On reads at a snapshot that touch multiple tablets, without the user setting > a timestamp, use the timestamp from the first server for following scans > - > > Key: KUDU-1189 > URL: https://issues.apache.org/jira/browse/KUDU-1189 > Project: Kudu > Issue Type: Sub-task > Components: client >Affects Versions: Public beta >Reporter: David Alves >Assignee: David Alves >Priority: Critical > > When performing a READ_AT_SNAPSHOT, we allow the user not to set a timestamp, > meaning the server will pick a time. If the scan touches multiple tablets, however, > we don't set the timestamp assigned to the first scan on the other scans, > meaning each scan will have its own timestamp, which is wrong.
[jira] [Assigned] (KUDU-931) Address implicit/explicit casts around the slot ref
[ https://issues.apache.org/jira/browse/KUDU-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves reassigned KUDU-931: Assignee: (was: David Alves) > Address implicit/explicit casts around the slot ref > --- > > Key: KUDU-931 > URL: https://issues.apache.org/jira/browse/KUDU-931 > Project: Kudu > Issue Type: Improvement > Components: impala >Affects Versions: Feature Complete >Reporter: David Alves > > We should look into what casts we can handle around the slot ref, when > pushing predicates.
[jira] [Assigned] (KUDU-1059) Make Kudu's wire format be compatible with Impala's tuple/row layout
[ https://issues.apache.org/jira/browse/KUDU-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves reassigned KUDU-1059: - Assignee: (was: David Alves) > Make Kudu's wire format be compatible with Impala's tuple/row layout > > > Key: KUDU-1059 > URL: https://issues.apache.org/jira/browse/KUDU-1059 > Project: Kudu > Issue Type: Improvement > Components: client, tserver >Affects Versions: Feature Complete >Reporter: David Alves > > Kudu's wire format is actually very close to Impala's and we should probably > take it the rest of the way before we release and start to impact "released" > clients. > The potential performance upside for the kudu-impala integration is pretty > big: we can copy whole rows instead of doing tuple by tuple transformations, > and eventually we can make Impala just adopt the data as it arrives from Kudu > and do no copying or transformations at all. > Here is the list of things that need addressing: > - The bitmaps are on opposite sides of the row (Kudu's are at the end and > Impala's are at the beginning). > - Kudu's bitmaps are proportional to the whole column set and contain garbage > for non-nullable columns; Impala's bitmaps only refer to the nullable columns > (and thus do not contain garbage). > - Impala's row layout does padding (8 byte alignment). We should mimic that, > though it should be optional since it seems like it can be costly space-wise. > - Impala's timestamps have a different size and format from Kudu's. We should > create rowwise row blocks with space for Impala to do the transformation in > place, versus having to memcpy the whole thing.
[jira] [Comment Edited] (KUDU-1618) Add local_replica tool to delete a replica
[ https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576176#comment-15576176 ] Dinesh Bhat edited comment on KUDU-1618 at 10/14/16 7:29 PM: - I was trying to repro an issue where I was not able to do a remote tablet copy onto a local_replica if the tablet was DELETE_TOMBSTONED (but still has its metadata file present). However, while reproducing the issue, I saw one replica state that was confusing. Here are the steps I executed: 1. Bring up a cluster with 1 master and 3 tablet servers hosting 3 tablets, each tablet with 3 replicas. 2. A standby tserver was added later. 3. Kill one tserver; after 5 minutes, all replicas on that tserver fail over to the new standby via a change_config. {noformat} I1013 16:31:48.183486 26604 raft_consensus_state.cc:533] T 048c7d202da3469eb1b1973df9510007 P b11d2af1457b4542808407b4d4d1bd29 [term 5 FOLLOWER]: Committing config change with OpId 5.5: config changed from index 4 to 5, VOTER 19acc272821d425582d3dfb9ed2ab7cd (127.61.33.8) added. New config: { opid_index: 5 OBSOLETE_local: false peers { permanent_uuid: "9acfc108d9b446c1be783b6d6e7b49ef" member_type: VOTER last_known_addr { host: "127.95.58.0" port: 33932 } } peers { permanent_uuid: "b11d2af1457b4542808407b4d4d1bd29" member_type: VOTER last_known_addr { host: "127.95.58.2" port: 40670 } } peers { permanent_uuid: "19acc272821d425582d3dfb9ed2ab7cd" member_type: VOTER last_known_addr { host: "127.61.33.8" port: 63532 } } } I1013 16:31:48.184077 26143 catalog_manager.cc:2800] AddServer ChangeConfig RPC for tablet 048c7d202da3469eb1b1973df9510007 on TS 9acfc108d9b446c1be783b6d6e7b49ef (127.95.58.0:33932) with cas_config_opid_index 4: Change config succeeded {noformat} 4.
Use 'local_replica copy_from_remote' to copy one tablet replica before bringing the tserver back up; the command fails: {noformat} I1013 16:43:41.523896 30948 tablet_copy_service.cc:124] Beginning new tablet copy session on tablet 048c7d202da3469eb1b1973df9510007 from peer bb2517bc5f2b4980bb07c06019b5a8e9 at {real_user=dinesh, eff_user=} at 127.61.33.8:40240: session id = bb2517bc5f2b4980bb07c06019b5a8e9-048c7d202da3469eb1b1973df9510007 I1013 16:43:41.524291 30948 tablet_copy_session.cc:142] T 048c7d202da3469eb1b1973df9510007 P 19acc272821d425582d3dfb9ed2ab7cd: Tablet Copy: opened 0 blocks and 1 log segments Already present: Tablet already exists: 048c7d202da3469eb1b1973df9510007 {noformat} 5. Remove the metadata file and WAL segments for that tablet; the copy_from_remote then succeeds (expected). 6. Bring up the killed tserver; now all replicas on it are tombstoned except the one tablet for which we did a copy_from_remote in step 5. The master, which had been incessantly trying to TOMBSTONE the evicted replicas on the previously-down tserver, logs something interesting: {noformat} [dinesh@ve0518 debug]$ I1013 16:55:54.551717 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 048c7d202da3469eb1b1973df9510007 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 4) W1013 16:55:54.552803 26141 catalog_manager.cc:2552] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): delete failed for tablet 048c7d202da3469eb1b1973df9510007 due to a CAS failure.
No further retry: Illegal state: Request specified cas_config_opid_index_less_or_equal of -1 but the committed config has opid_index of 5 I1013 16:55:54.884133 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet e9481b695d34483488af07dfb94a8557 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 3) I1013 16:55:54.885964 26141 catalog_manager.cc:2567] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet e9481b695d34483488af07dfb94a8557 (table test-table [id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted I1013 16:55:54.915202 26141 catalog_manager.cc:2591] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet e3ff6a1529cf46c5b9787fe322a749e6 on bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867) (TS bb2517bc5f2b4980bb07c06019b5a8e9 not found in new config with opid_index 3) I1013 16:55:54.916774 26141 catalog_manager.cc:2567] TS bb2517bc5f2b4980bb07c06019b5a8e9 (127.95.58.1:40867): tablet e3ff6a1529cf46c5b9787fe322a749e6 (table test-table [id=ca8f507e47684ddfa147e2cd232ed773]) successfully deleted {noformat} 7. The tserver now continuously spews log messages like this: {noformat} [dinesh@ve0518 debug]$ W1013 16:55:36.608486 6519 raft_consensus.cc:461] T 048c7d202da3469eb1b1973df9510007 P bb2517bc5f2b4980bb07c06019b5a8e9 [term 5 NON_PARTICIPANT]: Failed to trigger leader election: Illegal state: Not starting election: Node is currently a non-participant in the raft config: opid_index: 5 {noformat}
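The "CAS failure" in the master log above follows a compare-and-swap pattern on the config's opid_index. A minimal Python sketch of that check — names and signature are illustrative, not Kudu's actual code:

```python
# Illustrative sketch of the cas_config_opid_index check seen in the log above:
# the master's DeleteTablet request carries the config opid_index it last knew
# about, and the request is rejected if the committed config is already newer.
def check_delete_tablet(cas_config_opid_index, committed_opid_index):
    """Return None on success, or an error string mimicking the log message."""
    if cas_config_opid_index < committed_opid_index:
        return ("Illegal state: Request specified "
                f"cas_config_opid_index_less_or_equal of {cas_config_opid_index} "
                f"but the committed config has opid_index of {committed_opid_index}")
    return None  # CAS matches: safe to tombstone the replica

# The freshly copied replica carries the new config (opid_index 5), so the
# master's stale delete request fails and is not retried, as in the log.
print(check_delete_tablet(-1, 5))  # error string
print(check_delete_tablet(5, 5))   # None
```

This explains why only the copied tablet survives the master's tombstone sweep: its committed config index has moved past what the master specified.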
[jira] [Updated] (KUDU-1704) Add a new read mode to perform bounded staleness snapshot reads
[ https://issues.apache.org/jira/browse/KUDU-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-1704: -- Issue Type: Sub-task (was: Improvement) Parent: KUDU-430 > Add a new read mode to perform bounded staleness snapshot reads > --- > > Key: KUDU-1704 > URL: https://issues.apache.org/jira/browse/KUDU-1704 > Project: Kudu > Issue Type: Sub-task >Affects Versions: 1.1.0 >Reporter: David Alves >Assignee: David Alves > > It would be useful to be able to perform snapshot reads at a timestamp that > is higher than a client provided timestamp, thus improving recency, but lower > than the server's oldest inflight transaction, thus minimizing the scan's > chance to block. > Such a mode would not guarantee linearizability, but would still allow for > client-local read-your-writes, which seems to be one of the properties users > care about the most. > This should likely be the new default read mode for scanners. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
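The proposed mode amounts to letting the server pick the snapshot timestamp inside a window. A hypothetical sketch of that choice — the function, its bounds, and the fallback behavior are assumptions, not the actual design:

```python
def pick_bounded_staleness_ts(client_ts, oldest_inflight_ts, safe_ts):
    """Pick a snapshot timestamp >= the client's propagated timestamp (for
    recency / read-your-writes) but below the oldest in-flight transaction
    (to minimize the scan's chance to block).
    Returns None if no such timestamp exists and the scan would have to wait."""
    # Highest timestamp we can serve without waiting on in-flight transactions.
    upper = min(oldest_inflight_ts - 1, safe_ts)
    if client_ts > upper:
        return None  # can't satisfy both bounds; fall back to waiting
    return upper     # maximize recency within the window

# The client has observed up to ts 100; the server can serve up to ts 140
# without blocking, so the scan runs at 140 -- fresher than 100, yet safe.
print(pick_bounded_staleness_ts(client_ts=100, oldest_inflight_ts=150, safe_ts=140))
```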
[jira] [Updated] (KUDU-420) Implement HT timestamp propagation for the c++ client
[ https://issues.apache.org/jira/browse/KUDU-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-420: - Issue Type: Sub-task (was: Task) Parent: KUDU-430 > Implement HT timestamp propagation for the c++ client > - > > Key: KUDU-420 > URL: https://issues.apache.org/jira/browse/KUDU-420 > Project: Kudu > Issue Type: Sub-task > Components: tserver >Affects Versions: M4 >Reporter: David Alves >Assignee: David Alves > > We're missing hybrid time timestamp propagation for the c++ client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KUDU-1368) Setting snapshot timestamp to last propagated timestamp should include prior writes
[ https://issues.apache.org/jira/browse/KUDU-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves updated KUDU-1368: -- Issue Type: Sub-task (was: Bug) Parent: KUDU-430 > Setting snapshot timestamp to last propagated timestamp should include prior > writes > --- > > Key: KUDU-1368 > URL: https://issues.apache.org/jira/browse/KUDU-1368 > Project: Kudu > Issue Type: Sub-task > Components: client >Affects Versions: 0.7.0 >Reporter: Todd Lipcon > > If I do some writes and then use > scanner.SetSnapshotRaw(client->GetLastPropagatedTimestamp()), it seems like > the snapshot that gets generated does not include the writes I did. I need to > add one to get "read your writes", which seems unintuitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
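The off-by-one Todd describes is consistent with a snapshot whose upper bound excludes the snapshot timestamp itself. A toy MVCC sketch illustrating that behavior — the exclusive-bound semantics here are an assumption for illustration, not a statement about Kudu's internals:

```python
# Toy MVCC history: each write is (timestamp, value). If the snapshot bound is
# exclusive, scanning at the last written timestamp misses that write, and the
# client must add one to get read-your-writes -- the unintuitive behavior
# described in the issue.
def snapshot_read(history, snap_ts):
    """Return values visible at snap_ts, treating the bound as exclusive."""
    return [v for ts, v in history if ts < snap_ts]

history = [(10, "a"), (12, "b")]
last_propagated = 12
print(snapshot_read(history, last_propagated))      # misses the write at ts 12
print(snapshot_read(history, last_propagated + 1))  # includes it
```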
[jira] [Commented] (KUDU-398) Snapshot scans should only refuse scans with timestamps whose value is > now+error
[ https://issues.apache.org/jira/browse/KUDU-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576200#comment-15576200 ] David Alves commented on KUDU-398: -- This is marked as in-progress, [~tlipcon] did you start to work on this? Or should I take it? > Snapshot scans should only refuse scans with timestamps whose value is > > now+error > -- > > Key: KUDU-398 > URL: https://issues.apache.org/jira/browse/KUDU-398 > Project: Kudu > Issue Type: Sub-task > Components: tserver >Affects Versions: M4 >Reporter: David Alves >Assignee: Todd Lipcon >Priority: Minor > > We currently reject a snapshot scan timestamp if its value is beyond > clock->Now(). We should only reject it if its value is beyond clock->Now() + > error, since all values < clock->Now() + error can still be generated by > perfectly valid servers. > We should wait for the timestamp to be safe in all cases. > Marking this as best effort as this does not make kudu return wrong values, > it just makes it a little less tolerant to skew than it could be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
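The proposed rule only moves the rejection threshold from now to now + error. A small sketch of the two rules side by side — parameter names are illustrative:

```python
def should_reject_snapshot_ts(snap_ts, now, max_error):
    """Proposed rule: reject only timestamps no valid server could have
    generated yet, i.e. snap_ts > now + max_error. (The old rule rejected
    anything beyond 'now' alone.)"""
    return snap_ts > now + max_error

# A timestamp slightly ahead of the local 'now' can come from a perfectly
# valid, slightly skewed server: the scan should wait for it to become safe,
# not be refused outright.
print(should_reject_snapshot_ts(105, now=100, max_error=10))  # False: wait, don't reject
print(should_reject_snapshot_ts(120, now=100, max_error=10))  # True: impossible timestamp
```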
[jira] [Assigned] (KUDU-1703) Handle lagging replicas for snapshot reads
[ https://issues.apache.org/jira/browse/KUDU-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves reassigned KUDU-1703: - Assignee: David Alves > Handle lagging replicas for snapshot reads > -- > > Key: KUDU-1703 > URL: https://issues.apache.org/jira/browse/KUDU-1703 > Project: Kudu > Issue Type: Sub-task >Affects Versions: 1.1.0 >Reporter: David Alves >Assignee: David Alves > > When we fix safe time advancement, replicas will start to block on snapshot > scans for arbitrary amounts of time, waiting to have a consistent view of the > world at that timestamp before serving the scan. > This will be a serious problem for lagging replicas, which might be several > seconds or even minutes behind. Moreover in the absence of writes, the same > will happen even for non-lagging replicas, which will have their safe times > updated only when the leader heartbeats. > We need to at least make sure that: > - Blocked scanner threads are not starving other work. > - If the replica's safe time is lagging by a lot, the replica refuses to do > the scan. > We might also consider other optimizations (like pinging the leader). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
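The two safeguards listed for lagging replicas can be sketched as a pre-scan admission check. The threshold, names, and three-way outcome are hypothetical, not the implemented design:

```python
def admit_snapshot_scan(snap_ts, replica_safe_ts, max_lag):
    """Decide whether to serve, wait on, or refuse a snapshot scan:
    - serve:  the replica's safe time already covers the snapshot timestamp;
    - wait:   safe time is slightly behind; block briefly until it catches up;
    - refuse: the replica lags too far (seconds/minutes) to block a scanner on."""
    lag = snap_ts - replica_safe_ts
    if lag <= 0:
        return "serve"
    if lag > max_lag:
        return "refuse"
    return "wait"

print(admit_snapshot_scan(100, 120, max_lag=50))  # serve
print(admit_snapshot_scan(100, 90, max_lag=50))   # wait
print(admit_snapshot_scan(100, 20, max_lag=50))   # refuse
```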
[jira] [Assigned] (KUDU-1679) Propagate timestamps for scans
[ https://issues.apache.org/jira/browse/KUDU-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Alves reassigned KUDU-1679: - Assignee: David Alves > Propagate timestamps for scans > -- > > Key: KUDU-1679 > URL: https://issues.apache.org/jira/browse/KUDU-1679 > Project: Kudu > Issue Type: Sub-task > Components: tserver >Affects Versions: 1.0.1 >Reporter: David Alves >Assignee: David Alves > > We only propagate timestamps from writes to reads, not between two reads. > This leaves the door open to unrepeatable read anomalies: > If T1, T2 are reads from the same client, where T2 starts after the response > from T1 is received and neither is assigned a timestamp by the client, it > might be the case that T2's observed value actually precedes T1's value in > the row history if T1 and T2 are performed on different servers, as T2 can be > assigned a timestamp that is lower than T1's. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
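Extending propagation to reads means the client carries its highest observed timestamp into the next scan, so a second server cannot assign an older one. A sketch of that client-side bookkeeping — the class and method are assumptions for illustration, not Kudu's actual client API:

```python
class PropagatingClient:
    """Tracks the highest timestamp observed so far and propagates it to reads."""
    def __init__(self):
        self.last_observed = 0

    def read(self, server_assigned_ts):
        # Without propagation the scan would just use server_assigned_ts, which
        # on a different (clock-behind) server may precede an earlier read's
        # timestamp -- the unrepeatable-read anomaly described above.
        # Propagation clamps the new timestamp to at least the last observed one.
        ts = max(server_assigned_ts, self.last_observed)
        self.last_observed = ts
        return ts

c = PropagatingClient()
t1 = c.read(100)  # T1 on server A
t2 = c.read(90)   # server B's clock is behind; propagation keeps T2 >= T1
print(t1, t2)
```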