apiserver/testing.FakeAuthoriser is not a good mock
Hello,

This message serves as background to a series of changes coming soon. The goal of this message, and of the changes that follow, is to describe the problems I see with the FakeAuthorizer and to describe in general terms the steps I propose to address those shortcomings.

Background:

The FakeAuthorizer implements the apiserver/common.Authorizer interface and is used heavily in the apiserver tests. The other implementation of this interface is the apiserver itself, specifically apiserver.srvRoot. The goal of the FakeAuthorizer is to imitate apiserver.srvRoot. I assert that this is currently not true, and that the tests are structured to expect the behaviour of the FakeAuthorizer rather than the contract of apiserver/common.Authorizer as defined by apiserver.srvRoot.

Issues:

This section describes the ways the FakeAuthorizer diverges from the behaviour of the srvRoot authoriser.

1. In srvRoot, all authorisation stems ultimately from the state.Entity object referenced by the srvRoot instance. This is a problem because our authorisation system is based on the _tag_ passed over the API, not the state.Entity, although for practical purposes the tag of the authenticated entity and the entity itself are interchangeable.

Proposal: remove Authorizer.GetAuthEntity() and replace all calls with Authorizer.GetAuthTag(). I have a branch for this; it's under 30 lines, as there are only three places in the apiserver where we do this, and they usually call the .Tag method on the entity they get back from GetAuthEntity.

2. This extends from point 1: while srvRoot derives everything from srvRoot.entity, the FakeAuthorizer allows the caller to pass a different value for the tag and for the entity of the authorisee. This leads to impossible situations where the FakeAuthorizer returns nil for GetAuthEntity but a non-nil value for GetAuthTag. Other permutations exist in our tests and are expected by the test logic.
Proposal: Once step 1 is fixed, the difference between srvRoot and FakeAuthorizer is that the former starts from a state.Entity and derives a tag, from which the other values are derived, while the latter skips the initial step and starts from a tag. This work falls out of the solutions to steps 1 and 3.

3. The helper methods AuthMachineAgent() and AuthUnitAgent() on the Authorizer interface are implemented differently in srvRoot and FakeAuthorizer. In tests it is quite common to take the FakeAuthorizer from the test suite, copy it, and change some of these values, leading to impossible situations, i.e. the tag or entity of the FakeAuthorizer pointing to a Unit, but AuthMachineAgent() returning true.

Proposal: The simplest solution is to copy the implementation of these helper methods from srvRoot to FakeAuthorizer. A more involved solution would be to factor these methods out into functions in the apiserver/common package that take an Authorizer. This second step may not pay for itself.

These steps resolve the majority of the discontinuity between the two implementations and will resolve a set of blocking issues I've hit while converting the apiserver and state packages to speak tags natively.

Thanks for your time

Dave

--
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
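[Editor's illustration] The proposals above can be sketched in a few lines of Go. This is a simplified stand-in, not the real apiserver code: tags are plain strings here, whereas the real Authorizer works with Juju's tag types. The point is that AuthMachineAgent() and AuthUnitAgent() are derived from the single authenticated tag, so the impossible combinations described in point 3 cannot be constructed.

```go
package main

import (
	"fmt"
	"strings"
)

// fakeAuthorizer is a simplified stand-in for apiserver/testing.FakeAuthorizer.
// Instead of carrying independent AuthMachineAgent/AuthUnitAgent booleans that
// tests can set inconsistently, the helpers derive their answers from the tag
// alone, mirroring how srvRoot derives everything from one source of truth.
type fakeAuthorizer struct {
	Tag string // e.g. "machine-0" or "unit-wordpress-0"
}

func (fa fakeAuthorizer) GetAuthTag() string { return fa.Tag }

// AuthMachineAgent is true only when the tag names a machine.
func (fa fakeAuthorizer) AuthMachineAgent() bool {
	return strings.HasPrefix(fa.Tag, "machine-")
}

// AuthUnitAgent is true only when the tag names a unit.
func (fa fakeAuthorizer) AuthUnitAgent() bool {
	return strings.HasPrefix(fa.Tag, "unit-")
}

func main() {
	fa := fakeAuthorizer{Tag: "unit-wordpress-0"}
	// A unit tag with AuthMachineAgent() == true can no longer be built.
	fmt.Println(fa.GetAuthTag(), fa.AuthMachineAgent(), fa.AuthUnitAgent())
}
```

With this shape, a test that copies the authorizer and changes the tag automatically gets consistent answers from the helper methods.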
Re: Possible landing bot issues
I managed to sneak one in an hour ago, is the fix live?

On Thu, Jul 31, 2014 at 10:18 AM, Ian Booth wrote:
>> I think there are two problems with the current landing bot.
>
> Thanks for raising the issues.
>
>> 1. the bot doesn't appear to NACK changes that are not gofmt'd, this
>> means changes land which others cannot fix because their .gitconfig
>> prevents them.
>
> Still on the todo list, sorry. Will be fixed asap.
>
>> 2. the bot doesn't appear to be enforcing its lockdown rules as the
>> comment it places on the PR contains the magic string that the bot
>> looks for when it sees another $$ instruction.
>
> Curtis has already fixed this plus landed some other improvements (see
> other emails).
Re: Possible landing bot issues
> I think there are two problems with the current landing bot.

Thanks for raising the issues.

> 1. the bot doesn't appear to NACK changes that are not gofmt'd, this
> means changes land which others cannot fix because their .gitconfig
> prevents them.

Still on the todo list, sorry. Will be fixed asap.

> 2. the bot doesn't appear to be enforcing its lockdown rules as the
> comment it places on the PR contains the magic string that the bot
> looks for when it sees another $$ instruction.

Curtis has already fixed this plus landed some other improvements (see other emails).
Possible landing bot issues
Hello,

I think there are two problems with the current landing bot.

1. The bot doesn't appear to NACK changes that are not gofmt'd; this means changes land which others cannot fix because their .gitconfig prevents them.

2. The bot doesn't appear to be enforcing its lockdown rules, as the comment it places on the PR contains the magic string that the bot looks for when it sees another $$ instruction.

Thank you for your time.

Dave
Re: CI regressions will block merges
mgz advised me to revise the syntax to avoid confusing the Juju bot:

> Adding $$fixes-$$ as a comment to the PR will signal it
> is allowed to test. The token can be in any comment and in any part

When a merge into stable or devel is blocked by a critical regression, you can signal that your branch fixes that regression with this syntax:

    __fixes-__

The Juju bot is currently looking for these tokens in a comment to test merges into master:

    __fixes-1347715__
    __fixes-1342725__

Merges into 1.20 are unblocked.

I also added a token to force a test merge. There are cases where it is critical to merge and there is no critical regression. Maybe you need to merge into both 1.20 and master. When we judge that the branch needs to merge, add this token to a comment:

    __JFDI__

--
Curtis Hovey
Canonical Cloud Development and Operations
http://launchpad.net/~sinzui
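[Editor's illustration] A minimal sketch of the kind of token scanning described above, in Go. This is not the actual landing-bot implementation (which is not shown in this thread); the function name and signature are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// fixesToken matches the __fixes-<bug-number>__ form described above.
var fixesToken = regexp.MustCompile(`__fixes-(\d+)__`)

// jfdiToken matches the forced-merge override token.
var jfdiToken = regexp.MustCompile(`__JFDI__`)

// allowedToMerge reports whether a PR comment unblocks a merge while the
// branch is locked down by critical regressions (given as bug numbers).
func allowedToMerge(comment string, blockingBugs []string) bool {
	if jfdiToken.MatchString(comment) {
		return true // forced merge
	}
	for _, m := range fixesToken.FindAllStringSubmatch(comment, -1) {
		for _, bug := range blockingBugs {
			if m[1] == bug {
				return true // this branch fixes a blocking regression
			}
		}
	}
	return false
}

func main() {
	blocked := []string{"1347715", "1342725"}
	fmt.Println(allowedToMerge("$$merge$$ __fixes-1347715__", blocked)) // unblocked
	fmt.Println(allowedToMerge("just a normal comment", blocked))      // still blocked
}
```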
Re: help please: mongo/mgo panic
Alright, the guess last night was correct, and the candidate fix as well.

I've managed to reproduce the problem by stressing out the scenario described with 4 concurrent runners running the following two operations, while the chaos mechanism injects random slowdowns at various critical points:

    []txn.Op{{
        C:      "accounts",
        Id:     0,
        Update: M{"$inc": M{"balance": 1}},
    }, {
        C:      "accounts",
        Id:     1,
        Update: M{"$inc": M{"balance": 1}},
    }}

To reach the bug, the stress test also has to run half of the transactions in this order, and the other half with these same operations in the opposite order, so that dependency cycles are created between the transactions. Note that the txn package guarantees that operations are always executed in the order provided in the transaction.

The fix and the complete test are available in this change:

    https://github.com/go-mgo/mgo/commit/3bc3ddaa

The numbers there are lowered to run in a reasonable amount of time, but to give some confidence in the fix and the code in general, I've run this test with 100k transactions being concurrently executed, with no problems.

Also, to give a better perspective on the sort of outcome that the logic for concurrent runners produces, this output was generated by that test while running 100 transactions:

    http://paste.ubuntu.com/7906618/

The tokens like "a)" in these lines are the unique identifier for a given transaction runner. Note how every single operation is executed in precise lock-step despite the concurrency and the ordering issues, even assigning the same revision to both documents, since they were created together.
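[Editor's illustration] The two interleaved orderings can be sketched as below. txn.Op and M are mirrored as local types so the snippet stands alone rather than importing the mgo/txn package; the helper name incBoth is invented for illustration.

```go
package main

import "fmt"

// Local stand-ins for mgo/txn's types so the sketch is self-contained.
type M map[string]interface{}

type Op struct {
	C      string // collection
	Id     interface{}
	Update M
}

// incBoth returns the two balance increments in the given document order.
// Running half the transactions as incBoth(0, 1) and the other half as
// incBoth(1, 0) makes runners touch the documents in opposite orders,
// creating the dependency cycles the stress test needs to reach the bug.
func incBoth(first, second int) []Op {
	return []Op{{
		C:      "accounts",
		Id:     first,
		Update: M{"$inc": M{"balance": 1}},
	}, {
		C:      "accounts",
		Id:     second,
		Update: M{"$inc": M{"balance": 1}},
	}}
}

func main() {
	forward := incBoth(0, 1)
	reverse := incBoth(1, 0)
	// Opposite first documents, hence opposite dependency orders.
	fmt.Println(forward[0].Id, reverse[0].Id)
}
```

Because the txn package executes operations strictly in the order given, these two transaction shapes are what force the cycle between concurrent runners.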
Also, perhaps most interestingly, note occurrences such as:

    [LOG] 0:00.180 b) Applying 53d92a4bca654539e703_7791e1dc op 0 (update) on {accounts 0} with txn-revno 2: DONE
    [LOG] 0:00.186 d) Applying 53d92a4bca654539e703_7791e1dc op 1 (update) on {accounts 1} with txn-revno 2: DONE

Note that the first one is "b)" while the second one is "d)", which means there are two completely independent runners, in different goroutines (they might as well be on different machines), collaborating towards the completion of a single transaction.

So, I believe this is sorted. Please let me know how it goes there.

On Wed, Jul 30, 2014 at 4:14 AM, Gustavo Niemeyer wrote:
> Okay, I couldn't resist investigating a bit. I've been looking at the
> database dump from earlier today and it's smelling like a simpler bug
> in the txn package, and I might have found the cause already.
>
> Here is a quick walkthrough while debugging the problem, to also serve
> as future aid in similar quests.
>
> Enabling full debug for the txn package with SetDebug and SetLogger,
> and doing a ResumeAll to flush all pending transactions, we can
> quickly get to the affected document and transaction:
>
> 2014/07/30 02:19:23 Resuming all unfinished transactions
> 2014/07/30 02:19:23 Resuming 53d6057930009a01ba0002e7 from "prepared"
> 2014/07/30 02:19:23 a) Processing 53d6057930009a01ba0002e7_dcdbc894
> 2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002e7_dcdbc894
> 2014/07/30 02:19:23 a) Rescanned queue with
> 53d6057930009a01ba0002e7_dcdbc894: has prereqs, not forced
> 2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002eb_98124806
> 2014/07/30 02:19:23 a) Rescanned queue with
> 53d6057930009a01ba0002eb_98124806: has prereqs, not forced
> 2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002ee_a83bd775
> 2014/07/30 02:19:23 a) Rescanned document {services ntp} misses
> 53d6057930009a01ba0002ee_a83bd775 in queue:
> [53d6057930009a01ba0002eb_98124806 53d6057930009a01ba0002ea_4ca6ed41
> 53d6057c30009a01ba0002fd_4d8d9123 53d6057e30009a01ba000301_ba0b61dd
> 53d6057e30009a01ba000303_a26cb429]
> 2014/07/30 02:19:23 a) Reloaded 53d6057930009a01ba0002ee_a83bd775: "prepared"
> panic: rescanned document misses transaction in queue
>
> So this error actually means something slightly different from what I
> pointed out in the bug.
>
> The transaction runner state machine creates transactions in the
> "preparing" state, and then moves them over to "prepared" when all
> affected documents have been tagged with the transaction id+nonce.
> So what this means is that there is a transaction in progress in the
> prepared state, while the actual document misses the id in its local
> queue, which is an impossible situation unless the document was
> fiddled with, there was corruption, or there is a bug in the code.
>
> So, let's have a look at the affected documents. First, the document
> being changed:
>
> db.services.findOne({_id: "ntp"})
> http://paste.ubuntu.com/7902134/
>
> We can see a few transactions in the queue, but the one raising the
> issue is not there, as reported by the error.
>
> And this is the full transaction raised by the error:
>
> db.txns.findOne({_id: ObjectId("53d6057930009a01ba0002ee")})
> http://paste.ubuntu.com/7902095/
>
> One i
Re: backup API in 1.20
On Wed, Jul 30, 2014 at 12:25 AM, John Meinel wrote:
> So as discussed, I think we still want to support an HTTP GET based API for
> actually downloading the content to the client. And the CLI can do steps
> like:
> - RPC to request the creation of a backup (either this is synchronous and
>   returns a URL, or the next step is to wait on the next request for when
>   it is done to get the URL)
> - HTTP GET to download the backup .tar (.xz, .bz2, .gz, whatever)
> - put those bytes somewhere locally and then exit.

That's what I understood as well.

-eric
Re: help please: mongo/mgo panic
Okay, I couldn't resist investigating a bit. I've been looking at the database dump from earlier today and it's smelling like a simpler bug in the txn package, and I might have found the cause already.

Here is a quick walkthrough while debugging the problem, to also serve as future aid in similar quests.

Enabling full debug for the txn package with SetDebug and SetLogger, and doing a ResumeAll to flush all pending transactions, we can quickly get to the affected document and transaction:

    2014/07/30 02:19:23 Resuming all unfinished transactions
    2014/07/30 02:19:23 Resuming 53d6057930009a01ba0002e7 from "prepared"
    2014/07/30 02:19:23 a) Processing 53d6057930009a01ba0002e7_dcdbc894
    2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002e7_dcdbc894
    2014/07/30 02:19:23 a) Rescanned queue with 53d6057930009a01ba0002e7_dcdbc894: has prereqs, not forced
    2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002eb_98124806
    2014/07/30 02:19:23 a) Rescanned queue with 53d6057930009a01ba0002eb_98124806: has prereqs, not forced
    2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002ee_a83bd775
    2014/07/30 02:19:23 a) Rescanned document {services ntp} misses 53d6057930009a01ba0002ee_a83bd775 in queue: [53d6057930009a01ba0002eb_98124806 53d6057930009a01ba0002ea_4ca6ed41 53d6057c30009a01ba0002fd_4d8d9123 53d6057e30009a01ba000301_ba0b61dd 53d6057e30009a01ba000303_a26cb429]
    2014/07/30 02:19:23 a) Reloaded 53d6057930009a01ba0002ee_a83bd775: "prepared"
    panic: rescanned document misses transaction in queue

So this error actually means something slightly different from what I pointed out in the bug.

The transaction runner state machine creates transactions in the "preparing" state, and then moves them over to "prepared" when all affected documents have been tagged with the transaction id+nonce.
So what this means is that there is a transaction in progress in the prepared state, while the actual document misses the id in its local queue, which is an impossible situation unless the document was fiddled with, there was corruption, or there is a bug in the code.

So, let's have a look at the affected documents. First, the document being changed:

    db.services.findOne({_id: "ntp"})
    http://paste.ubuntu.com/7902134/

We can see a few transactions in the queue, but the one raising the issue is not there, as reported by the error.

And this is the full transaction raised by the error:

    db.txns.findOne({_id: ObjectId("53d6057930009a01ba0002ee")})
    http://paste.ubuntu.com/7902095/

One interesting thing we can do from here is verify that other documents which are part of this transaction are properly tagged, and indeed they are, so this one document has the issue alone in this transaction. Good hint.

Another detail worth checking is getting the set of all transactions involving the affected document since it was created, stripping operations from unrelated documents:

    db.txns.aggregate(
        {$match: {"o.c": "services", "o.d": "ntp"}},
        {$unwind: "$o"},
        {$match: {"o.c": "services", "o.d": "ntp"}})
    http://paste.ubuntu.com/7902097/

This has quite a few useful hints about the problem. The document was created quite a while ago, so that's most probably not related to the issue.
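[Editor's illustration] The shape of that aggregation (match, unwind, match) can be mirrored in plain Go to make the filtering explicit. The types here are simplified stand-ins for the txns collection documents, not the mgo driver's types.

```go
package main

import "fmt"

// Op mirrors one entry of a transaction's "o" array: the target
// collection ("o.c") and document id ("o.d").
type Op struct {
	C string
	D string
}

// Txn mirrors a txns document: an id plus its operations.
type Txn struct {
	Id  string
	Ops []Op
}

// opsTouching mirrors the $match/$unwind/$match pipeline: keep only
// transactions that touch the given document, and within each, strip
// the operations aimed at unrelated documents.
func opsTouching(txns []Txn, c, d string) []Txn {
	var out []Txn
	for _, t := range txns {
		var kept []Op
		for _, op := range t.Ops {
			if op.C == c && op.D == d {
				kept = append(kept, op)
			}
		}
		if len(kept) > 0 {
			out = append(out, Txn{Id: t.Id, Ops: kept})
		}
	}
	return out
}

func main() {
	txns := []Txn{
		{Id: "t1", Ops: []Op{{C: "services", D: "ntp"}, {C: "units", D: "ntp/0"}}},
		{Id: "t2", Ops: []Op{{C: "machines", D: "1"}}},
	}
	for _, t := range opsTouching(txns, "services", "ntp") {
		fmt.Println(t.Id, len(t.Ops))
	}
}
```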
The timing is also available, encoded inside the transaction id itself, and this is perhaps the best hint: around the missing transaction there were a number of transactions being applied within the same second, most of them pending, but one of them actually succeeded:

    53d6057930009a01ba0002e9 (prepared and missing)
    53d6057930009a01ba0002ea (prepared and in the queue)
    53d6057930009a01ba0002eb (prepared and in the queue)
    53d6057930009a01ba0002ec (applied)
    53d6057930009a01ba0002ed (prepared and missing)
    53d6057930009a01ba0002ee (prepared and missing)
    53d6057930009a01ba0002ef (prepared and missing)

Note how all of these are exactly the same except for the last bytes, which contain a simple incrementing counter. This means they were created concurrently, at about the same time, and from the same client machine.

So it's not one but several missing transactions on this one document, which had an applied transaction interleaved with other pending ones. This sounds very much like a problem in the logic that pops completed transactions from the queue.

Looking through the code in that area, I could indeed find some logic that comes from the previous version of the algorithm and apparently should not be there anymore. The fix is simple, and now that we know the scenario, writing a test case that exercises the race shouldn't be hard either. But it's quite late here and it'd be unwise not to get some sleep. More on this tomorrow.

On Wed, Jul 30, 2014 at 3:21 AM, John Meinel wrote:
> Could this be something where we are getting transactions faster than we can
> finalize them?
>
> John
> =:->
>
> On Wed, Jul 30, 2014 at 9:17 AM, Gustavo Niemeyer wrote:
>>
>> We've got a database dump yesterday, which giv