apiserver/testing.FakeAuthoriser is not a good mock

2014-07-30 Thread David Cheney
Hello,

This message serves as background to a series of changes coming soon.

The goal of this message, and the changes that follow, is to describe
the problems that I see with the FakeAuthoriser and to describe in
general terms the steps I propose to address those shortcomings.

Background:

The FakeAuthoriser implements the apiserver/common.Authoriser
interface and is used heavily in the apiserver tests. The other
implementation of this interface is the apiserver itself,
specifically, apiserver.srvRoot.

The goal of the FakeAuthoriser is to imitate apiserver.srvRoot.

I assert that this is currently not true and that the tests are
structured to expect the behaviour of the FakeAuthoriser rather than
the contract of apiserver/common.Authoriser as defined by
apiserver.srvRoot.

Issues:

This section describes the ways that the FakeAuthoriser diverges from
the behaviour of the srvRoot authoriser.

1. In srvRoot, all authorisation stems ultimately from the
state.Entity object referenced from the srvRoot instance. This is a
problem, as our authorisation system is based on the _tag_ passed over
the API, not the state.Entity, although for practical purposes the tag
of the authenticated entity and the entity itself are interchangeable.

Proposal: remove Authoriser.GetAuthEntity() and replace all calls with
Authoriser.GetAuthTag(). I have a branch for this; it's fewer than 30
lines, as there are only three places in the apiserver where we do
this, and they usually just call the .Tag method on the entity they
get back from GetAuthEntity.
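
To make the shape of the change concrete, here is a minimal sketch
(illustrative only; the interface is trimmed and the names follow this
mail rather than the exact juju source):

    package sketch

    // Authoriser is a hypothetical, trimmed version of the
    // apiserver/common.Authoriser interface discussed above.
    type Authoriser interface {
            GetAuthTag() string // kept
            // GetAuthEntity() state.Entity // proposed for removal
            AuthMachineAgent() bool
            AuthUnitAgent() bool
    }

    // ownerTag shows the typical call-site change: ask for the tag
    // directly instead of going through the entity.
    func ownerTag(auth Authoriser) string {
            // before: return auth.GetAuthEntity().Tag()
            return auth.GetAuthTag()
    }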

2. This extends from point 1: while srvRoot derives everything from
srvRoot.entity, the FakeAuthoriser allows the caller to pass unrelated
values for the tag and the entity of the authorisee. This leads to
impossible situations where the FakeAuthoriser returns nil for
GetAuthEntity but a non-nil value for GetAuthTag. Other permutations
exist in our tests and are expected by the test logic.

Proposal: once step 1 is fixed, the difference between srvRoot and
FakeAuthoriser is that the former starts from a state.Entity and
derives a tag, from which the other values are derived, while the
latter skips the initial step and starts directly from a tag. This
work falls out of the solutions to steps 1 and 3.

3. The helper methods AuthMachineAgent() and AuthUnitAgent() on the
Authoriser interface are implemented differently in srvRoot and
FakeAuthoriser. In tests it is quite common to take the FakeAuthoriser
from the test suite, copy it, and change some of these values, leading
to impossible situations, i.e., the tag or entity of the FakeAuthoriser
pointing to a Unit while AuthMachineAgent() returns true.

Proposal: the simplest solution is to copy the implementation of these
helper methods from srvRoot to FakeAuthoriser. A more involved
solution would be to factor these methods out into functions in the
apiserver/common package that take an Authoriser. This second step may
not pay for itself.
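
As a rough illustration of that second option, a minimal sketch
(hypothetical helpers, deciding from juju's textual tag prefixes
rather than whatever the real implementation does):

    package sketch

    import "strings"

    // Answer the Auth*Agent questions purely from the authenticated
    // tag, so that a fake built from a unit tag can never report
    // AuthMachineAgent() == true.
    func authMachineAgent(tag string) bool {
            return strings.HasPrefix(tag, "machine-")
    }

    func authUnitAgent(tag string) bool {
            return strings.HasPrefix(tag, "unit-")
    }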

These steps resolve most of the discontinuity between the two
implementations and will clear a set of blocking issues I've hit while
converting the apiserver and state packages to speak tags natively.

Thanks for your time

Dave



Re: Possible landing bot issues

2014-07-30 Thread David Cheney
I managed to sneak one in an hour ago; is the fix live?

On Thu, Jul 31, 2014 at 10:18 AM, Ian Booth  wrote:
>>
>> I think there are two problems with the current landing bot.
>>
>
> Thanks for raising the issues.
>
>
>> 1. the bot doesn't appear to NACK changes that are not gofmt'd, which
>> means changes land that others cannot fix because their .gitconfig
>> prevents them.
>>
>
> Still on the todo list, sorry. Will be fixed asap.
>
>> 2. the bot doesn't appear to be enforcing its lockdown rules, as the
>> comment it places on the PR contains the magic string that the bot
>> looks for when it sees another $$ instruction.
>>
>
> Curtis has already fixed this plus landed some other improvements (see other
> emails).
>



Re: Possible landing bot issues

2014-07-30 Thread Ian Booth
> 
> I think there are two problems with the current landing bot.
>

Thanks for raising the issues.


> 1. the bot doesn't appear to NACK changes that are not gofmt'd, which
> means changes land that others cannot fix because their .gitconfig
> prevents them.
> 

Still on the todo list, sorry. Will be fixed asap.

> 2. the bot doesn't appear to be enforcing its lockdown rules, as the
> comment it places on the PR contains the magic string that the bot
> looks for when it sees another $$ instruction.
> 

Curtis has already fixed this plus landed some other improvements (see other
emails).



Possible landing bot issues

2014-07-30 Thread David Cheney
Hello,

I think there are two problems with the current landing bot.

1. the bot doesn't appear to NACK changes that are not gofmt'd, which
means changes land that others cannot fix because their .gitconfig
prevents them.

2. the bot doesn't appear to be enforcing its lockdown rules, as the
comment it places on the PR contains the magic string that the bot
looks for when it sees another $$ instruction.

Thank you for your time.

Dave



Re: CI regressions will block merges

2014-07-30 Thread Curtis Hovey-Canonical
mgz advised me to revise the syntax to avoid confusing the Juju bot

> Adding $$fixes-$$ as a comment to the PR will signal it
> is allowed to test. The token can be in any comment and in any part

When merge testing into stable or devel is blocked by a critical
regression, you can signal that your branch fixes that regression with
this syntax:
__fixes-__

Juju bot is currently looking for these tokens in a comment to test
merges into master.
__fixes-1347715__
__fixes-1342725__

Merges into 1.20 are unblocked

I also added a token to force a test merge. There are cases where it
is critical to merge even though the branch does not fix a critical
regression; maybe you need to merge into both 1.20 and master. When we
judge that the branch needs to merge, add this token to a comment:
__JFDI__

-- 
Curtis Hovey
Canonical Cloud Development and Operations
http://launchpad.net/~sinzui



Re: help please: mongo/mgo panic

2014-07-30 Thread Gustavo Niemeyer
Alright, the guess last night was correct, and the candidate fix as
well. I've managed to reproduce the problem by stressing the scenario
described, with 4 concurrent runners running the following two
operations while the chaos mechanism injects random slowdowns at
various critical points:

[]txn.Op{{
    C:      "accounts",
    Id:     0,
    Update: M{"$inc": M{"balance": 1}},
}, {
    C:      "accounts",
    Id:     1,
    Update: M{"$inc": M{"balance": 1}},
}}

To reach the bug, the stress test also has to run half of the
transactions in this order, and the other half with these same
operations in the opposite order, so that dependency cycles are
created between the transactions. Note that the txn package guarantees
that operations are always executed in the order provided in the
transaction.
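
For anyone wanting to reproduce something similar, here is a minimal,
self-contained sketch of that setup (illustrative only: the dial
address, database and collection names, the M alias and the session
wiring are assumptions rather than the actual stress-test code, and it
omits the chaos injection):

    package main

    import (
            "gopkg.in/mgo.v2"
            "gopkg.in/mgo.v2/txn"
    )

    type M map[string]interface{}

    // ops returns the two update operations, optionally reversed, so
    // that concurrent transactions can form dependency cycles.
    func ops(reverse bool) []txn.Op {
            ops := []txn.Op{{
                    C:      "accounts",
                    Id:     0,
                    Update: M{"$inc": M{"balance": 1}},
            }, {
                    C:      "accounts",
                    Id:     1,
                    Update: M{"$inc": M{"balance": 1}},
            }}
            if reverse {
                    ops[0], ops[1] = ops[1], ops[0]
            }
            return ops
    }

    func main() {
            session, err := mgo.Dial("localhost")
            if err != nil {
                    panic(err)
            }
            defer session.Close()

            // Create the two account documents (assumes a fresh database).
            runner := txn.NewRunner(session.DB("test").C("txns"))
            err = runner.Run([]txn.Op{
                    {C: "accounts", Id: 0, Insert: M{"balance": 0}},
                    {C: "accounts", Id: 1, Insert: M{"balance": 0}},
            }, "", nil)
            if err != nil {
                    panic(err)
            }

            // Four concurrent runners, half applying the ops in one
            // order and half in the opposite order.
            done := make(chan error)
            for i := 0; i < 4; i++ {
                    go func(i int) {
                            s := session.Copy()
                            defer s.Close()
                            r := txn.NewRunner(s.DB("test").C("txns"))
                            done <- r.Run(ops(i%2 == 1), "", nil)
                    }(i)
            }
            for i := 0; i < 4; i++ {
                    if err := <-done; err != nil {
                            panic(err)
                    }
            }
    }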

The fix and the complete test are available in this change:

https://github.com/go-mgo/mgo/commit/3bc3ddaa

The numbers there are lower so it runs in a reasonable amount of time,
but to give some confidence in the fix and the code in general, I've
run this test with 100k transactions being executed concurrently, with
no problems.

Also, to give a better perspective of the sort of outcome that the
logic for concurrent runners produces, this output was generated by
that test while running 100 transactions:

http://paste.ubuntu.com/7906618/

Tokens like "a)" in these lines uniquely identify a given transaction
runner. Note how every single operation is executed
in precise lock-step despite the concurrency and the ordering issues,
even assigning the same revision to both documents since they were
created together.

Also, perhaps most interestingly, note occurrences such as:

[LOG] 0:00.180 b) Applying 53d92a4bca654539e703_7791e1dc op 0
(update) on {accounts 0} with txn-revno 2: DONE
[LOG] 0:00.186 d) Applying 53d92a4bca654539e703_7791e1dc op 1
(update) on {accounts 1} with txn-revno 2: DONE

Note the first one is "b)" while the second one is "d)", which means
there are two completely independent runners, in different goroutines
(might as well be different machines), collaborating towards the
completion of a single transaction.

So, I believe this is sorted. Please let me know how it goes there.




Re: backup API in 1.20

2014-07-30 Thread Eric Snow
On Wed, Jul 30, 2014 at 12:25 AM, John Meinel  wrote:
> So as discussed, I think we still want to support an HTTP GET based API for
> actually downloading the content to the client. And the CLI can do steps
> like:
>  RPC request to create a backup (either this is synchronous and returns
> a URL, or the next step is to wait on the next request for when it is done
> to get the URL)
>  HTTP GET to download the backup .tar (.xz, .bz2, .gz, whatever)
>  put those bytes somewhere locally and then exit.

That's what I understood as well.

-eric



Re: help please: mongo/mgo panic

2014-07-30 Thread Gustavo Niemeyer
Okay, I couldn't resist investigating a bit. I've been looking at the
database dump from earlier today and it's smelling like a simpler bug
in the txn package, and I might have found the cause already.

Here is a quick walkthrough while debugging the problem, to also serve
as future aid in similar quests.

Enabling full debug for the txn package with SetDebug and SetLogger,
and doing a ResumeAll to flush all pending transactions, we can
quickly get to the affected document and transaction:

2014/07/30 02:19:23 Resuming all unfinished transactions
2014/07/30 02:19:23 Resuming 53d6057930009a01ba0002e7 from "prepared"
2014/07/30 02:19:23 a) Processing 53d6057930009a01ba0002e7_dcdbc894
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002e7_dcdbc894
2014/07/30 02:19:23 a) Rescanned queue with
53d6057930009a01ba0002e7_dcdbc894: has prereqs, not forced
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002eb_98124806
2014/07/30 02:19:23 a) Rescanned queue with
53d6057930009a01ba0002eb_98124806: has prereqs, not forced
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002ee_a83bd775
2014/07/30 02:19:23 a) Rescanned document {services ntp} misses
53d6057930009a01ba0002ee_a83bd775 in queue:
[53d6057930009a01ba0002eb_98124806 53d6057930009a01ba0002ea_4ca6ed41
53d6057c30009a01ba0002fd_4d8d9123 53d6057e30009a01ba000301_ba0b61dd
53d6057e30009a01ba000303_a26cb429]
2014/07/30 02:19:23 a) Reloaded 53d6057930009a01ba0002ee_a83bd775: "prepared"
panic: rescanned document misses transaction in queue
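
(For reference, turning on that debug output and resuming the pending
transactions takes only a couple of calls. A hedged sketch follows;
the dial address and the database/collection names are assumptions,
not the real juju values:)

    package main

    import (
            "log"
            "os"

            "gopkg.in/mgo.v2"
            "gopkg.in/mgo.v2/txn"
    )

    func main() {
            // Full debug logging for the txn package.
            txn.SetDebug(true)
            txn.SetLogger(log.New(os.Stderr, "", log.LstdFlags))

            session, err := mgo.Dial("localhost")
            if err != nil {
                    log.Fatal(err)
            }
            defer session.Close()

            // Flush all pending transactions.
            runner := txn.NewRunner(session.DB("juju").C("txns"))
            if err := runner.ResumeAll(); err != nil {
                    log.Fatal(err)
            }
    }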

So this error actually means something slightly different from what I
pointed out in the bug.

The transaction runner state machine creates transactions in the
"preparing" state and then moves them over to "prepared" once all
affected documents have been tagged with the transaction id+nonce. So
what this means is that there is a transaction in progress in the
prepared state while the actual document is missing the id in its
local queue, which is an impossible situation unless the document was
fiddled with, there was corruption, or there is a bug in the code.

So, let's have a look at the affected documents. First, the document
being changed:

db.services.findOne({_id: "ntp"})
http://paste.ubuntu.com/7902134/

We can see a few transactions in the queue, but the one raising the
issue is not there as reported by the error.

And this is the full transaction referenced by the error:

db.txns.findOne({_id: ObjectId("53d6057930009a01ba0002ee")})
http://paste.ubuntu.com/7902095/

One interesting thing we can do from here is verify that the other
documents which are part of this transaction are properly tagged, and
indeed they are, so this one document alone has the issue in this
transaction. Good hint.

Another detail worth checking is getting the set of all transactions
involving the affected document since it was created, stripping
operations from unrelated documents:

db.txns.aggregate(
    {$match: {"o.c": "services", "o.d": "ntp"}},
    {$unwind: "$o"},
    {$match: {"o.c": "services", "o.d": "ntp"}})
http://paste.ubuntu.com/7902097/

This has quite a few useful hints about the problem. The document was
created quite a while ago, so that's most probably not related to the
issue. The timing is also available, encoded inside the transaction id
itself, and this is perhaps the best hint: around the missing
transaction there were a number of transactions being applied within
the same second, most of them pending, but one of them actually
succeeded:

53d6057930009a01ba0002e9 (prepared and missing)
53d6057930009a01ba0002ea (prepared and in the queue)
53d6057930009a01ba0002eb (prepared and in the queue)
53d6057930009a01ba0002ec (applied)
53d6057930009a01ba0002ed (prepared and missing)
53d6057930009a01ba0002ee (prepared and missing)
53d6057930009a01ba0002ef (prepared and missing)

Note how all of these are exactly the same except for the last bytes,
which contain a simple incrementing counter. This means they were
created concurrently, at about the same time, and from the same client
machine. So it's not one but several missing transactions on this one
document, which had an applied transaction interleaved with other
pending ones. This sounds very much like a problem in the logic that
pops completed transactions from the queue.
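
(As an aside, that timing and counter information can be read straight
off the ids with the bson package; a small illustrative sketch, using
one of the ids from the list above:)

    package main

    import (
            "fmt"

            "gopkg.in/mgo.v2/bson"
    )

    func main() {
            // The part before the "_nonce" suffix of a txn id is a
            // plain ObjectId, so its creation time and per-process
            // counter are recoverable.
            id := bson.ObjectIdHex("53d6057930009a01ba0002ee")
            fmt.Println(id.Time())    // embedded creation timestamp
            fmt.Println(id.Counter()) // incrementing counter
    }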

Looking through the code in that area, I could indeed find some logic
that comes from the previous version of the algorithm and apparently
should not be there anymore. The fix is simple, and now that we know
the scenario, writing a test case that exercises the race shouldn't be
hard either.

But it's quite late here and it'd be unwise to not get some sleep.
More on this tomorrow.


On Wed, Jul 30, 2014 at 3:21 AM, John Meinel  wrote:
> Could this be something where we are getting transactions faster than we can
> finalize them?
>
> John
> =:->
>
>
> On Wed, Jul 30, 2014 at 9:17 AM, Gustavo Niemeyer 
> wrote:
>>
>> We've got a database dump yesterday, which giv