[openstack-dev] [nova] Would an api option to create an instance without powering on be useful?

2018-11-30 Thread Matthew Booth
I have a request to do $SUBJECT in relation to a V2V workflow. The use
case here is conversion of a VM/Physical which was previously powered
off. We want to move its data, but we don't want to be powering on
stuff which wasn't previously on.

This would involve an api change, and a hopefully very small change in
drivers to support it. Technically I don't see it as an issue.

However, is it a change we'd be willing to accept? Is there any good
reason not to do this? Are there any less esoteric workflows which
might use this feature?
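
For the avoidance of doubt, the shape I have in mind is nothing more
than an extra boolean on server create, something along these lines
(the field name is purely illustrative and it would obviously need a
microversion):

  POST /servers
  {
      "server": {
          "name": "imported-v2v-guest",
          "imageRef": "<image uuid>",
          "flavorRef": "<flavor id>",
          "power_on": false
      }
  }

The driver would then do everything it does today up to, but not
including, powering on the guest.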

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [nova][cinder] Disabling nova volume-update (aka swap volume; aka cinder live migration)

2018-08-22 Thread Matthew Booth
On Wed, 22 Aug 2018 at 10:47, Gorka Eguileor  wrote:
>
> On 20/08, Matthew Booth wrote:
> > For those who aren't familiar with it, nova's volume-update (also
> > called swap volume by nova devs) is the nova part of the
> > implementation of cinder's live migration (also called retype).
> > Volume-update is essentially an internal cinder<->nova api, but as
> > that's not a thing it's also unfortunately exposed to users. Some
> > users have found it and are using it, but because it's essentially an
> > internal cinder<->nova api it breaks pretty easily if you don't treat
> > it like a special snowflake. It looks like we've finally found a way
> > it's broken for non-cinder callers that we can't fix, even with a
> > dirty hack.
> >
> > volume-update essentially does a live copy of the data on the old
> > volume to the new volume, then seamlessly swaps the attachment on the
> > instance from the old volume to the new one. The guest OS on the
> > instance will not notice anything at all as the hypervisor swaps the
> > storage backing an attached volume underneath it.
> >
> > When called by cinder, as intended, cinder does some post-operation
> > cleanup such that the old volume is deleted and the new volume
> > inherits the same volume_id; that is, the new volume effectively
> > becomes the old one. When called any
> > other way, however, this cleanup doesn't happen, which breaks a bunch
> > of assumptions. One of these is that a disk's serial number is the
> > same as the attached volume_id. Disk serial number, in KVM at least,
> > is immutable, so can't be updated during volume-update. This is fine
> > if we were called via cinder, because the cinder cleanup means the
> > volume_id stays the same. If called any other way, however, they no
> > longer match, at least until a hard reboot when it will be reset to
> > the new volume_id. It turns out this breaks live migration, but
> > probably other things too. We can't think of a workaround.
> >
> > I wondered why users would want to do this anyway. It turns out that
> > sometimes cinder won't let you migrate a volume, but nova
> > volume-update doesn't do those checks (as they're specific to cinder
> > internals, none of nova's business, and duplicating them would be
> > fragile, so we're not adding them!). Specifically we know that cinder
> > won't let you migrate a volume with snapshots. There may be other
> > reasons. If cinder won't let you migrate your volume, you can still
> > move your data by using nova's volume-update, even though you'll end
> > up with a new volume on the destination, and a slightly broken
> > instance. Apparently the former is a trade-off worth making, but the
> > latter has been reported as a bug.
> >
>
> Hi Matt,
>
> As you know, I'm in favor of making this REST API call only authorized
> for Cinder to avoid messing the cloud.
>
> I know you wanted Cinder to have a solution to do live migrations of
> volumes with snapshots, and while this is not possible to do in a
> reasonable fashion, I kept thinking about it given your strong feelings
> to provide a solution for users that really need this, and I think we
> may have a "reasonable" compromise.
>
> The solution is conceptually simple.  We add a new API microversion in
> Cinder that adds an optional parameter called "generic_keep_source"
> (defaults to False) to both migrate and retype operations.
>
> This means that if the driver optimized migration cannot do the
> migration and the generic migration code is the one doing the migration,
> then, instead of our final step being to swap the volume id's and
> deleting the source volume, what we would do is to swap the volume id's
> and move all the snapshots to reference the new volume.  Then we would
> create a user message with the new ID of the volume.
>
> This way we can preserve the old volume with all its snapshots and do
> the live migration.
>
> The implementation is a little bit tricky, as we'll have to add a new
> "update_migrated_volume" mechanism to support the renaming of both
> volumes, since the old one wouldn't work with this among other things,
> but it's doable.
>
> Unfortunately I don't have the time right now to work on this...

Sounds promising, and honestly more than I'd have hoped for.
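
If I've followed correctly, from the caller's side that would just look
like the existing retype action with one extra flag, something like the
following (the parameter name is yours from above; the rest of the
request shape is today's os-retype action, and the exact microversioned
format would of course be up to Cinder):

  POST /v3/{project_id}/volumes/{volume_id}/action
  {
      "os-retype": {
          "new_type": "some-other-type",
          "migration_policy": "on-demand",
          "generic_keep_source": true
      }
  }

with the user message afterwards carrying the new volume ID, as you
describe.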

Matt

-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][cinder] Disabling nova volume-update (aka swap volume; aka cinder live migration)

2018-08-20 Thread Matthew Booth
For those who aren't familiar with it, nova's volume-update (also
called swap volume by nova devs) is the nova part of the
implementation of cinder's live migration (also called retype).
Volume-update is essentially an internal cinder<->nova api, but as
that's not a thing it's also unfortunately exposed to users. Some
users have found it and are using it, but because it's essentially an
internal cinder<->nova api it breaks pretty easily if you don't treat
it like a special snowflake. It looks like we've finally found a way
it's broken for non-cinder callers that we can't fix, even with a
dirty hack.

volume-update essentially does a live copy of the data on the old
volume to the new volume, then seamlessly swaps the attachment on the
instance from the old volume to the new one. The guest OS on the
instance will not notice anything at all as the hypervisor swaps the
storage backing an attached volume underneath it.

When called by cinder, as intended, cinder does some post-operation
cleanup such that the old volume is deleted and the new volume inherits
the same volume_id; that is, the new volume effectively becomes the old
one. When called any
other way, however, this cleanup doesn't happen, which breaks a bunch
of assumptions. One of these is that a disk's serial number is the
same as the attached volume_id. Disk serial number, in KVM at least,
is immutable, so can't be updated during volume-update. This is fine
if we were called via cinder, because the cinder cleanup means the
volume_id stays the same. If called any other way, however, they no
longer match, at least until a hard reboot when it will be reset to
the new volume_id. It turns out this breaks live migration, but
probably other things too. We can't think of a workaround.
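
To illustrate what ends up mismatched, the disk element we generate for
an attached volume looks roughly like this (values invented):

  <disk type='network' device='disk'>
    <source protocol='rbd' name='volumes/volume-NEW-VOLUME-ID'/>
    <target dev='vdb' bus='virtio'/>
    <serial>OLD-VOLUME-ID</serial>
  </disk>

After a non-cinder volume-update the source has been pivoted to the new
volume, but the serial still carries the old volume_id, and we have no
way to change it while the guest is running.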

I wondered why users would want to do this anyway. It turns out that
sometimes cinder won't let you migrate a volume, but nova
volume-update doesn't do those checks (as they're specific to cinder
internals, none of nova's business, and duplicating them would be
fragile, so we're not adding them!). Specifically we know that cinder
won't let you migrate a volume with snapshots. There may be other
reasons. If cinder won't let you migrate your volume, you can still
move your data by using nova's volume-update, even though you'll end
up with a new volume on the destination, and a slightly broken
instance. Apparently the former is a trade-off worth making, but the
latter has been reported as a bug.

I'd like to make it very clear that nova's volume-update isn't
expected to work correctly except when called by cinder. Specifically
there was a proposal that we disable volume-update from non-cinder
callers in some way, possibly by asserting volume state that can only
be set by cinder. However, I'm also very aware that users are calling
volume-update because it fills a need, and we don't want to trap data
that wasn't previously trapped.

Firstly, is anybody aware of any other reasons to use nova's
volume-update directly?

Secondly, is there any reason why we shouldn't just document that you
have to delete snapshots before doing a volume migration? Hopefully
some cinder folks or operators can chime in to let me know how to back
them up or somehow make them independent before doing this, at which
point the volume itself should be migratable?

If we can establish that there's an acceptable alternative to calling
volume-update directly for all use-cases we're aware of, I'm going to
propose heading off this class of bug by disabling it for non-cinder
callers.
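
To be concrete about "asserting volume state that can only be set by
cinder", the guard I have in mind is roughly the following (a sketch
only; the names and the exact state checked are illustrative, not a
worked patch):

  from webob import exc

  def _reject_non_cinder_swap(volume_api, context, old_volume_id):
      # Cinder sets migration_status on the source volume while it is
      # driving a migration/retype; a direct user call to volume-update
      # won't have done that, so refuse it.
      old_volume = volume_api.get(context, old_volume_id)
      if old_volume.get('migration_status') != 'migrating':
          raise exc.HTTPBadRequest(explanation=(
              "volume-update is only supported as part of a "
              "cinder-initiated volume migration or retype"))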

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?

2018-08-13 Thread Matthew Booth
On Mon, 13 Aug 2018 at 16:56, Chris Friesen  wrote:
>
> On 08/13/2018 08:26 AM, Jay Pipes wrote:
> > On 08/13/2018 10:10 AM, Matthew Booth wrote:
>
> >> I suspect I've misunderstood, but I was arguing this is an anti-goal.
> >> There's no reason to do this if the db is working correctly, and it
> >> would violate the principle of least surprise in dbs with legacy
> >> datasets (being all current dbs). These values have always been mixed
> >> case, let's just leave them be and fix the db.
> >
> > Do you want case-insensitive keys or do you not want case-insensitive keys?
> >
> > It seems to me that people complain that MySQL is case-insensitive by 
> > default
> > but actually *like* the concept that a metadata key of "abc" should be 
> > "equal
> > to" a metadata key of "ABC".
>
> How do we behave on PostgreSQL?  (I realize it's unsupported, but it still has
> users.)  It's case-sensitive by default, do we override that?
>
> Personally, I've worked on case-sensitive systems long enough that I'd 
> actually
> be surprised if "abc" matched "ABC". :)

To the best of my knowledge, the hypothetical PostgreSQL db works
exactly how you, me, and pretty much any developer would expect :)
Honestly, though, SQLite is probably more interesting as it's at least
used for testing. SQLite's default collation is binary, which is
obviously case sensitive as you'd expect.

As a developer I'm heavily biased in favour of implementing the
simplest fix with the simplest and most obvious behaviour, which is to
change the default collation to do what everybody expected it to do in
the first place (which is what Rajesh's patch does). As Jay points
out, though, I do concede that those pesky users may be impacted by
fixing this bug if they've come to rely on accidental buggy behaviour.

The question really comes down to how we can determine what the user
impact is for each solution. And here I'm talking about all the
various forms of metadata, assuming that whatever solution we picked
we'd apply to all. So:

- What API calls allow a user to query a thing by metadata key?

I believe these API calls would be the only things affected by fixing
the collation of metadata keys. If we know what they are we can ask
what the impact of changing the behaviour would be. Setting metadata
keys isn't subject to regression, as this was previously broken.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?

2018-08-13 Thread Matthew Booth
On Mon, 13 Aug 2018 at 15:27, Jay Pipes  wrote:
> Do you want case-insensitive keys or do you not want case-insensitive keys?
>
> It seems to me that people complain that MySQL is case-insensitive by
> default but actually *like* the concept that a metadata key of "abc"
> should be "equal to" a metadata key of "ABC".
>
> In other words, it seems to me that users actually expect that:
>
>  > nova aggregate-create agg1
>  > nova aggregate-set-metadata agg1 abc=1
>  > nova aggregate-set-metadata agg1 ABC=2
>
> should result in the original "abc" metadata item getting its value set
> to "2".

Incidentally, this particular example won't work today: it will just
throw an error. I believe the same would apply to user metadata on an
instance. IOW this particular example doesn't regress if you fix the
bug. The regression would be anything user-facing which queries by
metadata key. What does that?

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?

2018-08-13 Thread Matthew Booth
On Mon, 13 Aug 2018 at 15:27, Jay Pipes  wrote:
>
> On 08/13/2018 10:10 AM, Matthew Booth wrote:
> > On Mon, 13 Aug 2018 at 14:05, Jay Pipes  wrote:
> >>
> >> On 08/13/2018 06:06 AM, Matthew Booth wrote:
> >>> Thanks mriedem for answering my previous question, and also pointing
> >>> out the related previous spec around just forcing all metadata to be
> >>> lowercase:
> >>>
> >>> (Spec: approved in Newton) https://review.openstack.org/#/c/311529/
> >>> (Online migration: not merged, abandoned)
> >>> https://review.openstack.org/#/c/329737/
> >>>
> >>> There are other code patches, but the above is representative. What I
> >>> had read was the original bug:
> >>>
> >>> https://bugs.launchpad.net/nova/+bug/1538011
> >>>
> >>> The tl;dr is that the default collation used by MySQL results in a bug
> >>> when creating 2 metadata keys which differ only in case. The proposal
> >>> was obviously to simply make all metadata keys lower case. However, as
> >>> melwitt pointed out in the bug at the time that's a potentially user
> >>> hostile change. After some lost IRC discussion it seems that folks
> >>> believed at the time that to fix this properly would seriously
> >>> compromise the performance of these queries. The agreed way forward
> >>> was to allow existing keys to keep their case, but force new keys to
> >>> be lower case (so I wonder how the above online migration came
> >>> about?).
> >>>
> >>> Anyway, as Rajesh's patch shows, it's actually very easy just to fix
> >>> the MySQL misconfiguration:
> >>>
> >>> https://review.openstack.org/#/c/504885/
> >>>
> >>> So my question is, given that the previous series remains potentially
> >>> user hostile, the fix isn't as complex as previously believed, and it
> >>> doesn't involve a performance penalty, are there any other reasons why
> >>> we might want to resurrect it rather than just go with Rajesh's patch?
> >>> Or should we ask Rajesh to expand his patch into a series covering
> >>> other metadata?
> >>
> >> Keep in mind this patch is only related to *aggregate* metadata, AFAICT.
> >
> > Right, but the original bug pointed out that the same problem applies
> > equally to a bunch of different metadata stores. I haven't verified,
> > but the provenance was good ;) There would have to be other patches
> > for the other metadata stores.
>
> Yes, it is quite unfortunate that OpenStack has about 15 different ways
> of storing metadata key/value information.
>
> >>
> >> Any patch series that tries to "fix" this issue needs to include all of
> >> the following:
> >>
> >> * input automatically lower-cased [1]
> >> * inline (note: not online, inline) data migration inside the
> >> InstanceMeta object's _from_db_object() method for existing
> >> non-lowercased keys
> >
> > I suspect I've misunderstood, but I was arguing this is an anti-goal.
> > There's no reason to do this if the db is working correctly, and it
> > would violate the principle of least surprise in dbs with legacy
> > datasets (being all current dbs). These values have always been mixed
> > case, let's just leave them be and fix the db.
>
> Do you want case-insensitive keys or do you not want case-insensitive keys?
>
> It seems to me that people complain that MySQL is case-insensitive by
> default but actually *like* the concept that a metadata key of "abc"
> should be "equal to" a metadata key of "ABC".
>
> In other words, it seems to me that users actually expect that:
>
>  > nova aggregate-create agg1
>  > nova aggregate-set-metadata agg1 abc=1
>  > nova aggregate-set-metadata agg1 ABC=2
>
> should result in the original "abc" metadata item getting its value set
> to "2".
>
> If that isn't the case -- and I have a very different impression of what
> users *actually* expect from the CLI/UI -- then let me know.

I don't know what users want, tbh, I was simply coming from the POV of
not breaking the current behaviour. Although I think you're pointing
out that either solution breaks the current behaviour:

1. You lower case everything. This breaks users who query user
metadata and don't expect keys to be modified.

2. You fix the case sensitivity. This breaks users who add 'Foo' and
now expect to query 'foo'.

You're saying that although (2) is an artifact of a bug, there could
equally be people relying on it.

Eurgh. Yeah, that sucks.

Objectively though, I think I still like Rajesh's patch better because:

* It's vastly simpler to implement correctly and verifiably, and
therefore also less prone to future bugs.
* It's how it was originally intended to work.
* It's simpler to document.

Of these, the first is by far the most persuasive.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?

2018-08-13 Thread Matthew Booth
On Mon, 13 Aug 2018 at 14:05, Jay Pipes  wrote:
>
> On 08/13/2018 06:06 AM, Matthew Booth wrote:
> > Thanks mriedem for answering my previous question, and also pointing
> > out the related previous spec around just forcing all metadata to be
> > lowercase:
> >
> > (Spec: approved in Newton) https://review.openstack.org/#/c/311529/
> > (Online migration: not merged, abandoned)
> > https://review.openstack.org/#/c/329737/
> >
> > There are other code patches, but the above is representative. What I
> > had read was the original bug:
> >
> > https://bugs.launchpad.net/nova/+bug/1538011
> >
> > The tl;dr is that the default collation used by MySQL results in a bug
> > when creating 2 metadata keys which differ only in case. The proposal
> > was obviously to simply make all metadata keys lower case. However, as
> > melwitt pointed out in the bug at the time that's a potentially user
> > hostile change. After some lost IRC discussion it seems that folks
> > believed at the time that to fix this properly would seriously
> > compromise the performance of these queries. The agreed way forward
> > was to allow existing keys to keep their case, but force new keys to
> > be lower case (so I wonder how the above online migration came
> > about?).
> >
> > Anyway, as Rajesh's patch shows, it's actually very easy just to fix
> > the MySQL misconfiguration:
> >
> > https://review.openstack.org/#/c/504885/
> >
> > So my question is, given that the previous series remains potentially
> > user hostile, the fix isn't as complex as previously believed, and it
> > doesn't involve a performance penalty, are there any other reasons why
> > we might want to resurrect it rather than just go with Rajesh's patch?
> > Or should we ask Rajesh to expand his patch into a series covering
> > other metadata?
>
> Keep in mind this patch is only related to *aggregate* metadata, AFAICT.

Right, but the original bug pointed out that the same problem applies
equally to a bunch of different metadata stores. I haven't verified,
but the provenance was good ;) There would have to be other patches
for the other metadata stores.

>
> Any patch series that tries to "fix" this issue needs to include all of
> the following:
>
> * input automatically lower-cased [1]
> * inline (note: not online, inline) data migration inside the
> InstanceMeta object's _from_db_object() method for existing
> non-lowercased keys

I suspect I've misunderstood, but I was arguing this is an anti-goal.
There's no reason to do this if the db is working correctly, and it
would violate the principle of least surprise in dbs with legacy
datasets (being all current dbs). These values have always been mixed
case, let's just leave them be and fix the db.

> * change the collation of the aggregate_metadata.key column (note: this
> will require an entire rebuild of the table, since this column is part
> of a unique constraint [3]

Rajesh's patch changes the collation of the table, which I would
assume applies to its columns? I assume this is going to be a
moderately expensive, but one-off, operation similar in cost to adding
a new unique constraint.
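
Conceptually I'm imagining the schema change boils down to something
like this (a sketch, not the actual content of the patch):

  ALTER TABLE aggregate_metadata
      CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

i.e. a one-off rewrite of the table, after which the unique constraint
compares keys case sensitively.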

> * online data migration for migrating non-lowercased keys to their
> lowercased counterpars (essentially doing `UPDATE key = LOWER(key) WHERE
> LOWER(key) != key` once the collation has been changed)

> None of the above touches the API layer. I suppose some might argue that
> the REST API should be microversion-bumped since the expected behaviour
> of the API will change (data will be transparently changed in one
> version of the API and not another). I don't personally think that's
> something I would require a microversion for, but who knows what others
> may say.

Again, I was considering this an anti-goal. As I understand,
Rajesh's patch removes the requirement to make this api change. What
did I miss?

Thanks,

Matt

>
> Best,
> -jay
>
> [1]
> https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L295
> and
> https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L331
> and
> https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L356
>
>
> [2]
> https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L248
>
> [3]
> https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/db/sqlalchemy/api_models.py#L64
>
> __
> OpenStack Development Mailing List (not for usage questions)

[openstack-dev] [nova] Do we still want to lowercase metadata keys?

2018-08-13 Thread Matthew Booth
Thanks mriedem for answering my previous question, and also pointing
out the related previous spec around just forcing all metadata to be
lowercase:

(Spec: approved in Newton) https://review.openstack.org/#/c/311529/
(Online migration: not merged, abandoned)
https://review.openstack.org/#/c/329737/

There are other code patches, but the above is representative. What I
had read was the original bug:

https://bugs.launchpad.net/nova/+bug/1538011

The tl;dr is that the default collation used by MySQL results in a bug
when creating 2 metadata keys which differ only in case. The proposal
was obviously to simply make all metadata keys lower case. However, as
melwitt pointed out in the bug at the time that's a potentially user
hostile change. After some lost IRC discussion it seems that folks
believed at the time that to fix this properly would seriously
compromise the performance of these queries. The agreed way forward
was to allow existing keys to keep their case, but force new keys to
be lower case (so I wonder how the above online migration came
about?).
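
For anyone who hasn't hit it first hand, the failure is trivial to
reproduce outside nova (schema simplified for illustration):

  CREATE TABLE metadata (
      id INT PRIMARY KEY AUTO_INCREMENT,
      `key` VARCHAR(255),
      value VARCHAR(255),
      UNIQUE KEY uniq_key (`key`)
  ) CHARACTER SET utf8 COLLATE utf8_general_ci;

  INSERT INTO metadata (`key`, value) VALUES ('abc', '1');  -- ok
  INSERT INTO metadata (`key`, value) VALUES ('ABC', '2');
  -- ERROR 1062 (23000): Duplicate entry 'ABC' for key 'uniq_key'

Under the default case-insensitive collation the second key is treated
as a duplicate of the first, which is the bug above.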

Anyway, as Rajesh's patch shows, it's actually very easy just to fix
the MySQL misconfiguration:

https://review.openstack.org/#/c/504885/

So my question is, given that the previous series remains potentially
user hostile, the fix isn't as complex as previously believed, and it
doesn't involve a performance penalty, are there any other reasons why
we might want to resurrect it rather than just go with Rajesh's patch?
Or should we ask Rajesh to expand his patch into a series covering
other metadata?

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] CI job running functional against a mysql DB

2018-08-13 Thread Matthew Booth
I was reviewing https://review.openstack.org/#/c/504885/ . The change
looks good to me and I believe the test included exercises the root
cause of the problem. However, I'd like to be certain that the test
has been executed against MySQL rather than, eg, SQLite.

Zuul has voted +1 on the change. Can anybody tell me if any of those
jobs ran the included functional test against a MySQL DB?

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Matthew Booth
On 6 June 2018 at 13:55, Jay Pipes  wrote:
> On 06/06/2018 07:46 AM, Matthew Booth wrote:
>>
>> TL;DR I think we need to entirely disable swap volume for multiattach
>> volumes, and this will be an api breaking change with no immediate
>> workaround.
>>
>> I was looking through tempest and came across
>>
>> api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach.
>> This test does:
>>
>> Create 2 multiattach volumes
>> Create 2 servers
>> Attach volume 1 to both servers
>> ** Swap volume 1 for volume 2  on server 1 **
>> Check all is attached as expected
>>
>> The problem with this is that swap volume is a copy operation.
>
>
> Is it, though? The original blueprint and implementation seem to suggest
> that the swap_volume operation was nothing more than changing the mountpoint
> for a volume to point to a different location (in a safe
> manner that didn't lose any reads or writes).
>
> https://blueprints.launchpad.net/nova/+spec/volume-swap
>
> Nothing about the description of swap_volume() in the virt driver interface
> mentions swap_volume() being a "copy operation":
>
> https://github.com/openstack/nova/blob/76ec078d3781fb55c96d7aaca4fb73a74ce94d96/nova/virt/driver.py#L476
>
>> We don't just replace one volume with another, we copy the contents
>> from one to the other and then do the swap. We do this with a qemu
>> drive mirror operation, which is able to do this copy safely without
>> needing to make the source read-only because it can also track writes
>> to the source and ensure the target is updated again. Here's a link
>> to the libvirt logs showing a drive mirror operation during the swap
>> volume of an execution of the above test:
>
> After checking the source code, the libvirt virt driver is the only virt
> driver that implements swap_volume(), so it looks to me like a public HTTP
> API method was added that was specific to libvirt's implementation of drive
> mirroring. Yay, more implementation leaking out through the API.
>
>>
>> http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201
>>
>> The problem is that when the volume is attached to more than one VM,
>> the hypervisor doing the drive mirror *doesn't* know about writes on
>> the other attached VMs, so it can't do that copy safely, and the
>> result is data corruption.
>
>
> Would it be possible to swap the volume by doing what Vish originally
> described in the blueprint: pause the VM, swap the volume mountpoints
> (potentially after migrating the underlying volume), start the VM?
>
>>
>  Note that swap volume isn't visible to the
>>
>> guest os, so this can't be addressed by the user. This is a data
>> corrupter, and we shouldn't allow it. However, it is in released code
>> and users might be doing it already, so disabling it would be a
>> user-visible api change with no immediate workaround.
>
>
> I'd love to know who is actually using the swap_volume() functionality,
> actually. I'd especially like to know who is using swap_volume() with
> multiattach.
>
>> However, I think we're attempting to do the wrong thing here anyway,
>> and the above tempest test is explicitly testing behaviour that we don't
>> want. The use case for swap volume is that a user needs to move volume
>> data for attached volumes, e.g. to new faster/supported/maintained
>> hardware.
>
>
> Is that the use case?
>
> As was typical, there's no mention of a use case on the original blueprint.
> It just says "This feature allows a user or administrator to transparently
> swap out a cinder volume that connected to an instance." Which is hardly a
> use case since it uses the feature name in a description of the feature
> itself. :(
>
> The commit message (there was only a single commit for this functionality
> [1]) mentions overwriting data on the new volume:
>
>   Adds support for transparently swapping an attached volume with
>   another volume. Note that this overwrites all data on the new volume
>   with data from the old volume.
>
> Yes, that is the commit message in its entirety. Of course, the commit had
> no documentation at all in it, so there's no ability to understand what the
> original use case really was here.
>
> https://review.openstack.org/#/c/28995/
>
> If the use case was really "that a user needs to move volume data for
> attached volumes", why not just pause the VM, detach the volume, do a
> openstack volume migrate to the new destination, reattach the volume and
> start the VM? Th

[openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Matthew Booth
TL;DR I think we need to entirely disable swap volume for multiattach
volumes, and this will be an api breaking change with no immediate
workaround.

I was looking through tempest and came across
api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach.
This test does:

Create 2 multiattach volumes
Create 2 servers
Attach volume 1 to both servers
** Swap volume 1 for volume 2  on server 1 **
Check all is attached as expected

The problem with this is that swap volume is a copy operation. We
don't just replace one volume with another, we copy the contents from
one to the other and then do the swap. We do this with a qemu drive
mirror operation, which is able to do this copy safely without needing
to make the source read-only because it can also track writes to the
source and ensure the target is updated again. Here's a link to the
libvirt logs showing a drive mirror operation during the swap volume
of an execution of the above test:

http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201

The problem is that when the volume is attached to more than one VM,
the hypervisor doing the drive mirror *doesn't* know about writes on
the other attached VMs, so it can't do that copy safely, and the
result is data corruption. Note that swap volume isn't visible to the
guest os, so this can't be addressed by the user. This is a data
corrupter, and we shouldn't allow it. However, it is in released code
and users might be doing it already, so disabling it would be a
user-visible api change with no immediate workaround.

However, I think we're attempting to do the wrong thing here anyway,
and the above tempest test is explicitly testing behaviour that we don't
want. The use case for swap volume is that a user needs to move volume
data for attached volumes, e.g. to new faster/supported/maintained
hardware. With single attach that's exactly what they get: the end
user should never notice. With multi-attach they don't get that. We're
basically forking the shared volume at a point in time, with the
instance which did the swap writing to the new location while all
others continue writing to the old location. Except that even the fork
is broken, because they'll get a corrupt, inconsistent copy rather
than point in time. I can't think of a use case for this behaviour,
and it certainly doesn't meet the original design intent.

What they really want is for the multi-attached volume to be copied
from location a to location b and for all attachments to be updated.
Unfortunately I don't think we're going to be in a position to do that
any time soon, but I also think users will be unhappy if they're no
longer able to move data at all because it's multi-attach. We can
compromise, though, if we allow a multiattach volume to be moved as
long as it only has a single attachment. This means the operator can't
move this data without disruption to users, but at least it's not
fundamentally immovable.

This would require some cooperation with cinder to achieve, as we need
to be able to temporarily prevent cinder from allowing new
attachments. A natural way to achieve this would be to allow a
multi-attach volume with only a single attachment to be redesignated
not multiattach, but there might be others. The flow would then be:

Detach volume from server 2
Set multiattach=False on volume
Migrate volume on server 1
Set multiattach=True on volume
Attach volume to server 2

Combined with a patch to nova to disallow swap_volume on any
multiattach volume, this would then be possible if inconvenient.
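
The nova side of that patch would be tiny; conceptually it is just an
extra check before we allow the swap, something like (a sketch, names
illustrative):

  from nova import exception

  def _check_swap_volume_allowed(volume):
      # Refuse swap_volume for any volume marked multiattach, however
      # many attachments it currently has: the drive mirror can't be
      # made safe if a second attachment can exist during the copy.
      if volume.get('multiattach'):
          raise exception.Invalid(
              "swap volume is not supported for multiattach volumes")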

Regardless of any other changes, though, I think it's urgent that we
disable the ability to swap_volume a multiattach volume because we
don't want users to start using this relatively new, but broken,
feature.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Heads up for out-of-tree drivers: supports_recreate -> supports_evacuate

2018-04-19 Thread Matthew Booth
On 19 April 2018 at 16:46, Chris Friesen <chris.frie...@windriver.com> wrote:
> On 04/19/2018 08:33 AM, Jay Pipes wrote:
>>
>> On 04/19/2018 09:15 AM, Matthew Booth wrote:
>>>
>>> We've had inconsistent naming of recreate/evacuate in Nova for a long
>>> time, and it will persist in a couple of places for a while more.
>>> However, I've proposed the following to rename 'recreate' to
>>> 'evacuate' everywhere with no rpc/api impact here:
>>>
>>> https://review.openstack.org/560900
>>>
>>> One of the things which is renamed is the driver 'supports_recreate'
>>> capability, which I've renamed to 'supports_evacuate'. The above
>>> change updates this for in-tree drivers, but as noted in review this
>>> would impact out-of-tree drivers. If this might affect you, please
>>> follow the above in case it merges.
>>
>>
>> I have to admit, Matt, I'm a bit confused by this. I was under the
>> impression
>> that we were trying to *remove* uses of the term "evacuate" as much as
>> possible
>> because that term is not adequately descriptive of the operation and terms
>> like
>> "recreate" were more descriptive?
>
>
> This is a good point.
>
> Personally I'd prefer to see it go the other way and convert everything to
> the "recreate" terminology, including the external API.
>
> From the CLI perspective, it makes no sense that "nova evacuate" operates
> after a host is already down, but "nova evacuate-live" operates on a running
> host.

A bit OT, but evacuate-live probably shouldn't exist at all for a
variety of reasons. The implementation is shonky, it's doing
orchestration in the CLI, and the name is misleading, as you say.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Heads up for out-of-tree drivers: supports_recreate -> supports_evacuate

2018-04-19 Thread Matthew Booth
On 19 April 2018 at 15:33, Jay Pipes <jaypi...@gmail.com> wrote:
> On 04/19/2018 09:15 AM, Matthew Booth wrote:
>>
>> We've had inconsistent naming of recreate/evacuate in Nova for a long
>> time, and it will persist in a couple of places for a while more.
>> However, I've proposed the following to rename 'recreate' to
>> 'evacuate' everywhere with no rpc/api impact here:
>>
>> https://review.openstack.org/560900
>>
>> One of the things which is renamed is the driver 'supports_recreate'
>> capability, which I've renamed to 'supports_evacuate'. The above
>> change updates this for in-tree drivers, but as noted in review this
>> would impact out-of-tree drivers. If this might affect you, please
>> follow the above in case it merges.
>
>
> I have to admit, Matt, I'm a bit confused by this. I was under the
> impression that we were trying to *remove* uses of the term "evacuate" as
> much as possible because that term is not adequately descriptive of the
> operation and terms like "recreate" were more descriptive?

I'm ambivalent, tbh, but I think it's better to pick one. I thought
we'd picked 'evacuate' based on the TODOs from Matt R:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n2985
http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n3093

Incidentally, this isn't at all core to what I'm working on, but I'm
about to start poking it and thought I'd tidy up as I go (as is my
wont). If there's discussion to be had I don't mind dropping this and
moving on.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Heads up for out-of-tree drivers: supports_recreate -> supports_evacuate

2018-04-19 Thread Matthew Booth
We've had inconsistent naming of recreate/evacuate in Nova for a long
time, and it will persist in a couple of places for a while more.
However, I've proposed the following to rename 'recreate' to
'evacuate' everywhere with no rpc/api impact here:

https://review.openstack.org/560900

One of the things which is renamed is the driver 'supports_recreate'
capability, which I've renamed to 'supports_evacuate'. The above
change updates this for in-tree drivers, but as noted in review this
would impact out-of-tree drivers. If this might affect you, please
follow the above in case it merges.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] z/VM introducing a new config driveformat

2018-04-18 Thread Matthew Booth
On 18 April 2018 at 15:10, Dan Smith <d...@danplanet.com> wrote:
>> Thanks for the concern and I fully understand it. The major reason is
>> that cloud-init doesn't have a hook or plugin before it starts to read
>> the config drive (ISO disk). z/VM is an old hypervisor and has no way,
>> as libvirt does, to define an ISO-format disk in an xml definition;
>> instead, it defines disks in the definition of the virtual machine and
>> lets the VM decide their format.
>>
>> So we need a way to tell cloud-init where to find the ISO file before
>> cloud-init starts, but without AE we can't handle that. Some update on
>> the spec is here for further information:
>> https://review.openstack.org/#/c/562154/
>
> The ISO format does not come from telling libvirt something about
> it. The host creates and formats the image, adds the data, and then
> attaches it to the instance. The latter part is the only step that
> involves configuring libvirt to attach the image to the instance. The
> rest is just stuff done by nova-compute (and the virt driver) on the
> linux system it's running on. That's the same arrangement as your
> driver, AFAICT.
>
> You're asking the hypervisor (or something running on it) to
> grab the image from glance, pre-filled with data. This is no different,
> except that the configdrive image comes from the system running the
> compute service. I don't see how it's any different in actual hypervisor
> mechanics, and thus feel like there _has_ to be a way to do this without
> the AE magic agent.

Having briefly read the cloud-init snippet which was linked earlier in
this thread, the requirement seems to be that the guest exposes the
device as /dev/srX or /dev/cdX. So I guess in order to make this work:

* You need to tell z/VM to expose the virtual disk as an optical disk
* The z/VM kernel needs to call optical disks /dev/srX or /dev/cdX

> I agree with Mikal that needing more agent behavior than cloud-init does
> a disservice to the users.
>
> I feel like we get a lot of "but no, my hypervisor is special!"
> reasoning when people go to add a driver to nova. So far, I think
> they're a lot more similar than people think. Ironic is the weirdest one
> we have (IMHO and no offense to the ironic folks) and it can support
> configdrive properly.

I was going to ask this. Even if the contents of the disk can't be
transferred in advance... how does ironic do this? There must be a
way.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [cinder][nova] about re-image the volume

2018-04-06 Thread Matthew Booth
On 6 April 2018 at 09:31, Gorka Eguileor <gegui...@redhat.com> wrote:
> On 05/04, Matt Riedemann wrote:
>> On 4/5/2018 3:15 AM, Gorka Eguileor wrote:
>> > But just to be clear, Nova will have to initialize the connection with
> > the re-imaged volume and attach it again to the node, as in all cases
>> > (except when defaulting to downloading the image and dd-ing it to the
>> > volume) the result will be a new volume in the backend.
>>
>> Yeah I think I pointed this out earlier in this thread on what I thought the
>> steps would be on the nova side with respect to creating a new empty
>> attachment to keep the volume 'reserved' while we delete the old attachment,
>> re-image the volume, and then update the volume attachment for the new
>> connection. I think that would be similar to how shelve and unshelve works
>> in nova.
>>
>> Would this really require a swap volume call from Cinder? I'd hope not since
>> swap volume in itself is a pretty gross operation on the nova side.
>>
>> --
>>
>> Thanks,
>>
>> Matt
>>
>
> Hi Matt,
>
> Yes, it will require a volume swap, with the worst case scenario
> exception where we dd the image into the volume.

I think you're talking at cross purposes here: this won't require a
swap volume. Apart from anything else, swap volume only works on an
attached volume, and as previously discussed Nova will detach and
re-attach.

Gorka, the Nova api Matt is referring to is called volume update
externally. It's the operation required for live migrating an attached
volume between backends. It's called swap volume internally in Nova.

Matt

>
> In the same way that anyone would expect a re-imaging preserving the
> volume id, one would also expect it to behave like creating a new volume
> from the same image: be as fast and take up as much space on the
> backend.
>
> And to do so we have to use existing optimized mechanisms that will only
> work when creating a new volume.
>
> The alternative would be to have the worst case scenario as the default
> (attach and dd the image) and make *ALL* Cinder drivers implement the
> optimized mechanism where they can efficiently re-image a volume.  I
> can't talk for the Cinder team, but I for one would oppose this
> alternative.
>
> Cheers,
> Gorka.
>
> ______
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] about rebuild instance booted from volume

2018-03-14 Thread Matthew Booth
On 14 March 2018 at 13:46, Matt Riedemann <mriede...@gmail.com> wrote:

> On 3/14/2018 3:42 AM, 李杰 wrote:
>
>>
>>  This is the spec about rebuilding an instance booted from
>> volume. In the spec, there is a question about whether we should
>> delete the old root_volume. Anyone who is interested in boot from
>> volume can help to review this. Any suggestion is welcome. Thank you!
>> The link is here.
>> Re: the rebuild spec: https://review.openstack.org/#/c/532407/
>>
>
> Copying the operators list and giving some more context.
>
> This spec is proposing to add support for rebuild with a new image for
> volume-backed servers, which today is just a 400 failure in the API since
> the compute doesn't support that scenario.
>
> With the proposed solution, the backing root volume would be deleted and a
> new volume would be created from the new image, similar to how boot from
> volume works.
>
> The question raised in the spec is whether or not nova should delete the
> root volume even if its delete_on_termination flag is set to False. The
> semantics get a bit weird here since that flag was not meant for this
> scenario, it's meant to be used when deleting the server to which the
> volume is attached. Rebuilding a server is not deleting it, but we would
> need to replace the root volume, so what do we do with the volume we're
> replacing?
>
> Do we say that delete_on_termination only applies to deleting a server and
> not rebuild and therefore nova can delete the root volume during a rebuild?
>
> If we don't delete the volume during rebuild, we could end up leaving a
> lot of volumes lying around that the user then has to clean up, otherwise
> they'll eventually go over quota.
>
> We need user (and operator) feedback on this issue and what they would
> expect to happen.
>

My 2c was to overwrite, not delete the volume[1]. I believe this preserves
both sets of semantics: the server is rebuilt, and the volume is not
deleted. The user will still lose their data, of course, but that's implied
by the rebuild they explicitly requested. The volume id will remain the
same.

[1] I suspect this would require new functionality in cinder to
re-initialize from image.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] [Cyborg] Tracking multiple functions

2018-03-03 Thread Matthew Booth
On 2 March 2018 at 14:31, Jay Pipes <jaypi...@gmail.com> wrote:

> On 03/02/2018 02:00 PM, Nadathur, Sundar wrote:
>
>> Hello Nova team,
>>
>>  During the Cyborg discussion at Rocky PTG, we proposed a flow for
>> FPGAs wherein the request spec asks for a device type as a resource class,
>> and optionally a function (such as encryption) in the extra specs. This
>> does not seem to work well for the usage model that I’ll describe below.
>>
>> An FPGA device may implement more than one function. For example, it may
>> implement both compression and encryption. Say a cluster has 10 devices of
>> device type X, and each of them is programmed to offer 2 instances of
>> function A and 4 instances of function B. More specifically, the device may
>> implement 6 PCI functions, with 2 of them tied to function A, and the other
>> 4 tied to function B. So, we could have 6 separate instances accessing
>> functions on the same device.
>>
>
Does this imply that Cyborg can't reprogram the FPGA at all?


>
>> In the current flow, the device type X is modeled as a resource class, so
>> Placement will count how many of them are in use. A flavor for ‘RC
>> device-type-X + function A’ will consume one instance of the RC
>> device-type-X.  But this is not right because this precludes other
>> functions on the same device instance from getting used.
>>
>> One way to solve this is to declare functions A and B as resource classes
>> themselves and have the flavor request the function RC. Placement will then
>> correctly count the function instances. However, there is still a problem:
>> if the requested function A is not available, Placement will return an
>> empty list of RPs, but we need some way to reprogram some device to create
>> an instance of function A.
>>
>
> Clearly, nova is not going to be reprogramming devices with an instance of
> a particular function.
>
> Cyborg might need to have a separate agent that listens to the nova
> notifications queue and upon seeing an event that indicates a failed build
> due to lack of resources, then Cyborg can try and reprogram a device and
> then try rebuilding the original request.
>

It was my understanding from that discussion that we intend to insert
Cyborg into the spawn workflow for device configuration in the same way
that we currently insert resources provided by Cinder and Neutron. So while
Nova won't be reprogramming a device, it will be calling out to Cyborg to
reprogram a device, and waiting while that happens.

My understanding is (and I concede some areas are a little hazy):

* The flavors says device type X with function Y
* Placement tells us everywhere with device type X
* A weigher orders these by devices which already have an available
function Y (where is this metadata stored?)
* Nova schedules to host Z
* Nova host Z asks cyborg for a local function Y and blocks
  * Cyborg hopefully returns function Y which is already available
  * If not, Cyborg reprograms a function Y, then returns it

Can anybody correct me/fill in the gaps?

Matt


-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [cinder][nova] Update attachments on replication failover

2018-02-27 Thread Matthew Booth
Couple of thoughts:

Sounds like the work Nova will have to do is identical to volume update
(swap volume). i.e. Change where a disk's backing store is without actually
changing the disk.

Multi-attach! There might be more than 1 instance per volume, and we can't
currently support volume update for multi-attached volumes.

Matt

On 27 February 2018 at 09:45, Matt Riedemann <mriede...@gmail.com> wrote:

> On 2/26/2018 9:52 PM, John Griffith wrote:
>
>> Yeah, it seems like this would be pretty handy with what's there.  So
>> are folks good with that?  Wanted to make sure there's nothing contentious
>> there before I propose a spec on the Nova and Cinder sides. If you think it
>> seems at least worth proposing I'll work on it and get something ready as a
>> welcome home from Dublin gift for everyone :)
>>
>
> I'll put it on the nova/cinder PTG etherpad agenda for Thursday morning.
> This seems like simple plumbing on the nova side, so not any major problems
> from me.
>
> --
>
> Thanks,
>
> Matt
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Requesting eyes on fix for bug 1686703

2018-02-01 Thread Matthew Booth
On 31 January 2018 at 16:32, Matt Riedemann <mriede...@gmail.com> wrote:

> On 1/31/2018 7:30 AM, Matthew Booth wrote:
>
>> Could I please have some eyes on this bugfix:
>> https://review.openstack.org/#/c/462521/ . I addressed an issue raised
>> in August 2017, and it's had no negative feedback since. It would be good
>> to get this one finished.
>>
>
> First, I'd like to avoid setting a precedent of asking for reviews in the
> ML. So please don't do this.
>

I don't generally do this, but I think a polite request after 6 months or
so is reasonable when something has fallen through the cracks.


> Second, this is a latent issue, and we're less than two weeks to RC1, so
> I'd prefer that we hold this off until Rocky opens up in case it introduces
> any regressions so we at least have time to deal with those when we're not
> in stop-ship mode.
>

That's fine. Looks like I have new feedback to address in the meantime
anyway.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Requesting eyes on fix for bug 1686703

2018-01-31 Thread Matthew Booth
Could I please have some eyes on this bugfix:
https://review.openstack.org/#/c/462521/ . I addressed an issue raised in
August 2017, and it's had no negative feedback since. It would be good to
get this one finished.

Thanks,

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova]Nova rescue inject pasword failed

2018-01-29 Thread Matthew Booth
On 29 January 2018 at 09:27, 李杰 <li...@unitedstack.com> wrote:

>  Hi, all:
>   I want to access my instance in the rescue state using the
> temporary password which nova rescue gave me, but this password doesn't
> work. Can I ask how this password is injected into the instance? I can't
> find any specification of how it is done. I looked at the rescue code,
> and it indicates the password has been injected.
>   I use libvirt as the virt driver. The web said to set
> "[libvirt]inject_password=true", but it didn't work. Is it a bug? Can
> you give me some advice? Help in troubleshooting this issue will be
> appreciated.
>

Ideally your rescue image will support cloud-init and you would use a
config disk. For password injection to work you need inject_password=True,
inject_partition=-1 (*NOT* -2, which is the default), and for libguestfs to
be correctly installed on your compute hosts.
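
In other words, something like this on the compute nodes (illustrative
snippet):

  [libvirt]
  # Both are required for password injection into the rescue image:
  inject_password = True
  # -1 means inspect the image and pick the root partition; the default
  # of -2 disables file and password injection entirely.
  inject_partition = -1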

But to reiterate, ideally your rescue image would support cloud-init and
you would use a config disk.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Native QEMU LUKS decryption review overview ahead of FF

2018-01-25 Thread Matthew Booth
On 24 January 2018 at 22:57, Matt Riedemann <mriede...@gmail.com> wrote:

> On 1/22/2018 8:22 AM, Lee Yarwood wrote:
>
>> Hello,
>>
>> With M3 and FF rapidly approaching this week I wanted to post a brief
>> overview of the QEMU native LUKS series.
>>
>> The full series is available on the following topic, I'll go into more
>> detail on each of the changes below:
>>
>> https://review.openstack.org/#/q/topic:bp/libvirt-qemu-nativ
>> e-luks+status:open
>>
>> libvirt: Collocate encryptor and volume driver calls
>> https://review.openstack.org/#/c/460243/ (Missing final +2 and +W)
>>
>> This refactor of the Libvirt driver connect and disconnect volume code
>> has the added benefit of also correcting a number of bugs around the
>> attaching and detaching of os-brick encryptors. IMHO this would be
>> useful in Queens even if the rest of the series doesn't land.
>>
>> libvirt: Introduce disk encryption config classes
>> https://review.openstack.org/#/c/464008/ (Missing final +2 and +W)
>>
>> This is the most straightforward change of the series and simply
>> introduces the required config classes to wire up native LUKS decryption
>> within the domain XML of an instance. Hopefully nothing controversial.
>>
>> libvirt: QEMU native LUKS decryption for encrypted volumes
>> https://review.openstack.org/#/c/523958/ (Missing both +2s and +W)
>>
>> This change carries the bulk of the implementation, wiring up encrypted
>> volumes during their initial attachment. The commit message has a
>> detailed run down of the various upgrade and LM corner cases we attempt
>> to handle here, such as LM from a P to Q compute, detaching a P attached
>> encrypted volume after upgrading to Q etc.
>>
>> Upgrade and LM testing is enabled by the following changes:
>>
>> fixed_key: Use a single hardcoded key across devstack deployments
>> https://review.openstack.org/#/c/536343/
>>
>> compute: Introduce an encrypted volume LM test
>> https://review.openstack.org/#/c/536177/
>>
>> This is being tested by tempest-dsvm-multinode-live-migration and
>> grenade-dsvm-neutron-multinode-live-migration in the following DNM Nova
>> change, enabling volume backed LM tests:
>>
>> DNM: Test LM with encrypted volumes
>> https://review.openstack.org/#/c/536350/
>>
>> Hopefully that covers everything but please feel free to ping if you
>> would like more detail, background etc. Thanks in advance,
>>
>> Lee
>>
>>
>>
>> 
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscrib
>> e
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
> The patch is already approved, and I asked melwitt to write a release
> note, at which point it was noted that swap volume will not work with
> native luks encrypted volumes. That's a regression.
>

It's only a regression since swap_volume with encrypted volumes was fixed
in https://review.openstack.org/#/c/460243/, which landed on Monday as part
of this series. Prior to Monday, swap_volume with encrypted volumes would
result in the raw encrypted volume being presented to the guest after the
swap.

We need to at least report a nova bug for this so we can work on some kind
> of fallback to the non-native decryption until there is a libvirt/qemu fix
> upstream and we can put version conditionals in place for when we can
> support swap volume with a native luks-encrypted volume.
>

In the context of the above, I don't think this is a priority as clearly
nobody is currently doing it. There's already a bug to track the problem in
libvirt, which is linked in a code comment. Admittedly that BZ is
unnecessarily private, which I noted in review, but we've reached out to
the author to ask them to open it up as there's nothing sensitive going on
in there.

In general, anything qemu can do natively makes Nova both simpler and more
robust because we don't have to modify the host configuration. This
eliminates a whole class of error states and race conditions, because when
we kill qemu there's nothing left to clean up.

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Local disk serial numbers series for reviewers: update 9th Jan

2018-01-19 Thread Matthew Booth
On 9 January 2018 at 15:28, Matthew Booth <mbo...@redhat.com> wrote:

> In summary, the patch series is here:
>
> https://review.openstack.org/#/q/status:open+project:opensta
> ck/nova+branch:master+topic:bp/local-disk-serial-numbers
>
> The bottom 3 patches, which add BDM.uuid have landed. The next 3 currently
> have a single +2. Since I last posted I have found and fixed a problem in
> swap_volume, which added 2 more patches to the series. There are currently
> 13 outstanding patches in the series.
>
> The following 6 patches are the 'crux' patches. The others in the series
> are related fixes/cleanups (mostly renaming things and fixing tests) which
> I've moved into separate patches to reduce noise.
>
> Add DriverLocalImageBlockDevice:
> https://review.openstack.org/#/c/526347/6
>

This now has a +2!

Add local_root to block_device_info:
> https://review.openstack.org/#/c/529029/6
>
> Pass DriverBlockDevice to driver.attach_volume
> https://review.openstack.org/#/c/528363/
>
> Expose volume host type and path independent of libvirt config
> https://review.openstack.org/#/c/530786/
>
> Don't generate fake disk_info in swap_volume
> https://review.openstack.org/#/c/530787/
>
> Local disk serial numbers for the libvirt driver
> https://review.openstack.org/#/c/529380/
>

These remain the crux patches.

Some of the simpler cleanup patches have also attracted +2s. As I've had to
rebase the series due to merge conflicts a couple of times, I've moved
these to the bottom so they can potentially go straight in. Cleanup patches
with a single +2 already:

https://review.openstack.org/#/c/531179/
https://review.openstack.org/#/c/526346/
https://review.openstack.org/#/c/528362/

Thanks,

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Local disk serial numbers series for reviewers: update 9th Jan

2018-01-09 Thread Matthew Booth
In summary, the patch series is here:

https://review.openstack.org/#/q/status:open+project:
openstack/nova+branch:master+topic:bp/local-disk-serial-numbers

The bottom 3 patches, which add BDM.uuid have landed. The next 3 currently
have a single +2. Since I last posted I have found and fixed a problem in
swap_volume, which added 2 more patches to the series. There are currently
13 outstanding patches in the series.

The following 6 patches are the 'crux' patches. The others in the series
are related fixes/cleanups (mostly renaming things and fixing tests) which
I've moved into separate patches to reduce noise.

Add DriverLocalImageBlockDevice:
https://review.openstack.org/#/c/526347/6

Add local_root to block_device_info:
https://review.openstack.org/#/c/529029/6

Pass DriverBlockDevice to driver.attach_volume
https://review.openstack.org/#/c/528363/

Expose volume host type and path independent of libvirt config
https://review.openstack.org/#/c/530786/

Don't generate fake disk_info in swap_volume
https://review.openstack.org/#/c/530787/

Local disk serial numbers for the libvirt driver
https://review.openstack.org/#/c/529380/

Thanks,

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Overview of local disk serial numbers series for reviewers

2017-12-21 Thread Matthew Booth
This series is ready for review. As normal I've done my best to break it up
into logically separate changes, so it's currently 15 patches long. I've
also done my best to write commit messages with the reviewer in mind. If
something looks weird, please check if I called it out in the commit
message.

This email is an overview of the series to help see all 15 patches in
context. The spec is here:

https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/local-disk-serial-numbers.html

The patch series is here:

https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/local-disk-serial-numbers

* Add a uuid to BlockDeviceMapping

  https://review.openstack.org/242602 Add uuid column to BlockDeviceMapping
  https://review.openstack.org/242603 Make BlockDeviceMapping object
support uuid
  https://review.openstack.org/525599 Add an online migration for BDM.uuid

The first 2 were actually originally Dan Smith's patches from 2015,
resurrected for the second time. I've cribbed heavily from the original
patches and also the recent migration uuid patches. I've added a bunch of
robustification to the potentially racy bits, though, so they differ in
some details.

We want the libvirt driver to use this uuid as a serial number when
creating a guest. We now need to expose it to the libvirt driver for all
disks.

* Expose BlockDeviceMapping.uuid to drivers

  https://review.openstack.org/524167 DriverBlockDevice: make subclasses
inherit _proxy_as_attr
  https://review.openstack.org/529037 Expose BDM uuid to drivers

Drivers don't consume BlockDeviceMapping directly. Instead,
BlockDeviceMapping objects are translated to DriverBlockDevice objects and
passed in the block_device_info struct. These 2 changes simply add uuid to
existing DriverBlockDevice objects.

* Expose the local root disk BDM to drivers

  https://review.openstack.org/526346 Give volume DriverBlockDevice classes
a common prefix
  https://review.openstack.org/526347 Add DriverLocalImageBlockDevice
  https://review.openstack.org/529028 Rename block_device_info_get_root
  https://review.openstack.org/529029 Add local_root to block_device_info

Unfortunately, for reasons I'm still not entirely clear on,
block_device_info has never contained a DriverBlockDevice object
representing a local root disk. The patches create a new DriverBlockDevice
class for a local root disk, and add it to block_device_info. We do this in
such a way that drivers which don't expect it to be there will never notice
it.

At this point in the series, drivers have access to BDM.uuid for all of an
instance's disks.

* Fix an array of incorrect uses of DriverBlockDevice and block_device_info

  https://review.openstack.org/528362 Expose driver_block_device fields as
attributes
  https://review.openstack.org/528363 Pass DriverBlockDevice to
driver.attach_volume
  https://review.openstack.org/527916 Use real block_device_info data in
libvirt tests
  https://review.openstack.org/529328 Fix libvirt volume tests passing
invalid disk_info

We're going to make changes which require BDM.uuid to be available.
Unfortunately we hardcode BDM data all over the place. That's all going to
break, even when the change isn't relevant to the test in question, which
it isn't 95% of the time. This mini-series focuses on using real data
everywhere we can. This reduces a ton of noise later.

The specific driver for the second patch in the series (to
driver.attach_volume) is that the libvirt driver needs a DriverBlockDevice
there, and was creating a fake one in non-test code.

At this point in the series all relevant test and non-test code is actually
using the new data. This means we can make changes in the libvirt driver
which require this new data without breaking everything.

* Implement local disk serial numbers for the libvirt driver

  https://review.openstack.org/529329 Pass disk_info dict to libvirt_info
  https://review.openstack.org/529380 Local disk serial numbers for the
libvirt driver

Here we finally make the libvirt driver-specific changes to expose BDM uuid
as a serial number for local disks.
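
To illustrate the end state (a sketch only; the names here are illustrative
rather than lifted from the actual patches), the driver-side change amounts
to something like:

    # Sketch only: 'conf' stands in for the libvirt guest disk config
    # object and 'bdm' for the driver block device passed to the driver.
    def _set_local_disk_serial(conf, bdm):
        # Expose the BDM uuid as the disk's serial number, so the guest
        # sees a stable identifier for local disks.
        if bdm.uuid:
            conf.serial = bdm.uuid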

Matt
-- 
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Boston Forum session recap - searchlight integration

2017-05-22 Thread Matthew Booth



-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Multicell paginated instance lists

2017-05-09 Thread Matthew Booth
Following the forum session yesterday I've thrown up this pseudocode for
discussion:

https://review.openstack.org/463618

I want to stress that I consider nova proxying to searchlight to be a
premature optimisation rather than an intrinsically terrible idea. That
said, the searchlight solution looks very complex for both developers and
operators, and very fragile. I think we'd be better off going with a
relatively simple solution like this one first, and only going a couple of
orders of magnitude more complex if it turns out to be absolutely essential.
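
To give a sense of the scale of 'relatively simple', a merge-sort across
per-cell queries is roughly the shape of thing under discussion. This is an
illustrative Python sketch only, not the pseudocode linked above, and
query_cell is a hypothetical helper:

    import heapq
    import itertools
    from operator import itemgetter

    def list_instances(context, cells, sort_key, limit, query_cell):
        # Ask every cell for up to 'limit' results sorted on sort_key,
        # then merge-sort the per-cell result lists and trim to one page.
        per_cell = [query_cell(context, cell, sort_key=sort_key, limit=limit)
                    for cell in cells]
        merged = heapq.merge(*per_cell, key=itemgetter(sort_key))
        return list(itertools.islice(merged, limit))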

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] User data loss bug on error during resize (migration datamodel issue)

2017-04-27 Thread Matthew Booth
(Sorry: resent properly tagged)

We've encountered a bug in resize which resulted in data loss. The gist is
that the user was resizing a qcow2 instance whose image had been deleted
from glance. In driver.finish_migration on the destination host, an error
occurred attempting to copy the image from the source host's image cache,
putting the instance into an error state. Note that instance.host has been
set to the destination host before finish_migration runs. When the image
cache cleanup ran on the source host, the instance was no longer in the
list of expected instances on that host because instance.host == dest. The
image cache manager expired the image from the cache, and there was no
other copy of the image.

Let's ignore the root cause of the side-loading error, because that's the
type of transient error which can always occur. I'm looking for a way to
avoid deleting the image from the image cache in future until the resize
operation has completed. The obvious way to do this is to update the
instance list generated in ComputeManager._run_image_cache_manager_pass to
consider not only instances where instance.host is in the node list, but
also any instance with a migration record where source/dest is in the node
list.
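
Roughly (a conceptual sketch only, not the actual change; the object API
calls are approximate):

    from nova import objects

    def _instances_to_preserve(context, host, nodename):
        # Instances assigned to this node, as today.
        instances = objects.InstanceList.get_by_host_and_node(
            context, host, nodename)
        # Additionally, any instance with an active migration record whose
        # source or dest is this node. Identifying which migration records
        # are 'currently relevant' is the open question discussed below.
        migrations = objects.MigrationList.get_in_progress_by_host_and_node(
            context, host, nodename)
        extra_uuids = {m.instance_uuid for m in migrations}
        extra = objects.InstanceList.get_by_filters(
            context, {'uuid': list(extra_uuids)})
        known = {i.uuid for i in instances}
        return list(instances) + [i for i in extra if i.uuid not in known]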

The problem with this is that the data model doesn't seem to allow us to
fetch the currently active migration. Following the error above, the
errors_out_migration decorator on finish_resize has set the migration to an
error state. AFAICT this is never deleted, so the presence of a migration
in an error state only means that a migration involving this instance has
occurred in the past. It doesn't mean that it's currently relevant, so it's
basically meaningless.

Firstly, have I missed any semantics of the migration record which might
allow to me to unambiguously identify currently relevant migrations,
whether in an error state or otherwise? That would be ideal, and I'd just
go with that.

If not, how about adding an active migration field to instance? I don't
think it would ever make sense to have more than 1 current migration for a
given instance. It would be set back to NULL when the migration was
complete, and we'd at least have an opportunity to do something explicit
with migrations in an error state.

In the meantime I'm going to look for more backportable avenues to fix
this. Perhaps not updating instance.host until after finish_migration.

Matt

-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] User data loss bug on error during resize (migration datamodel issue)

2017-04-27 Thread Matthew Booth
We've encountered a bug in resize which resulted in data loss. The gist is
that the user was resizing a qcow2 instance whose image had been deleted
from glance. In driver.finish_migration on the destination host, an error
occurred attempting to copy the image from the source host's image cache,
putting the instance into an error state. Note that instance.host has been
set to the destination host before finish_migration runs. When the image
cache cleanup ran on the source host, the instance was no longer in the
list of expected instances on that host because instance.host == dest. The
image cache manager expired the image from the cache, and there was no
other copy of the image.

Let's ignore the root cause of the side-loading error, because that's the
type of transient error which can always occur. I'm looking for a way to
avoid deleting the image from the image cache in future until the resize
operation has completed. The obvious way to do this is to update the
instance list generated in ComputeManager._run_image_cache_manager_pass to
consider not only instances where instance.host is in the node list, but
also any instance with a migration record where source/dest is in the node
list.

The problem with this is that the data model doesn't seem to allow us to
fetch the currently active migration. Following the error above, the
errors_out_migration decorator on finish_resize has set the migration to an
error state. AFAICT this is never deleted, so the presence of a migration
in an error state only means that a migration involving this instance has
occurred in the past. It doesn't mean that it's currently relevant, so it's
basically meaningless.

Firstly, have I missed any semantics of the migration record which might
allow to me to unambiguously identify currently relevant migrations,
whether in an error state or otherwise? That would be ideal, and I'd just
go with that.

If not, how about adding an active migration field to instance? I don't
think it would ever make sense to have more than 1 current migration for a
given instance. It would be set back to NULL when the migration was
complete, and we'd at least have an opportunity to do something explicit
with migrations in an error state.

In the meantime I'm going to look for more backportable avenues to fix
this. Perhaps not updating instance.host until after finish_migration.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][ceilometer][postgresql][gate][telemetry] PostgreSQL gate failure (again)

2017-02-03 Thread Matthew Booth
On Thu, Feb 2, 2017 at 4:42 PM, Sean Dague <s...@dague.net> wrote:

> On 02/02/2017 10:33 AM, Mike Bayer wrote:
> >
> >
> > On 02/01/2017 10:22 AM, Monty Taylor wrote:
> >>
> >> I personally continue to be of the opinion that without an explicit
> >> vocal and well-staffed champion, supporting postgres is more trouble
> >> than it is worth. The vast majority of OpenStack deployments are on
> >> MySQL - and what's more, the code is written with MySQL in mind.
> >> Postgres and MySQL have different trade offs, different things each are
> >> good at and different places in which each has weakness. By attempting
> >> to support Postgres AND MySQL, we prevent ourselves from focusing
> >> adequate attention on making sure that our support for one of them is
> >> top-notch and in keeping with best practices for that database.
> >>
> >> So let me state my opinion slightly differently. I think we should
> >> support one and only one RDBMS backend for OpenStack, and we should open
> >> ourselves up to use advanced techniques for that backend. I don't
> >> actually care whether that DB is MySQL or Postgres - but the corpus of
> >> existing deployments on MySQL and the existing gate jobs I think make
> >> the choice one way or the other simple.
> >
> >
> > well, let me blow your mind and agree, but noting that this means, *we
> > drop SQLite also*.   IMO every openstack developer should have
> > MySQL/MariaDB running on their machine and that is part of what runs if
> > you expect to run database-related unit tests.   Targeting just one
> > database is very handy but if you really want to use the features
> > without roadblocks, you need to go all the way.
>
> That's all fine and good, we just need to rewrite about 100,000 unit
> tests to do that. I'm totally cool with someone taking that task on, but
> making a decision about postgresql shouldn't be filibustered on
> rewriting all the unit tests in OpenStack because of the ways we use
> sqlite.
>

I wrote a patch series to optionally run all our unit tests using MySQL
instead of sqlite a couple of years ago, and it wasn't that hard at the
time. The biggest issue I recall was fixing up tests which assumed
sub-second timestamp granularity which MySQL did not support at the time
(but may now).

IIRC the series died because we killed the fixture I was using in oslo.db
without replacement before my series finished landing. Fundamentally wasn't
that hard, though.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Status of imagebackend refactor patches

2016-11-29 Thread Matthew Booth
I'm continuing to work through updating the imagebackend refactor patches.
A bunch of them at the top of the queue already have +2 from jaypipes, who
expressed an interest in getting some of these merged. I think the
following could be ready to go (in order):

https://review.openstack.org/#/c/337159/  libvirt: Rewrite
_test_finish_migration [1]
https://review.openstack.org/#/c/338993/  libvirt: Test disk creation in
test_hard_reboot
https://review.openstack.org/#/c/339114/  libvirt: Cleanup
test_create_configdrive
https://review.openstack.org/#/c/333272/  libvirt: Rename Backend snapshot
and image [2]
https://review.openstack.org/#/c/331115/  libvirt: Never copy a swap disk
during cold migration
https://review.openstack.org/#/c/331118/  libvirt: Don't re-resize disks in
finish_migration() [3]

[1] This has -1 from melwitt, but I don't personally think this issue is
worth a respin. However, she mentioned on IRC that there may be more, so
that could be moot. Anyway, being the top of the queue this is obviously a
blocker.

[2] This doesn't have +2 from jaypipes, although it's uncontroversial and
uncomplicated. Hopefully just an oversight?

[3] This patch removes the function which melwitt identified in [1] lost
some test coverage, which is why I'm hoping not to respin for that.

I'll leave it there for the moment, because I need to talk through the next
patch with ftersin. It looks like it will require a release note at least,
if not a change.

If anybody would like to slog through some of the above and add a second +2
I'd be very grateful. There's plenty more in the queue after those!

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Follow up on BCN review cadence discussions

2016-11-07 Thread Matthew Booth
I'd like to follow up on the discussions we had in Barcelona around review
cadence. I took a lot away from these discussions, and I thought they were
extremely productive. I think the summary of the concerns was:

  Nova is a complex beast, very few people know even most of it well.
  There are areas of Nova where mistakes are costly and hard to rectify
later.
  A large amount of good code does not merge quickly.
  The pool of active core reviewers is very much smaller than the total
pool of contributors.
  The barrier to entry for Nova core is very high.

There was more, but I think most people agreed on the above.

Just before summit I pitched a subsystem maintainer model here:

http://lists.openstack.org/pipermail/openstack-dev/2016-September/104093.html

Reading over this again, I still think this is worth trialling: it pushes
the burden of initial detailed review further out, whilst ensuring that a
core will still check that no wider issues have been missed. Bearing in
mind that the primary purpose is to reduce the burden on core reviewers of
any individual patch, I think this will work best if we aggressively push
initial review down as far as possible.

I'm still not sure what the best description of 'domain' is, though. Other
projects seem to have defined this as: the reviewer should, in good faith,
know when they're competent to give a +2 and when they're not. I suspect
this is too loose for us, but I'm sure we could come up with something.

I'd expect the review burden of a maintainer of an active area of code to
be as heavy as a core reviewer's, so this probably shouldn't be taken
lightly. If we were to trial it, I'd also recommend starting with a small
number of maintainers (about 3, perhaps), preferably technical-leaders of
active sub-teams. Any trial should be the topic of a retrospective at the
PTG.

I'd like to re-iterate that what I'd personally like to achieve is:

  "Good code should merge quickly."

There are likely other ways to achieve this, and if any of them are better
than this one then I'm in favour. However, I think we can try this one with
limited risk and initial up-front effort.

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Moving storage placement out of the driver

2016-11-01 Thread Matthew Booth
For context, this is speculative: should we, shouldn't we?

The VMware driver currently allows the user to specify what storage their
instance will use, I believe using a flavor extra-spec. We've also got
significant interest in adding the same to the libvirt driver, but we've
got some additional plumbing work to do first to allow us to persist this
metadata per-instance, independent of the host configuration. It was my
intention to add this logic to the libvirt driver along similar lines.

At the same time, we're adding resource providers. It's my layman's
understanding that a storage pool (in the abstract, not referring to a
particular implementation) will be modelled as a resource provider. A host
can, theoretically at least, have arbitrarily many of these, and one may be
attached to multiple hosts.

As it stands today, ComputeManager specifies what disks an instance should
have, and how large they should be. Where to put them is down to the
driver. If we're modelling this outside the driver and at least 2 drivers
are implementing it, I wonder if we shouldn't be implementing storage
policy at a higher level than the driver.

Thoughts?

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [libvirt] Debugging blockRebase() - "active block copy not ready for pivot"

2016-10-11 Thread Matthew Booth
On Mon, Oct 10, 2016 at 5:54 PM, Matt Riedemann <mrie...@linux.vnet.ibm.com>
wrote:

> On 10/10/2016 6:26 AM, Kashyap Chamarthy wrote:
>
>>
>> Also, maybe you have noticed, given Eric's reply on this thread (and the
>> upstream libvir-list), it is agreed that virDomainGetBlockJobInfo()
>> should be fixed "to quit reporting cur==end when ready:false".  This
>> allows the existing Nova code work correctly without any changes.
>>
>> Discusion on Eric's reply in this thread:
>>
>> http://lists.openstack.org/pipermail/openstack-dev/2016-Octo
>> ber/105194.html
>>
>> And upstream libvir-list
>>
>> https://www.redhat.com/archives/libvir-list/2016-October/
>> msg00347.html
>>
>>
> Yeah, that's goodness long-term, but OpenStack CI won't benefit from that
> for probably at least a couple of years.


The hypervisor is a (the?) critical component of any cloud deployment.
Objectively, it's bizarre that we expect people to deploy our brand new
code to work round things that were fixed in the hypervisor 2 years ago.

Given that's where we are, though, I agree.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][libvirt] Lets make libvirt's domain XML canonical

2016-09-30 Thread Matthew Booth
On Fri, Sep 30, 2016 at 4:38 PM, Murray, Paul (HP Cloud) <pmur...@hpe.com>
wrote:

>
>
>
>
>
> On 27/09/2016, 18:12, "Daniel P. Berrange" <berra...@redhat.com> wrote:
>
> >On Tue, Sep 27, 2016 at 10:40:34AM -0600, Chris Friesen wrote:
> >> On 09/27/2016 10:17 AM, Matthew Booth wrote:
> >>
> >> > I think we should be able to create a domain, but once created we
> should never
> >> > redefine a domain. We can do adding and removing devices dynamically
> using
> >> > libvirt's apis, secure in the knowledge that libvirt will persist
> this for us.
> >> > When we upgrade the host, libvirt can ensure we don't break guests
> which are on
> >> > it. Evacuate should be pretty much the only reason to start again.
> >>
> >> Sounds interesting.  How would you handle live migration?
> >>
> >> Currently we regenerate the XML file on the destination from the nova
> DB.  I
> >> guess in your proposal we'd need some way of copying the XML file from
> the
> >> source to the dest, and then modifying the appropriate segments to
> adjust
> >> things like CPU/NUMA pinning?
> >
> >Use the flag VIR_MIGRATE_PERSIST_XML and libvirt will write out the
> >new persistent XML on the target host automatically.
>
> We have a problem migrating rescued instances that has a fix in progress
> based
> on regenerating the xml on unrescue, see:
>
>  https://blueprints.launchpad.net/nova/+spec/live-migrate-
> rescued-instances
>
> That might be another case for generating the xml.
>

I thought it was a counter-argument (unless I've misunderstood). If you
migrate the instance as-is without modification, you don't need to worry
about whether it's currently a rescue instance. This problem goes away.

The major complication I can think of is things which genuinely must change
during a migration, for example cpu pinning.

Matt

-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] reason(s) for not booting instances from (bootable) volumes

2016-09-30 Thread Matthew Booth
Nothing architectural that I can think of, we just don't support it.
Rebuild currently takes an image, but I guess it could reasonably take most
of the arguments to spawn. We'd have to think hard about what it meant for
arguments to be present/omitted, i.e. are we specifying new disks, or
redefining old ones, but if the intent could be represented in the api it
could certainly be implemented.

Matt

On Fri, Sep 30, 2016 at 12:32 PM, Tikkanen, Viktor (Nokia - FI/Espoo) <
viktor.tikka...@nokia.com> wrote:

> Hi!
>
> I refer here to my question sent earlier to AskOpenstack:
>
> https://ask.openstack.org/en/question/97425/why-it-is-not-
> possible-to-rebuild-an-instance-from-a-bootable-volume/
>
> Are there some (architectural, performance, security etc.) reasons why
> rebuilding instances from (bootable) volumes is not implemented/supported?
>
> -Viktor
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][libvirt] Lets make libvirt's domain XML canonical

2016-09-27 Thread Matthew Booth
Currently the libvirt driver (mostly) considers the nova db canonical. That
is, we can throw away libvirt's domain XML at any time and recreate it from
Nova. Anywhere that doesn't assume this is a bug, because whatever
direction we choose we don't need 2 different sources of truth. The
thinking behind this is that we should always know what we told libvirt,
and if we lose that information then that's a bug.

This is true to a degree, and it's the reason I proposed the persistent
instance storage metadata spec: we lose track of how we configured an
instance's storage. I realised recently that this isn't the whole story,
though. Libvirt also automatically creates a bunch of state for us which we
didn't specify explicitly. We lose this every time we drop it and recreate.
For example, consider device addressing and ordering:

$ nova boot ...

We tell libvirt to give us a root disk, config disk, and a memballoon
device (amongst other things).

Libvirt assigns pci addresses to all of these things.

$ nova volume-attach ...

We tell libvirt to create a new disk attached to the given volume.

Libvirt assigns it a pci address.

$ nova reboot

We throw away libvirt's domain xml and create a new one from scratch.

Libvirt assigns new addresses for all of these devices.

Before reboot, the device order was: root disk, config disk, memballoon,
volume. After reboot the device order is: root disk, volume, config disk,
memballoon. Not only have all our devices changed address, which makes
Windows sad and paranoid about its licensing, and causes it to offline
volumes under certain circumstances, but our disks have been reordered.

This isn't all we've thrown away, though. Libvirt also gave us a default
machine type. When we create a new domain we'll get a new default machine
type. If libvirt has been upgraded, eg during host maintenance, this isn't
necessarily what it was before. Again, this can make our guests sad. Same
goes for CPU model, default devices, and probably many more things I
haven't thought of.

Also... we lost the storage configuration of the guest: the information I
propose to persist in persistent instance storage metadata.

We could store all of this information in Nova, but with the possible
exception of storage metadata it really isn't at the level of 'management':
it's the minutia of the hypervisor. In order to persist all of these things
in Nova we'd have to implement them explicitly, and when libvirt/kvm grows
more stuff we'll have to do that too. We'll need to mirror the
functionality of libvirt in Nova, feature for feature. This is a red flag
for me, and I think it means we should switch to libvirt being canonical.

I think we should be able to create a domain, but once created we should
never redefine a domain. We can do adding and removing devices dynamically
using libvirt's apis, secure in the knowledge that libvirt will persist
this for us. When we upgrade the host, libvirt can ensure we don't break
guests which are on it. Evacuate should be pretty much the only reason to
start again.
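
As a minimal sketch of what that looks like with the libvirt python
bindings (illustrative only; the XML, paths and names are made up):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')

    disk_xml = """
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/nova/instances/UUID/disk.eph0'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
    """

    # Attach to both the running domain and its persistent definition, so
    # libvirt owns the persisted configuration rather than nova rewriting
    # the whole domain XML.
    dom.attachDeviceFlags(
        disk_xml,
        libvirt.VIR_DOMAIN_AFFECT_LIVE | libvirt.VIR_DOMAIN_AFFECT_CONFIG)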

This would potentially obsolete my persistent instance metadata spec, and
the libvirt stable rescue spec, as well as this one:
https://review.openstack.org/#/c/347161/ .

I raised this in the live migration sub-team meeting, and the immediate
response was understandably conservative. I think this solves more problems
than it creates, though, and it would result in Nova's libvirt driver
getting a bit smaller and a bit simpler. That's a big win in my book.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] TL; DR A proposed subsystem maintainer model

2016-09-20 Thread Matthew Booth
* +A is reserved for cores.

* A subsystem maintainer's domain need not be defined by the directory
tree: subteams are a good initial model.

* A maintainer should only +2 code in their own domain in good faith,
enforced socially, not technically.

* Subsystem maintainer is expected to request specific additional +2s for
patches touching hot areas, eg rpc, db, api.
  * Hot areas are also good candidate domains for subsystem maintainers.
  * Hot area review need not cover the whole patch if it's not required: I
am +2 on the DB change in this patch.

This model means that code with +2 from a maintainer only requires a single
+2 from a core.

We could implement this incrementally by defining a couple of pilot
subsystem maintainer domains.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [NOVA] why it take long time between the unplug_vif and update port host binding while doing live migrate VM

2016-09-19 Thread Matthew Booth
I've been working on a related problem, and have also noticed this delay.
These calls happen on separate compute hosts, and lots of things happen in
between. Although the goal of my patch isn't to address the issue you
describe, I have noticed that this patch significantly reduces it anyway:

https://review.openstack.org/#/c/369423/

I'd be interested to know if you see the same in your environment. As an
unmerged RPC change, obviously don't go putting this in production!

Matt

On Thu, Sep 8, 2016 at 10:21 AM, Jingting <kangjingt...@sina.com> wrote:

> Hi All:
>
> As [1] and [2] show, at the last stage of VM live migration, nova calls
> unplug_vifs(), and then migrate_instance_finish(), in which update_port
> is called immediately.
>
> I think the two steps should be executed without interruption, but in my
> test there are at least 3 seconds between the vif unplug and update_port.
>
> As I want to do something with my controller at these two points, I have
> tested it many times, and the gap remains about 3 seconds.
>
> Any comments will be appreciated! Thanks!
>
>
> [1] http://paste.openstack.org/show/198298/
>
> [2] http://paste.openstack.org/show/568983/
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>


-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] Reliable way to filter CI in gerrit spam?

2016-08-31 Thread Matthew Booth
On Wed, Aug 31, 2016 at 5:31 PM, Jeremy Stanley <fu...@yuggoth.org> wrote:

> On 2016-08-31 16:13:16 +0200 (+0200), Jordan Pittier wrote:
> > Most (all?) messages from CI have the lines:
> >
> > "Patch Set X:
> > Build (succeeded|failed)."
> >
> > Not super robust, but that's a start.
>
> Also we have naming conventions for third-party CI accounts that
> suggest they should end in " CI" so you could match on that.
>

Yeah, all except 'Jenkins' :) This makes me a bit nervous because all mail
still comes from 'rev...@openstack.org', with the name component set to the
name of the CI. I was nervous of false positives, so I chose to name them
all in full.
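
For concreteness, the kind of match being discussed would look something
like this (an illustrative Python sketch only; my actual filter just
enumerates the known CI names):

    import email.utils
    import re

    def looks_like_ci_mail(msg):
        # Third-party CI accounts conventionally have display names ending
        # in " CI" (Jenkins being the notable exception), and their comments
        # contain a "Build succeeded." / "Build failed." line.
        name, _addr = email.utils.parseaddr(msg.get('From', ''))
        if name == 'Jenkins' or name.endswith(' CI'):
            return True
        body = msg.get_payload(decode=True) or b''
        return re.search(rb'^Build (succeeded|failed)\.', body, re.M) is not None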


>
> Further, we have configuration in Gerrit to not send E-mail for
> comments from accounts in
> https://review.openstack.org/#/admin/groups/270,members so if you
> are seeing E-mail from Gerrit for third-party CI system comments see
> whether they're in that group already (in which case let the Infra
> team know because we might have a bug to look into) or ask one of
> the members of
> https://review.openstack.org/#/admin/groups/440,members to add the
> stragglers (or even volunteer to become part of that coordinators
> group and help maintain the list).
>

This sounds interesting. All the CIs I get gerrit spam from are on that
list except Jenkins. Do I have to enable something specifically to exclude
them? I mashed all the links I could find seemingly related to gerrit
settings and I couldn't find anything which looked promising.

Thanks again,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [infra] Reliable way to filter CI in gerrit spam?

2016-08-31 Thread Matthew Booth
I've just (re-)written an email filter which splits gerrit emails into
those from CI and those from real people. In general I'm almost never
interested in bot mail, but it comprises about 80% of gerrit email.

Having looked carefully through some gerrit emails from real people and
CIs, I unfortunately can't find any features which distinguish the CI.
Consequently my filter is just a big enumeration of current known CIs. This
is kinda ugly, and will obviously get out of date.

Is there anything I missed? Or is it possible to unsubscribe from gerrit
mail from bots? Or is there any other good way to achieve what I'm looking
for which doesn't involve maintaining my own bot list? If not, would it be
feasible to add something?

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Overview of the libvirt instance storage series

2016-07-22 Thread Matthew Booth
The key patches in the series are listed here, in order:

* libvirt: Improve mocking of imagebackend disks
  https://review.openstack.org/#/c/333242/

This provides a single, common way for driver tests to mock imagebackend.
There were a few existing approaches to this which had various deficiencies.
This new one replaces them all, and allows easy inspection by driver of what
went on. The series relies on this interface a lot, which is why I highlight
it here.

* libvirt: Add create_from_image and create_from_func to Backend
  https://review.openstack.org/#/c/333244/

This is the requested 'shim' layer. It updates the driver to call the new
interfaces, and implements these new interfaces in terms of the old
interfaces.
Note that we don't make any significant changes to existing tests in this
change, which validates that the shim layer continues to call the old
interface in the same way as before.

The immediately following change updates the tests to use the new
interfaces.

* libvirt: Add DiskFromImage and DiskFromFunc
  https://review.openstack.org/#/c/333522/

This is the first part of the change to deconstruct _create_image(). These
interfaces allow us to return a 'creatable' disk, which has not actually been
created. The caller can then decide whether the disk should be created.

Over the next few changes we create some helper methods which return
iterators over creatable disks.

* libvirt: Don't call _create_image from finish_migration
  https://review.openstack.org/#/c/337160/

This removes the problematic caller of _create_image. finish_migration
doesn't create disks, so supporting it with _create_image and Image.cache()
was tortured. The new method is explicit about what needs to happen, and to
which disks.

* libvirt: Replace _create_images_and_backing in pre_live_migration
  https://review.openstack.org/#/c/342224/

This is a big change to pre_live_migration. The purpose is to ensure that
disks created for live migration are created identically to how they were
originally created.

* libvirt: Add create_from_image & create_from_func for Qcow2
  https://review.openstack.org/#/c/320610/

The native implementation of the new backend interfaces for the Qcow2
backend.

I'd like to apologise in advance for dropping this immediately before heading
out on holiday for a week. I will be back at work on Monday 1st August, and I
obviously won't be able to respond to review feedback before then. In my
absence, Diana Clarke may be able to respond.

This is a big series, though, so I'd greatly appreciate eyes on all of it,
even if some of the earlier patches receive -1s. If we could knock off even
some of the earlier test patches, the patch bombs I keep dropping on gerrit
will get a bit smaller ;)

Thanks,

Matt

[1] https://blueprints.launchpad.net/nova/+spec/libvirt-instance-storage
[2] Flat, Qcow2, Lvm, Rbd, Ploop
[3] For recent examples see stable libvirt rescue, and device tagging.
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] libvirt driver: who should create the libvirt.xml file?

2016-06-22 Thread Matthew Booth
On Mon, Jun 20, 2016 at 4:47 PM, Markus Zoeller <mzoel...@linux.vnet.ibm.com
> wrote:

> White working on the change series to implement the virtlogd feature I
> got feedback [1] to move code which creates parts of the libvirt.xml
> file from the "driver" module into the "guest" module. I'm a bit
> hesitant to do so as the responsibility of creating a valid libvirt.xml
> file is then spread across 3 modules:
> * driver.py
> * guest.py
> * designer.py

I'm only looking for a guideline here (The "driver" module is humongous
> and I think it would be a good thing to have the "libvirt.xml" creation
> code outside of it). Thoughts?
>

I'm also not a huge fan of the division of code for generating libvirt info
for the different backends in imagebackend. I'm not familiar enough with the
xml generation code to have a firm opinion on how it should be.

My concern looking at [1] is that to the greatest possible extent libvirt
should be a data sink, not a data source. That is: Nova is the canonical
source of truth. In terms of configuration, there should be nothing that
libvirt knows that Nova didn't tell it first, so it should never be
necessary to ask for it. The only reason this might not be true would be
legacy instances, and we should be fixing them up.

So, eg, guest.get_path() shouldn't exist.

Matt


>
> References:
> [1]
> https://review.openstack.org/#/c/323761/2/nova/virt/libvirt/driver.py@4190
>
> --
> Regards, Markus Zoeller (markus_z)
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A primer on data structures used by Nova to represent block devices

2016-06-16 Thread Matthew Booth
On Thu, Jun 16, 2016 at 4:20 PM, Kashyap Chamarthy <kcham...@redhat.com>
wrote:
[...]

> > BlockDeviceMapping
> > ===
> >
> > The 'top level' data structure is the block device mapping object. It is
> a
> > NovaObject, persisted in the db. Current code creates a BDM object for
> > every disk associated with an instance, whether it is a volume or not. I
> > can't confirm (or deny) that this has always been the case, though, so
> > there may be instances which still exist which have some BDMs missing.
> >
> > The BDM object describes properties of each disk as specified by the
> user.
> > It is initially created by the user and passed to compute api. Compute
> api
> > transforms and consolidates all BDMs to ensure that all disks, explicit
> or
> > implicit, have a BDM, then persists them.
>
> What could be an example of an "implicit disk"?
>

If the flavor defines an ephemeral disk which the user did not specify
explicitly, it will be added. Possibly others, I'm not looking at that code
right now.


>
> > Look in nova.objects.block_device
> > for all BDM fields, but in essence they contain information like
> > (source_type='image', destination_type='local', image_id='),
> > or equivalents describing ephemeral disks, swap disks or volumes, and
> some
> > associated data.
> >
> > Reader note: BDM objects are typically stored in variables called 'bdm'
> > with lists in 'bdms', although this is obviously not guaranteed (and
> > unfortunately not always true: bdm in libvirt.block_device is usually a
> > DriverBlockDevice object). This is a useful reading aid (except when it's
> > proactively confounding), as there is also something else typically
> called
> > 'block_device_mapping' which is not a BlockDeviceMapping object.
>
> [...]
>
> > instance_disk_info
> > =
> >
> > The driver api defines a method get_instance_disk_info, which returns a
> > json blob. The compute manager calls this and passes the data over rpc
> > between calls without ever looking at it. This is driver-specific opaque
> > data. It is also only used by the libvirt driver, despite being part of
> the
> > api for all drivers. Other drivers do not return any data. The most
> > interesting aspect of instance_disk_info is that it is generated from the
> > libvirt XML, not from nova's state.
> >
> > Reader beware: instance_disk_info is often named 'disk_info' in code,
> which
> > is unfortunate as this clashes with the normal naming of the next
> > structure. Occasionally the two are used in the same block of code.
> >
> > instance_disk_info is a list of dicts for some of an instance's disks.
>
> The above sentence reads a little awkward (maybe it's just me), might
> want to rephrase it if you're submitting it as a Gerrit change.
>

Yeah. I think that's a case of re-editing followed by inadequate proof
reading.


> While reading this section, among other places, I was looking at:
> _get_instance_disk_info() ("Get the non-volume disk information from the
> domain xml") from nova/virt/libvirt/driver.py.
>

non-volume or Rbd ;) I've become a bit cautious about such docstrings: they
aren't always correct :/


>
> > Reader beware: Rbd disks (including non-volume disks) and cinder volumes
> > are not included in instance_disk_info.
> >
> > The dicts are:
> >
> >   {
> > 'type': libvirt's notion of the disk's type
> > 'path': libvirt's notion of the disk's path
> > 'virt_disk_size': The disk's virtual size in bytes (the size the
> guest
> > OS sees)
> > 'backing_file': libvirt's notion of the backing file path
> > 'disk_size': The file size of path, in bytes.
> > 'over_committed_disk_size': As-yet-unallocated disk size, in bytes.
> >   }
> >
> > disk_info
> > ===
> >
> > Reader beware: as opposed to instance_disk_info, which is frequently
> called
> > disk_info.
> >
> > This data structure is actually described pretty well in the comment
> block
> > at the top of libvirt/blockinfo.py. It is internal to the libvirt driver.
> > It contains:
> >
> >   {
> >     'disk_bus': the default bus used by disks
> > 'cdrom_bus': the default bus used by cdrom drives
> > 'mapping': defined below
> >   }
> >
> > 'mapping' is a dict which maps disk names to a dict describing how that
> > disk should be passed to libvirt. This mapping contains every disk
> > connected to the instance, both local and volumes.
>
> Worth updating exising defintion of 'mapping' in
> nova/virt/libvirt/blockinfo.py with your above clearer description
> above?
>

Indubitably.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] A primer on data structures used by Nova to represent block devices

2016-06-16 Thread Matthew Booth
Reader beware: instance_disk_info is often named 'disk_info' in code, which
is unfortunate as this clashes with the normal naming of the next
structure. Occasionally the two are used in the same block of code.

instance_disk_info is a list of dicts, one per disk, covering only a subset
of an instance's disks.

Reader beware: Rbd disks (including non-volume disks) and cinder volumes
are not included in instance_disk_info.

The dicts are:

  {
'type': libvirt's notion of the disk's type
'path': libvirt's notion of the disk's path
'virt_disk_size': The disk's virtual size in bytes (the size the guest
OS sees)
'backing_file': libvirt's notion of the backing file path
'disk_size': The file size of path, in bytes.
'over_committed_disk_size': As-yet-unallocated disk size, in bytes.
  }

disk_info
===

Reader beware: as opposed to instance_disk_info, which is frequently called
disk_info.

This data structure is actually described pretty well in the comment block
at the top of libvirt/blockinfo.py. It is internal to the libvirt driver.
It contains:

  {
'disk_bus': the default bus used by disks
'cdrom_bus': the default bus used by cdrom drives
'mapping': defined below
  }

'mapping' is a dict which maps disk names to a dict describing how that
disk should be passed to libvirt. This mapping contains every disk
connected to the instance, both local and volumes.

First, a note on disk naming. Local disk names used by the libvirt driver
are well defined. They are:

'disk': the root disk
'disk.local': the flavor-defined ephemeral disk
'disk.ephX': where X is a zero-based index for bdm-defined ephemeral disks.
'disk.swap': the swap disk
'disk.config': the config disk

These names are hardcoded, reliable, and used in lots of places.

In disk_info, volumes are keyed by device name, eg 'vda', 'vdb'. Different
busses will be named differently, approximately according to legacy Linux
device naming.

Additionally, disk_info will contain a mapping for 'root', which is the
root disk. This will duplicate one of the other entries, either 'disk' or a
volume mapping.

The information for each disk is:

  {
    'bus': the bus for this disk
    'dev': the device name for this disk as known to libvirt
    'type': A type from the BlockDeviceType enum ('disk', 'cdrom',
            'floppy', 'fs', or 'lun')

    === keys below are optional, and may not be present
    'format': Used to format swap/ephemeral disks before passing to
              instance (e.g. 'swap', 'ext4')
    'boot_index': the 1-based boot index of the disk.
  }

Reader beware: BlockDeviceMapping and DriverBlockDevice store boot index
zero-based. However, libvirt's boot index is 1-based, so the value stored
here is 1-based.
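
To make the above concrete, here's an illustrative (made up) disk_info for
an instance with a local qcow2 root disk, a flavor ephemeral disk and one
attached volume; only the shape is meaningful, not the exact values:

  {
    'disk_bus': 'virtio',
    'cdrom_bus': 'ide',
    'mapping': {
      'root': {'bus': 'virtio', 'dev': 'vda', 'type': 'disk',
               'boot_index': '1'},
      'disk': {'bus': 'virtio', 'dev': 'vda', 'type': 'disk',
               'boot_index': '1'},
      'disk.local': {'bus': 'virtio', 'dev': 'vdb', 'type': 'disk',
                     'format': 'ext4'},
      'vdc': {'bus': 'virtio', 'dev': 'vdc', 'type': 'disk'},
    }
  }

Note how 'root' duplicates the 'disk' entry, and how the volume is keyed by
its device name ('vdc').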

Cleanups
===

https://review.openstack.org/329366 rename libvirt has_default_ephemeral
https://review.openstack.org/329927 Remove virt.block_device._NoLegacy
exception
https://review.openstack.org/329928 Remove unused context argument to
_default_block_device_names()
https://review.openstack.org/329930 Rename compute manager _check_dev_name
to _add_missing_dev_names
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] theoretical race between live migration and resource audit?

2016-06-10 Thread Matthew Booth
Yes, this is a race.

However, it's my understanding that this is 'ok'. The resource tracker
doesn't claim to be 100% accurate at all times, right? Otherwise why would
it update itself in a periodic task in the first place? It's my understanding
that the resource tracker is basically a best effort cache, and that
scheduling decisions can still fail at the host. The resource tracker will
fix itself next time it runs via its periodic task.

Matt (not a scheduler person)

On Thu, Jun 9, 2016 at 10:41 PM, Chris Friesen <chris.frie...@windriver.com>
wrote:

> Hi,
>
> I'm wondering if we might have a race between live migration and the
> resource audit.  I've included a few people on the receiver list that have
> worked directly with this code in the past.
>
> In _update_available_resource() we have code that looks like this:
>
> instances = objects.InstanceList.get_by_host_and_node()
> self._update_usage_from_instances()
> migrations = objects.MigrationList.get_in_progress_by_host_and_node()
> self._update_usage_from_migrations()
>
>
> In post_live_migration_at_destination() we do this (updating the host and
> node as well as the task state):
> instance.host = self.host
> instance.task_state = None
> instance.node = node_name
> instance.save(expected_task_state=task_states.MIGRATING)
>
>
> And in _post_live_migration() we update the migration status to
> "completed":
> if migrate_data and migrate_data.get('migration'):
> migrate_data['migration'].status = 'completed'
> migrate_data['migration'].save()
>
>
> Both of the latter routines are not serialized by the
> COMPUTE_RESOURCE_SEMAPHORE, so they can race relative to the code in
> _update_available_resource().
>
>
> I'm wondering if we can have a situation like this:
>
> 1) migration in progress
> 2) We start running _update_available_resource() on destination, and we
> call instances = objects.InstanceList.get_by_host_and_node().  This will
> not return the migration, because it is not yet on the destination host.
> 3) The migration completes and we call
> post_live_migration_at_destination(), which sets the host/node/task_state
> on the instance.
> 4) In _update_available_resource() on destination, we call migrations =
> objects.MigrationList.get_in_progress_by_host_and_node().  This will return
> the migration for the instance in question, but when we run
> self._update_usage_from_migrations() the uuid will not be in "instances"
> and so we will use the instance from the newly-queried migration.  We will
> then ignore the instance because it is not in a "migrating" state.
>
> Am I imagining things, or is there a race here?  If so, the negative
> effects would be that the resources of the migrating instance would be
> "lost", allowing a newly-scheduled instance to claim the same resources
> (PCI devices, pinned CPUs, etc.)
>
> Chris
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack-operators] [nova] Is verification of images in the image cache necessary?

2016-05-24 Thread Matthew Booth
On Tue, May 24, 2016 at 1:15 PM, Fichter, Dane G. <dane.fich...@jhuapl.edu>
wrote:

> Hi John and Matt,
>
> I actually have a spec and patch up for review addressing some of what
> you’re referring to below.
>
> https://review.openstack.org/#/c/314222/
> https://review.openstack.org/#/c/312210/
>
> I think you’re quite right that the existing ImageCacheManager code serves
> little purpose. What I propose here is a cryptographically stronger
> verification meant to protect against both deliberate modification by an
> adversary, as well as accidental sources of disk corruption. If you like, I
> can deprecate the checksum-based verification code in the image cache as a
> part of this change. Feel free me to email me back or ping me on IRC
> (dane-fichter) in order to discuss more.
>

Thanks Dane, reviewed. I don't think the details are right yet, but I do
think this is the way to go. I also think we need to entirely divorce this
functionality from the image cache.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack-operators] [nova] Is verification of images in the image cache necessary?

2016-05-24 Thread Matthew Booth
On Tue, May 24, 2016 at 11:06 AM, John Garbutt <j...@johngarbutt.com> wrote:

> On 24 May 2016 at 10:16, Matthew Booth <mbo...@redhat.com> wrote:
> > During its periodic task, ImageCacheManager does a checksum of every
> image
> > in the cache. It verifies this checksum against a previously stored
> value,
> > or creates that value if it doesn't already exist.[1] Based on this
> > information it generates a log message if the image is corrupt, but
> > otherwise takes no action. Going by git, this has been the case since
> 2012.
> >
> > The commit which added it was associated with 'blueprint
> > nova-image-cache-management phase 1'. I can't find this blueprint, but I
> did
> > find this page:
> https://wiki.openstack.org/wiki/Nova-image-cache-management
> > . This talks about 'detecting images which are corrupt'. It doesn't
> explain
> > why we would want to do that, though. It also doesn't seem to have been
> > followed through in the last 4 years, suggesting that nobody's really
> that
> > bothered.
> >
> > I understand that corruption of bits on disks is a thing, but it's a
> thing
> > for more than just the image cache. I feel that this is a problem much
> > better solved at other layers, prime candidates being the block and
> > filesystem layers. There are existing robust solutions to bitrot at both
> of
> > these layers which would cover all aspects of data corruption, not just
> this
> > randomly selected slice.
>
> +1
>
> That might mean improved docs on the need to configure such a thing.
>
> > As it stands, I think this code is regularly running a pretty expensive
> task
> > looking for something which will very rarely happen, only to generate a
> log
> > message which nobody is looking for. And it could be solved better in
> other
> > ways. Would anybody be sad if I deleted it?
>
> For completeness, we need to deprecate it using the usual cycles:
>
> https://governance.openstack.org/reference/tags/assert_follows-standard-deprecation.html


I guess I'm arguing that it isn't a feature, and never has been: it really
doesn't do anything at all except generate a log message. Are log messages
part of the deprecation contract?

If operators are genuinely finding corrupt images to be a problem and find
this log message useful that would be extremely useful to know.


> I like the idea of checking the md5 matches before each boot, as it
> mirrors the check we do after downloading from glance. Its possible
> thats very unlikely to spot anything that shouldn't already be worried
> about by something else. It may just be my love of symmetry that makes
> me like that idea?
>

It just feels arbitrary to me for a few reasons. Firstly, it's only
relevant to storage schemes which use the file in the image cache as a
backing file. In this libvirt driver, this is just the qcow2 backend. While
this is the default, most users are actually using ceph. Assuming it isn't
cloning it directly from ceph-backed glance, the Rbd backend imports from
the image cache during spawn, and has nothing to do with it thereafter. So
for Rbd we'd want to check during spawn. Same for the Flat, Lvm and Ploop
backends.

Except that it's still arbitrary because we're not checking the Qcow
overlay on each boot. Or ephemeral or swap disks. Or Lvm, Flat or Rbd disks
at all. Or the operating system. And it's still expensive, and better done
by the block or filesystem layer.

I'm not personally convinced there's all that much point checking during
download either, but given that we're loading all the bits anyway that
check is essentially free. However, even if we decided we needed to defend
the system against bitrot above the block/filesystem layer (and I'm not at
all convinced of that) we'd want a coordinated design for it. Without one,
we risk implementing a bunch of disconnected/incomplete stuff that doesn't
meet anybody's needs, but burns resources anyway.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Is verification of images in the image cache necessary?

2016-05-24 Thread Matthew Booth
During its periodic task, ImageCacheManager does a checksum of every image
in the cache. It verifies this checksum against a previously stored value,
or creates that value if it doesn't already exist.[1] Based on this
information it generates a log message if the image is corrupt, but
otherwise takes no action. Going by git, this has been the case since 2012.

The commit which added it was associated with 'blueprint
nova-image-cache-management phase 1'. I can't find this blueprint, but I
did find this page:
https://wiki.openstack.org/wiki/Nova-image-cache-management . This talks
about 'detecting images which are corrupt'. It doesn't explain why we would
want to do that, though. It also doesn't seem to have been followed through
in the last 4 years, suggesting that nobody's really that bothered.

I understand that corruption of bits on disks is a thing, but it's a thing
for more than just the image cache. I feel that this is a problem much
better solved at other layers, prime candidates being the block and
filesystem layers. There are existing robust solutions to bitrot at both of
these layers which would cover all aspects of data corruption, not just
this randomly selected slice.

As it stands, I think this code is regularly running a pretty expensive
task looking for something which will very rarely happen, only to generate
a log message which nobody is looking for. And it could be solved better in
other ways. Would anybody be sad if I deleted it?

Matt

[1] Incidentally, there also seems to be a bug in this implementation, in
that it doesn't hold the lock on the image itself at any point during the
hashing process, meaning that it cannot guarantee that the image has
finished downloading yet.
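
For anyone who hasn't read the code, the check it performs is roughly the
following (a minimal sketch only, not the actual ImageCacheManager
implementation; the sidecar file name and the hash choice are assumptions):

  import hashlib
  import json
  import os

  def verify_cached_image(path):
      # Hash the whole image and compare against a previously stored
      # value, creating that value if it doesn't already exist.
      stored = path + '.sha256'   # sidecar name is an assumption
      h = hashlib.sha256()
      with open(path, 'rb') as f:
          for chunk in iter(lambda: f.read(1 << 20), b''):
              h.update(chunk)     # reads the entire file: this is the cost
      digest = h.hexdigest()
      if not os.path.exists(stored):
          with open(stored, 'w') as f:
              json.dump({'sha256': digest}, f)
          return True
      with open(stored) as f:
          return json.load(f).get('sha256') == digest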
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][live migration] Request to merge initial storage pool patches

2016-05-05 Thread Matthew Booth
I mentioned in the meeting last Tuesday that there are now 2 of us working
on the persistent storage metadata patches: myself and Diana Clarke. I've
also been talking to Paul Carlton today trying to work out how he can get
moving with the subsequent libvirt storage pools work without a huge amount
of conflict. We're going to work that out, but something which would help
us enormously while we work together is to knock off some semi-dependent
patches at the beginning of the series.

Specifically there are 5 patches which precede the series. None of these 5
patches are really part of the series, but the series depends on all 5 to a
greater or lesser degree. In order, they are:

Only attempt to inject files if the injection disk exists
https://review.openstack.org/#/c/250872/

Remove fake_imagebackend.Raw and cleanup dependent tests
https://review.openstack.org/#/c/267661/

Rename Image.check_image_exists to Image.exists()
https://review.openstack.org/#/c/270998/

Rename Raw backend to NoBacking
https://review.openstack.org/#/c/279626/

Remove deprecated option libvirt.remove_unused_kernels
https://review.openstack.org/#/c/265886/

These have all been reviewed previously and attracted +1s. I just rebased
them, so the +1s may have been lost, but I'm pretty sure they're ready for
prime time. These patches aren't really what the series is about, but they
are currently the principal source of merge conflicts. If we could get
these out of the way it would make it much easier for the 3 of us working
on the substantive series. Any chance of some core reviewer attention to
get these moving?

As for the status of the rest of the series, there are 2 more which I don't
expect to change:

Add a lock() context manager to image backend
https://review.openstack.org/#/c/279625/

Introduce ImageCacheLocalPool
https://review.openstack.org/#/c/279669/

Please review these too. However, these patches aren't ready to be merged
because they don't add users of the interfaces they introduce.

Everything after that is definitely changing. Diana and I are currently
working on cranking through these backend by backend. I'll provide a weekly
progress update in the live migration meeting.

TL;DR Core reviewers: please review the first 5 patches listed above. There
will be cake.

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Should be instance_dir in all nova compute node same ?

2016-05-03 Thread Matthew Booth
On Fri, Apr 29, 2016 at 2:47 AM, Eli Qiao <liyong.q...@intel.com> wrote:

> hi team,
>
> Is there any require that all compute node's instance_dir should be same?
>

Yes. This is assumed in many places, certainly in cold migration/resize.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][live-migration] Libvirt storage pools and persistent disk metadata specs

2016-04-06 Thread Matthew Booth
I've just submitted a new spec for the imagebackend work I was doing in
Mitaka, which is a prerequisite for the libvirt storage pools work:

https://review.openstack.org/#/c/302117/

I was just about to resubmit libvirt storage pools with my spec as a
dependency, but re-reading it perhaps we should discuss it briefly first.
I'm aware that the driving goal for this move is to enable data transfer
over the libvirt pipe. However, while we're messing with the on-disk layout
it might be worth thinking about the image cache.

The current implementation of the image cache is certainly problematic. It
has expanded far beyond its original brief, and it shows. The code has no
coherent structure, which makes following it very difficult even to those
familiar with it. Worst of all, though, the behaviour of the image cache is
distributed across several different modules, with no obvious links between
those modules for the unwary (aka tight coupling). The image cache relies
on locking for correct behaviour, but this locking is also distributed
across many modules. Verifying the correct behaviour of the image cache's
locking is hard enough to be impractical, which shows in the persistent
stream of bugs we see relating to backing files going missing. In short,
the image cache implementation needs an extreme overhaul anyway in the
light of its current usage.

We also need to address the problem that the image cache doesn't store any
metadata about its images. We currently determine the file format of an
image cache entry by inspection. While we sanity check images when writing
them to the image cache, this is not a robust defence against format bugs
and vulnerabilities.

More that this, though, the design of the image cache no longer makes
sense. When the image cache was originally implemented, there was only
local file storage, but the libvirt driver also now supports LVM and ceph.
Over 60% of our users use ceph, so ceph is really important. The ceph
backend is often able to use images directly from glance if they're also in
ceph, but when they're not it continues to use this local file store, which
makes no sense.

When we move to libvirt storage pools (with my dependent change), we open
the possibility for users to have multiple local storage pools, and have
instance storage allocated between them according to policy defined in the
flavor. If we're using a common image cache for all storage pools, this
limits the differentiation between those storage pools. So if a user's
paying for instance storage on SSD, but the backing file is on spinning
rust, that's not great.

Logically, the image cache should be a property of the storage backend. So,
local file stores should cache images in the local file store. LVM should
cache images in LVM, which would also allow it to use writeable snapshots.
Ceph should cache images in the same pool its instance disks are in. This
means that the image cache is always directly usable by its storage pool,
which is its purpose.

The image cache also needs a robust design for operating on shared storage.
It currently doesn't have one, although the locking means that it's
hopefully probably only maybe slightly a little racy, with any luck,
perhaps.

This change may be too large to do right now, but we should understand that
changing it later will require a further migration of some description, and
bear that in mind. A local file/lvm/ceph storage pool with an external
image cache has a different implementation and layout to the same storage
pool with an integrated image cache. Data is stored in different places,
and is managed differently. If we don't do it now, it will be slightly
harder later.

Reading through the existing spec, I also notice that it mentions use of
the 'backingStore' element. This needs to come out of the spec, as we MUST
NOT use this. The problem is that it's introspected. I don't know if
libvirt stores this persistently while it's running, but it most certainly
recreates it after a restart or pool refresh. Introspecting file formats
and backing files when the user can influence it is a severe security hole,
so we can't do it. Instead, we need to use the metadata created and defined
by the spec I linked at the top.

There will also need to be a lot more subclassing than this spec
anticipates. Specifically, I expect most backends to require a subclass. In
particular, the libvirt api doesn't provide storage locking, so we will
have to implement that for each backend.

I don't want to spend too long on the spec. The only thing worthy of
discussion is the image cache, I guess.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Migration shared storage negotiation

2016-03-07 Thread Matthew Booth
Timofey has posted a patch which aims to improve shared storage negotiation
a little:

  https://review.openstack.org/#/c/286745/

I've been thinking for some time about going bigger. It occurs to me that
the destination node has enough context to know what should be present. I
think it would be cleaner for the destination to check what it has, and
send the delta back to the source in migration_data. I think that would
remove the need for shared storage negotiation entirely. Of course, we'd
need to be careful about cleaning up after ourselves, but that's already
true.

Thoughts?

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all] A proposal to separate the design summit

2016-02-22 Thread Matthew Booth
+1

Matt

On Mon, Feb 22, 2016 at 3:31 PM, Monty Taylor <mord...@inaugust.com> wrote:

> On 02/22/2016 07:24 AM, Russell Bryant wrote:
>
>>
>>
>> On Mon, Feb 22, 2016 at 10:14 AM, Thierry Carrez <thie...@openstack.org
>> <mailto:thie...@openstack.org>> wrote:
>>
>> Hi everyone,
>>
>> TL;DR: Let's split the events, starting after Barcelona.
>>
>>
>> This proposal sounds fantastic.  Thank you very much to those that help
>> put it together.
>>
>
> Totally agree. I think it's an excellent way to address the concerns and
> balance all of the diverse needs we have.
>
> Thank you very much!
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] sponsor some LVM development

2016-01-22 Thread Matthew Booth
On Tue, Jan 19, 2016 at 8:47 PM, Fox, Kevin M  wrote:

> One feature I think we would like to see that could benefit from LVM is
> some kind of multidisk support with better fault tolerance
>
> For example:
> Say you have a node, and there are 20 vm's on it, and thats all the disk
> io it could take. But say you have spare cpu/ram capacity other then the
> diskio being used up. It would be nice to be able to add a second disk, and
> be able to launch 20 more vm's, located on the other disk.
>
> If you combined them together into one file system (linear append or
> raid0), you could loose all 40 vm's if something went wrong. That may be
> more then you want to risk. If you could keep them as separate file systems
> or logical volumes (maybe with contigous lv's?) Each vm could only top out
> a spindle, but it would be much more fault tolerant to failures on the
> machine. I can see some cases where that tradeoff between individual vm
> performance and number of vm's affected by a device failure can lean in
> that direction.
>
> Thoughts?


This is simple enough for a compute host. However, the real problem is in
the scheduler. The scheduler needs to be aware that there are multiple
distinct resource pools on the host. For example, if your 2 pools have 20G of
disk each, that doesn't mean you can spin up an instance with 30G. This is
the same problem the VMware driver has, and to the best of my knowledge
there's still no solution to that.

 Matt
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] sponsor some LVM development

2016-01-19 Thread Matthew Booth
Hello, Premysl,

I'm not working on these features, however I am working in this area of
code implementing the libvirt storage pools spec. If anybody does start
working on this, please reach out to coordinate as I have a bunch of
related patches. My work should also make your features significantly
easier to implement.

Out of curiosity, can you explain why you want to use LVM specifically over
the file-based backends?

Matt

On Mon, Jan 18, 2016 at 7:49 PM, Premysl Kouril <premysl.kou...@gmail.com>
wrote:

> Hello everybody,
>
> we are a Europe based operator and we have a case for LVM based nova
> instances in our new cloud infrastructure. We are currently
> considering to contribute to OpenStack Nova to implement some features
> which are currently not supported for LVM based instances (they are
> only supported for raw/qcow2 file based instances). As an example of
> such features - nova block live migration or thin provisioning - these
> nowadays don't work with LVM based instances (they do work for file
> based).
>
> Before actually diving into development here internally - we wanted to
> check on possibility to actually sponsor this development within
> existing community. So if there is someone who would be interested in
> this work please drop me an email.
>
> Regards,
> Prema
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][stable] Freeze exception for kilo CVE-2015-7548 backports

2016-01-15 Thread Matthew Booth
The following 3 patches fix CVE-2015-7548 Unprivileged api user can access
host data using instance snapshot:

https://review.openstack.org/#/c/264819/
https://review.openstack.org/#/c/264820/
https://review.openstack.org/#/c/264821/

The OSSA is rated critical. The patches have now landed on master and
liberty after some delays in the gate. Given the importance of the fix I
suspect that most/all downstream distributions will have already patched
(certainly Red Hat has), but it would be good to have them in upstream
stable.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] (Live migration track) Libvirt storage pools initial step

2015-12-17 Thread Matthew Booth
I'm off now for the holiday period, so I'm sharing what I'm sitting on.
This represents a bit over a week of staring at code, and a couple of days
of actual keyboard banging. It doesn't run. Details are in the commit
message, including where I think I'm going with this. Read the commit
message like a slightly more detailed spec.

  https://review.openstack.org/259148

I have a reasonable idea of where I want to go with this, but finding a
consumable set of steps to get there is challenging. The current code makes
lots of assumptions about storage in lots of places. Consequently, adding
to it without regressing is becoming increasingly difficult. Making the
wholesale change to libvirt storage pools, supporting multiple layouts in
the transition period, would require updating these assumptions everywhere
they exist in the code. My initial goal is to centralise these assumptions
whilst making minimal/no functional changes. Once all the code which
assumes on-disk layout is in one place, we can be more confident in making
changes to it, and more easily validate it. We can also more easily see
what the current assumptions are. This will also make it easier for people
to continue adding storage-related features to the libvirt driver.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] What things do we want to get into a python-novaclient 3.0 release?

2015-11-20 Thread Matthew Booth
I wrote this a while back, which implements 'migrate everything off this
compute host' in the most robust manner I could come up with using only the
external api:

https://gist.github.com/mdbooth/163f5fdf47ab45d7addd

It obviously overlaps considerably with host-servers-migrate, which is
supposed to do the same thing. Users seem to have been appreciative, so I'd
be interested to see it merged in some form.

Matt

On Thu, Nov 19, 2015 at 6:18 PM, Matt Riedemann 
wrote:

> We've been talking about doing a 3.0 release for novaclient for awhile so
> we can make some backward incompatible changes, like:
>
> 1. Removing the novaclient.v1_1 module
> 2. Dropping py26 support (if there is any explicit py26 support in there)
>
> What else are people aware of?
>
> Monty was talking about doing a thing with auth:
>
> https://review.openstack.org/#/c/245200/
>
> But it sounds like that is not really needed now?
>
> I'd say let's target mitaka-2 for a 3.0 release and get these flushed out.
>
> --
>
> Thanks,
>
> Matt Riedemann
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo.messaging] State wrapping in the MessageHandlingServer

2015-11-11 Thread Matthew Booth
On Tue, Nov 10, 2015 at 6:46 PM, Joshua Harlow <harlo...@fastmail.com>
wrote:

> Matthew Booth wrote:
>
>> My patch to MessageHandlingServer is currently being reverted because it
>> broke Nova tests:
>>
>> https://review.openstack.org/#/c/235347/
>>
>> Specifically it causes a number of tests to take a very long time to
>> execute, which ultimately results in the total build time limit being
>> exceeded. This is very easy to replicate. The
>> test
>> nova.tests.functional.test_server_group.ServerGroupTest.test_boot_servers_with_affinity
>> is an example test which will always hit this issue. The problem is
>> that ServerGroupTest.setUp() does:
>>
>>  self.compute2 = self.start_service('compute', host='host2')
>>  self.addCleanup(self.compute2.kill)
>>
>> The problem with this is that start_service() adds a fixture which also
>> adds kill as a cleanup method. kill does stop(), wait(). This means that
>> the resulting call order is: start, stop, wait, stop, wait. The
>> redundant call to kill is obviously a wart, but I feel we should have
>> handled it anyway.
>>
>> The problem is that we decided it should be possible to restart a
>> server. There are some unit tests in oslo.messaging that do this. It's
>> not clear to me that there are any projects which do this, but after
>> this experience I feel like it would be good to check before changing it
>> :)
>>
>> The implication of that is that after wait() the state wraps, and we're
>> now waiting on start() again. Consequently, when the second cleanup call
>> hangs.
>>
>> We could fix Nova (at least the usage we have seen) by removing the
>> wrapping. After wait() if you want to start a server again you need to
>> create a new one.
>>
>> So, to be specific, lets consider the following 2 call sequences:
>>
>> 1. start stop wait stop wait
>> 2. start stop wait start stop wait
>>
>> What should they do? The behaviours with and without wrapping are:
>>
>> 1. start stop wait stop wait
>> WRAP: start stop wait HANG HANG
>> NO WRAP: start stop wait NO-OP NO-OP
>>
>> 2. start stop wait start stop wait
>> WRAP: start stop wait start stop wait
>> NO WRAP: start stop wait NO-OP NO-OP NO-OP
>>
>> I'll refresh my memory on what they did before my change in the morning.
>> Perhaps it might be simpler to codify the current behaviour, but iirc I
>> proposed this because it was previously undefined due to races.
>>
>
> I personally prefer not allowing restarting, its needless code complexity
> imho and a feature that people imho probably aren't using anyway (just
> create a new server object if u are doing this), so I'd be fine with doing
> the above NO WRAP and turning those into NO-OPs (and for example raising a
> runtime error in the case of start stop wait start ... to denote that
> restarting isn't recommended/possible). If we have a strong enough reason
> to really to start stop wait start ...
>
> I might be convinced the code complexity is worth it but for now I'm not
> convinced...
>

I agree, and in the hopefully unlikely event that we did break anybody, at
least they would get an obvious exception rather than a hang. A lesson from
breaking nova was that the log messages were generated and were available
in the failed test runs, but nobody noticed them.

Incidentally, I think I'd also merge my second patch into the first before
resubmitting, which adds timeouts and the option not to log.

Matt
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [oslo.messaging] State wrapping in the MessageHandlingServer

2015-11-10 Thread Matthew Booth
My patch to MessageHandlingServer is currently being reverted because it
broke Nova tests:

https://review.openstack.org/#/c/235347/

Specifically it causes a number of tests to take a very long time to
execute, which ultimately results in the total build time limit being
exceeded. This is very easy to replicate. The
test 
nova.tests.functional.test_server_group.ServerGroupTest.test_boot_servers_with_affinity
is an example test which will always hit this issue. The problem is
that ServerGroupTest.setUp() does:

self.compute2 = self.start_service('compute', host='host2')
self.addCleanup(self.compute2.kill)

The problem with this is that start_service() adds a fixture which also
adds kill as a cleanup method. kill does stop(), wait(). This means that
the resulting call order is: start, stop, wait, stop, wait. The redundant
call to kill is obviously a wart, but I feel we should have handled it
anyway.

The problem is that we decided it should be possible to restart a server.
There are some unit tests in oslo.messaging that do this. It's not clear to
me that there are any projects which do this, but after this experience I
feel like it would be good to check before changing it :)

The implication of that is that after wait() the state wraps, and we're now
waiting on start() again. Consequently, the second cleanup call hangs.

We could fix Nova (at least the usage we have seen) by removing the
wrapping. After wait() if you want to start a server again you need to
create a new one.

So, to be specific, lets consider the following 2 call sequences:

1. start stop wait stop wait
2. start stop wait start stop wait

What should they do? The behaviours with and without wrapping are:

1. start stop wait stop wait
WRAP: start stop wait HANG HANG
NO WRAP: start stop wait NO-OP NO-OP

2. start stop wait start stop wait
WRAP: start stop wait start stop wait
NO WRAP: start stop wait NO-OP NO-OP NO-OP
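
To illustrate, here's a minimal sketch of the NO WRAP semantics above. This
is nothing like the real MessageHandlingServer internals and the names are
made up; it only shows the state handling:

  import threading

  class Server(object):
      # NO WRAP: once wait() has completed, start(), stop() and wait()
      # all become no-ops rather than wrapping back to a fresh cycle.
      def __init__(self):
          self._finished = False
          self._stop_requested = threading.Event()

      def start(self):
          if self._finished:
              return            # NO-OP after the server has finished
          # ... start listener threads here ...

      def stop(self):
          if self._finished:
              return            # NO-OP
          self._stop_requested.set()

      def wait(self):
          if self._finished:
              return            # NO-OP on a repeated wait()
          self._stop_requested.wait()   # block until stop() is called
          # ... join listener threads here ...
          self._finished = True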

I'll refresh my memory on what they did before my change in the morning.
Perhaps it might be simpler to codify the current behaviour, but iirc I
proposed this because it was previously undefined due to races.

Matt
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] Fwd: [nova][mistral] Automatic evacuation as a long running task

2015-10-19 Thread Matthew Booth
Accidentally sent this privately.

-- Forwarded message --
From: Matthew Booth <mbo...@redhat.com>
Date: Fri, Oct 9, 2015 at 6:14 PM
Subject: Re: [openstack-dev] [nova][mistral] Automatic evacuation as a long
running task
To: "Deja, Dawid" <dawid.d...@intel.com>


On Thu, Oct 8, 2015 at 12:51 PM, Deja, Dawid <dawid.d...@intel.com> wrote:

> Hi Matthew,
>
> Thanks for bringing some light on what problems has nova with evacuation
> of an instance. It is very important to have those limitations in mind when
> preparing final solution. Or to fix them, as you proposed.
>
> Nevertheless, I would say that evacuationD does more than what calling
> 'nova host-evacuate' does. Let's consider such a scenario:
>
> 1. Call 'nova host evacuate HostX'
> 2. Caller dies during call - information that some VMs are still to be
> evacuated is lost.
>

No, it's not lost because the instances still have instance.host == source.
This means that you can (and must, in fact) simply run 'nova host-evacuate'
again if it didn't complete successfully the first time.

Note that an external agent (let's call it pacemaker) must solve exactly the
same problem in order to send a message to evacuated. It must assure itself
that it successfully 'sent a message' at least once, somewhere. Now replace
'sent a message' with 'ran nova host-evacuate'.


> Such thing would not happen with evacuationD, because it prepares one
> rabbitMQ message for each VM that needs to be evacuated. Moreover, it deals
> with the situation when the process that lists VMs crashes. In such a case,
> the whole operation would be continued by another daemon.
>
> EvacD may also handle another problem that you mentioned: failure of
> target host of evacuation. In such a scenario, an 'evacuate host' message
> will be sent for a new host and EvacD will try to evacuate all of its VMs -
> even those in rebuild state. Of course, evacuation of such instances fails,
> but they would eventually enter error state and evacuationD would start the
> resurrection process. This can be sped up by setting instances' state to
> 'error' (despite these which are in 'active' state) at the beginning of the
> whole 'evacuate host' process.
>

Again, this situation is identical to simply running nova host-evacuate.
EvacD doesn't do any monitoring, and requires an external agent (we called
it pacemaker) to invoke it for the newly failed host. In this scenario,
whatever sends 'evacuate host' can instead run 'nova host-evacuate', and
the behaviour is identical.


>
> Finally, another action - called 'Look for VM' - could be added. It would
> check if a given VM ended up in active state on the new host; if not, the
> VM could be rebuilt. I hope this would give us as much certainty that the
> VM is alive as possible.
>

This would add behaviour over nova host-evacuate. However, it would also be
considerably more complex to implement and what's there currently doesn't
add any infrastructure which enables it.

Remember that nova evacuate is not a heavy operation for the caller. It is
literally just a nova api call which returns after kicking off a task in
conductor. Running nova host-evacuate does:

1. List all instances
2. For instance 0, tell nova to initiate evac
3 ...

Running evacD does:
1. List all instances
2. For instance 0, send ourselves a message to initiate evac
... rabbit ...
3. For instance 0, tell nova to initiate evac

In other words, evacD just makes the call chain longer. It adds overhead
and additional potential points of failure. Ironically, this means the
resulting solution will be less robust.

Matt


> On Tue, 2015-10-06 at 16:34 +0100, Matthew Booth wrote:
> Hi, Roman,
>
> Evacuated has been on my radar for a while and this post has prodded me to
> take a look at the code. I think it's worth starting by explaining the
> problems in the current solution. Nova client is currently responsible for
> doing this evacuate. It does:
>
> 1. List all instances on the source host
> 2. Initiate evacuate for each instance
>
> Evacuating a single instance does:
>
> API:
> 1. Set instance task state to rebuilding
> 2. Create a migration record with source and dest if specified
>
> Conductor:
> 3. Call the scheduler to get a destination host if not specified
> 4. Get the migration object from the db
>
> Compute:
> 5. Rebuild the instance on dest
> 6. Update instance.host to dest
>
> Examining single instance evacuation, the first obvious thing to look at
> is what if 2 happen simultaneously. Because step 1 is atomic, it should not
> be possible to initiate 2 evacuations simultaneously of a single instance.
> However, note that this atomic action hasn't updated the instance host,
> meaning the source host remains the owner of this instance. If the
> evacuation process fails to complete, the source host will automatically
> delete it if it comes back up because it will find a migration record, but
> it will not be rebuilt anywhere else.

Re: [openstack-dev] [all] -1 due to line length violation in commit messages

2015-10-07 Thread Matthew Booth
On Fri, Sep 25, 2015 at 3:44 PM, Ihar Hrachyshka 
wrote:

> Hi all,
>
> releases are approaching, so it’s the right time to start some bike
> shedding on the mailing list.
>
> Recently I got pointed out several times [1][2] that I violate our commit
> message requirement [3] for the message lines that says: "Subsequent lines
> should be wrapped at 72 characters.”
>
> I agree that very long commit message lines can be bad, f.e. if they are
> 200+ chars. But <= 79 chars?.. Don’t think so. Especially since we have 79
> chars limit for the code.
>
> We had a check for the line lengths in openstack-dev/hacking before but it
> was killed [4] as per openstack-dev@ discussion [5].
>
> I believe commit message lines of <=80 chars are absolutely fine and
> should not get -1 treatment. I propose to raise the limit for the guideline
> on wiki accordingly.
>

IIUC, the lower limit for commit messages is because git displays them
indented by default, which means that lines which are 80 chars long will
wrap on a display which is 80 chars wide. I personally use terminal windows
which are 80 chars wide, and I do find long lines in commit messages
annoying, so I'm personally in favour of retaining the lower limit. Can't
say I'd storm any castles if it was changed, but if most people use git the
way I do[1] I guess it should stay.

Matt

[1] I have no idea if this is the case.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][mistral] Automatic evacuation as a long running task

2015-10-06 Thread Matthew Booth
Hi, Roman,

Evacuated has been on my radar for a while and this post has prodded me to
take a look at the code. I think it's worth starting by explaining the
problems in the current solution. Nova client is currently responsible for
doing this evacuate. It does:

1. List all instances on the source host
2. Initiate evacuate for each instance

Evacuating a single instance does:

API:
1. Set instance task state to rebuilding
2. Create a migration record with source and dest if specified

Conductor:
3. Call the scheduler to get a destination host if not specified
4. Get the migration object from the db

Compute:
5. Rebuild the instance on dest
6. Update instance.host to dest

Examining single instance evacuation, the first obvious thing to look at is
what if 2 happen simultaneously. Because step 1 is atomic, it should not be
possible to initiate 2 evacuations simultaneously of a single instance.
However, note that this atomic action hasn't updated the instance host,
meaning the source host remains the owner of this instance. If the
evacuation process fails to complete, the source host will automatically
delete it if it comes back up because it will find a migration record, but
it will not be rebuilt anywhere else. Evacuating it again will fail,
because its task state is already rebuilding.

Also, let's imagine that the conductor crashes. There is not enough state
for any tool, whether internal or external, to be able to know if the
rebuild is ongoing somewhere or not, and therefore whether it is safe to
retry even if that retry would succeed, which it wouldn't.

Which is to say that we can't currently robustly evacuate one instance!

Looking at the nova client side, there is an obvious race there: there is
no guarantee in step 2 that instances returned in step 1 have not already
been evacuated by another process. We're protected here, though because
evacuating a single instance twice will fail the second time. Note that the
process isn't idempotent, though, because an evacuation which falls into a
hole will never be retried.

Moving on to what evacuated does. Evacuated uses rabbit to distribute jobs
reliably. There are 2 jobs in evacuated:

1. Evacuate host:
  1.1 Get list of all instances on the source host from Nova
  1.2 Send an evacuate vm job for each instance
2. Evacuate vm:
  2.1 Tell Nova to start evacuating an instance

Because we're using rabbit as a reliable message bus, the initiator of one
of the tasks knows that it will eventually run to completion at least once.
Note that there's nothing to prevent the task being executed more than once
per call, though. A task may crash before sending an ack, or may just be
really slow. However, in both cases, for exactly the same reasons as for
the implementation in nova client, running more than once should not race.
It is still not idempotent, though, again for exactly the same reasons as
nova client.

Also notice that, exactly as in the nova client implementation, we are not
asserting that an instance has been evacuated. We are only asserting that
we called nova.evacuate, which is to say that we got as far as step 2 in
the evacuation sequence above.

In other words, in terms of robustness, calling evacuated's evacuate host
is identical to asserting that nova client's evacuate host ran to
completion at least once, which is quite a lot simpler to do. That's still
not very robust, though: we don't recover from failures, and we don't
ensure that an instance is evacuated, only that we started an attempt to
evacuate at least once. I'm obviously not satisfied with nova client,
however as the implementation is simpler I would favour it over evacuated.

I believe we can solve this problem, but I think that without fixing
single-instance evacuate we're just pushing the problem around (or creating
new places for it to live). I would base the robustness of my
implementation on a single principal:

  An instance has a single owner, which is exclusively responsible for
rebuilding it.

In outline, I would redefine the evacuate process to do:

API:
1. Call the scheduler to get a destination for the evacuate if none was
given.
2. Atomically update instance.host to this destination, and task state to
rebuilding.

Compute:
3. Rebuild the instance.

This would be supported by a periodic task on the compute host which looks
for rebuilding instances assigned to this host which aren't currently
rebuilding, and kicks off a rebuild for them. This would cover the compute
going down during a rebuild, or the api going down before messaging the
compute.
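
In rough pseudo-Python, that periodic task would look something like the
following (the decorator, the object names and both helper methods are
stand-ins for whatever a real patch would use; this is a sketch of the idea,
not a patch):

  @periodic_task.periodic_task
  def _recover_stalled_rebuilds(self, context):
      instances = objects.InstanceList.get_by_host(context, self.host)
      for instance in instances:
          if instance.task_state != task_states.REBUILDING:
              continue
          if self._rebuild_running_locally(instance):   # hypothetical check
              continue
          # The instance is assigned to this host but no rebuild is
          # running here, so kick one off. Repeating this is safe because
          # ownership has already moved atomically.
          self._rebuild_instance(context, instance)     # hypothetical helper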

Implementing this gives us several things:

1. The list instances, evacuate all instances process becomes idempotent,
because as soon as the evacuate is initiated, the instance is removed from
the source host.
2. We get automatic recovery of failure of the target compute. Because we
atomically moved the instance to the target compute immediately, if the
target compute also has to be evacuated, our instance won't fall through
the 

[openstack-dev] [nova] Fine-grained error reporting via the external API

2015-09-11 Thread Matthew Booth
I've recently been writing a tool which uses Nova's external API. This
is my first time consuming this API, so it has involved a certain amount
of discovery. The tool is here for the curious:

  https://gist.github.com/mdbooth/163f5fdf47ab45d7addd

I have felt hamstrung by the general inability to distinguish between
different types of error. For example, if a live migration failed, is it
because:

1. The compute driver doesn't support support it.

2. This instance requires block storage migration.

3. Something ephemeral.

These 3 errors all require different responses:

1. Quit and don't try again.

2. Try again immediately with the block migration argument.[1]

3. Try again in a bit.

However, all I have is that I made a BadRequest. I could potentially
grep the human readable error message, but the text of that message
doesn't form part of the API, and it may be translated in any case. As
an API consumer, it seems I can't really tell anything other than 'it
didn't work'. More than that requires guesswork, heuristics and inference.

I don't think I've missed some source of additional wisdom, but it would
obviously be great if I have. Has there ever been any effort to define
some contract around more fine-grained error reporting?

Thanks,

Matt

[1] Incidentally, this suggests to me that live migrate should just do
this anyway.
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Fine-grained error reporting via the external API

2015-09-11 Thread Matthew Booth
On 11/09/15 12:19, Sean Dague wrote:
> On 09/11/2015 05:41 AM, Matthew Booth wrote:
>> I've recently been writing a tool which uses Nova's external API. This
>> is my first time consuming this API, so it has involved a certain amount
>> of discovery. The tool is here for the curious:
>>
>>   https://gist.github.com/mdbooth/163f5fdf47ab45d7addd
>>
>> I have felt hamstrung by the general inability to distinguish between
>> different types of error. For example, if a live migration failed is it
>> because:
>>
>> 1. The compute driver doesn't support support it.
>>
>> 2. This instance requires block storage migration.
>>
>> 3. Something ephemeral.
>>
>> These 3 errors all require different responses:
>>
>> 1. Quit and don't try again.
>>
>> 2. Try again immediately with the block migration argument.[1]
>>
>> 3. Try again in a bit.
>>
>> However, all I have is that I made a BadRequest. I could potentially
>> grep the human readable error message, but the text of that message
>> doesn't form part of the API, and it may be translated in any case. As
>> an API consumer, it seems I can't really tell anything other than 'it
>> didn't work'. More than that requires guesswork, heuristics and inference.
>>
>> I don't think I've missed some source of additional wisdom, but it would
>> obviously be great if I have. Has there ever been any effort to define
>> some contract around more fine-grained error reporting?
>>
>> Thanks,
>>
>> Matt
>>
>> [1] Incidentally, this suggests to me that live migrate should just do
>> this anyway.
> 
> This is an API working group recommendation evolving here. The crux of
> which is going to be a structured json error return document that will
> contain more info. https://review.openstack.org/#/c/167793/

Thanks, Sean, that's exactly what I was looking for. I'll continue this
discussion in that review.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [all] Script to merge sort log files

2015-07-15 Thread Matthew Booth
I expect there are several existing solutions to this problem, but
here's mine (attached).

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490


log_merge.sh
Description: application/shellscript
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [all] Any other downstream developers having problems with pbr?

2015-06-26 Thread Matthew Booth
I wrote this:

https://review.openstack.org/#/c/195983/1/tools/de-pbr.py,cm

Ideally we'd fix PBR, but this seems to be expected behaviour. Thoughts?

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Unvalidated user input passed to functions

2015-05-15 Thread Matthew Booth
I was looking at the migrations api, and I noticed that the api passes
the request query unchecked to get_migrations, where it ultimately ends
up in a db query. I was curious and spent a couple of hours checking
this morning. There are a few instances of this.

I didn't find any security bugs; however, I feel that this is extremely bad
practice, and is likely to result in a security bug eventually. For
example, note that os-assisted-volume-snapshots:delete does not validate
delete_info before passing it to volume_snapshot_delete. I looked at
this quite carefully, and I think we are only protected from a host
compromise because:

1. The api requires admin context
2. libvirt's security policy

I could be wrong on that, though, so perhaps somebody else could check?

Passing unvalidated input to a function isn't necessarily bad, for
example if it is only used for filtering, but it should be clearly
marked as such so it isn't used in an unsafe manner. This marking should
follow the data as far as it goes through any number of function calls.
libvirt's _volume_snapshot_delete function is a long way from the
originating api call, and it is not at all obvious that the commit_base
and commit_top arguments to virt_dom.blockCommit() are unvalidated.

Does python have anything like perl's taint mode? If so, it might be
worth investigating its use.
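
Short of taint tracking, the minimum improvement would be to validate at the
api boundary and make it obvious that this has happened. A hedged sketch
using jsonschema (the schema below is invented for illustration; the real
delete_info schema would have to be derived from what the libvirt driver
actually consumes):

  import jsonschema

  DELETE_INFO_SCHEMA = {
      'type': 'object',
      'properties': {
          'type': {'type': 'string'},
          'volume_id': {'type': 'string'},
          'file_to_merge': {'type': ['string', 'null']},
          'merge_target_file': {'type': ['string', 'null']},
      },
      'required': ['type', 'volume_id'],
      'additionalProperties': False,
  }

  def validated_delete_info(delete_info):
      # Raises jsonschema.ValidationError on anything unexpected, so
      # nothing unvalidated gets anywhere near the driver.
      jsonschema.validate(delete_info, DELETE_INFO_SCHEMA)
      return delete_info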

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] PTL Candidacy

2015-04-02 Thread Matthew Booth
On 02/04/15 07:20, Michael Still wrote:
 I'd like another term as Nova PTL, if you'll have me.
 

...

 
 I think its a good idea also to examine briefly some statistics about specs:
 
 Juno:
approved but not implemented: 40
implemented: 49
 
 Kilo:
approved but not implemented: 30
implemented: 32

Hi, Michael,

It has been my impression over at least the last 2 releases that the
most significant barrier to progress in Nova is the lack of core
reviewer bandwidth. This affects not only feature development, but also
the much less sexy paying down of technical debt. There have been
various attempts to redefine roles and create new processes, but no
attempt that I have seen to address the underlying fundamental issue:
the lack of people who can +2 patches.

There is a discussion currently ongoing on this list, The Evolution of
core developer to maintainer, which contains a variety of proposals.
However, none of these will gain anything close to a consensus. The
result of this will be that none of them will be implemented. We will be
left by default with the status quo, and the situation will continue not
to improve despite the new processes we will invent instead.

The only way we are going to change the status quo is by fiat. We should
of course make every effort to get as many people on board as possible.
However, when change is required, but nobody can agree precisely which
change, we need positive leadership from the PTL.

Would you like to take a position on how to improve core reviewer
throughput in the next cycle?

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Nova compute will delete all your instances if you change its hostname

2015-02-27 Thread Matthew Booth
Gary Kotton originally posted this bug against the VMware driver:

https://bugs.launchpad.net/nova/+bug/1419785

I posted a proposed patch to fix this here:

https://review.openstack.org/#/c/158269/1

However, Dan Smith pointed out that the bug can actually be triggered
against any driver in a manner not addressed by the above patch alone. I
have confirmed this against a libvirt setup as follows:

1. Create some instances
2. Shutdown n-cpu
3. Change hostname
4. Restart n-cpu

Nova compute will delete all instances in libvirt, but continue to
report them as ACTIVE and Running.

There are 2 parts to this issue:

1. _destroy_evacuated_instances() should do a better job of sanity
checking before performing such a drastic action.

2. The underlying issue is the definition and use of instance.host,
instance.node, compute_node.host and compute_node.hypervisor_hostname.

(1) is belt and braces. It's very important, but I want to focus on (2)
here. Instantly you'll notice some inconsistent naming here, so to clarify:

* instance.host == compute_node.host == Nova compute's 'host' value.
* instance.node == compute_node.hypervisor_hostname == an identifier
which represents a hypervisor.

Architecturally, I'd argue that these mean:

* Host: A Nova communication endpoint for a hypervisor.
* Hypervisor: The physical location of a VM.

Note that in the above case the libvirt driver changed the hypervisor
identifier despite the fact that the hypervisor had not changed, only
its communication endpoint. I propose the following:

* ComputeNode describes 1 hypervisor.
* ComputeNode maps 1 hypervisor to 1 compute host.
* A ComputeNode is identified by a hypervisor_id.
* hypervisor_id represents the physical location of running VMs,
independent of a compute host.

We've renamed compute_node.hypervisor_hostname to
compute_node.hypervisor_id. This resolves some confusion, because it
asserts that the identity of the hypervisor is tied to the data
describing VMs, not the host which is running it. In fact, for the
VMware and Ironic drivers it has never been a hostname.

VMware[1] and Ironic don't require any changes here. Other drivers will
need to be modified so that get_available_nodes() returns a persistent
value rather than just the hostname. A reasonable default implementation
of this would be to write a uuid to a file which lives with VM data and
return its contents. If the hypervisor has a native concept of a
globally unique identifier, that should be used instead.
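
Something like this sketch, say (file name and helper illustrative):

    import os
    import uuid

    def get_persistent_hypervisor_id(state_dir):
        """Return a stable hypervisor id, creating it on first use.

        state_dir is assumed to live on the same storage as the VM data, so
        the id survives a hostname change or a fail-over to another compute
        host.
        """
        id_file = os.path.join(state_dir, 'hypervisor_id')
        try:
            with open(id_file) as f:
                return f.read().strip()
        except IOError:
            hypervisor_id = str(uuid.uuid4())
            with open(id_file, 'w') as f:
                f.write(hypervisor_id)
            return hypervisor_id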

ComputeNode.hypervisor_id is unique. The hypervisor is unique (there is
physically only 1 of it) so it does not make sense to have multiple
representations of it and its associated resources.

An Instance's location is its hypervisor, wherever that may be, so
Instance.host could be removed. This isn't strictly necessary, but it is
redundant as the communication endpoint is available via ComputeNode. If
we wanted to support the possibility of changing a communication
endpoint at some point, it would also make that operation trivial.
Thinking blue sky, it would also open the future possibility for
multiple communication endpoints for a single hypervisor.

There is a data migration issue associated with changing a driver's
reported hypervisor id. The bug linked below fudges it, but if we were
doing it for all drivers I believe it could be handled efficiently by
passing the instance list already collected by ComputeManager.init_host
to the driver at startup.

My proposed patch above fixes a potentially severe issue for users of
the VMware and Ironic drivers. In conjunction with a move to a
persistent hypervisor id for other drivers, it also fixes the related
issue described above across the board. I would like to go forward with
my proposed fix as it has an immediate benefit, and I'm happy to work on
the persistent hypervisor id for other drivers.

Matt

[1] Modulo bugs: https://review.openstack.org/#/c/159481/
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute

2015-02-26 Thread Matthew Booth
On 25/02/15 20:18, Joe Gordon wrote:
 
 
 On Fri, Feb 20, 2015 at 3:48 AM, Matthew Booth mbo...@redhat.com wrote:
 
 Gary Kotton came across a doozy of a bug recently:
 
 https://bugs.launchpad.net/nova/+bug/1419785
 
 In short, when you start a Nova compute, it will query the driver for
 instances and compare that against the expected host of the instance
 according to the DB. If the driver is reporting an instance the DB
 thinks is on a different host, it assumes the instance was evacuated
 while Nova compute was down, and deletes it on the hypervisor. However,
 Gary found that you trigger this when starting up a backup HA node which
 has a different `host` config setting. i.e. You fail over, and the first
 thing it does is delete all your instances.
 
 Gary and I both agree on a couple of things:
 
 1. Deleting all your instances is bad
 2. HA nova compute is highly desirable for some drivers
 
 
 There is a deeper issue here, that we are trying to work around.  Nova
 was never designed to have entire systems running behind a nova-compute.
 It was designed to have one nova-compute per 'physical box that runs
 instances'
 
 There have been many discussions in the past on how to fix this issue
 (by adding a new point in nova where clustered systems can plug in), but
 if I remember correctly the gotcha was no one was willing to step up to
 do it.

There are 2 unrelated concepts of clusters here. The VMware driver has
both, which seems to result in some confusion. As it happens, this issue
doesn't relate to either of them.

Firstly, there's a VMware cluster. This presents itself, and is managed
as one, single hypervisor. The only issue Nova has with VMware clusters
is in resource tracker, because its resources aren't contiguous. i.e.
It's an accounting issue. It would be good to have a solution to this,
but it doesn't seem to be causing many real world problems.

Secondly there's the concept of 'nodes', whereby 1 nova compute can
manage multiple hypervisors. On VMware this means managing multiple
clusters, because 1 VMware cluster == 1 hypervisor. Both the Ironic and
VMware drivers can do this.

This issue relates to the co-location of nova compute with the
hypervisor. In the case of both Ironic and VMware, it is not possible to
co-locate nova compute with the hypervisor. That means that nova compute
must exist separately and be pointed at the hypervisor, which raises the
possibility that 2 different nova computes might accidentally be pointed
at the same hypervisor. As Gary discovered, this makes bad things
happen. Note that no clusters of either kind described above are
required to trigger this bug.

I have a new patch for this here, btw:
https://review.openstack.org/#/c/158269/ . I'd be grateful for more eyes
on it.

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute

2015-02-25 Thread Matthew Booth
On 25/02/15 11:51, Radoslav Gerganov wrote:
 On 02/23/2015 03:18 PM, Matthew Booth wrote:
 On 23/02/15 12:13, Gary Kotton wrote:


 On 2/23/15, 2:05 PM, Matthew Booth mbo...@redhat.com wrote:

 On 20/02/15 11:48, Matthew Booth wrote:
 Gary Kotton came across a doozy of a bug recently:

 https://bugs.launchpad.net/nova/+bug/1419785

 In short, when you start a Nova compute, it will query the driver for
 instances and compare that against the expected host of the
 instance
 according to the DB. If the driver is reporting an instance the DB
 thinks is on a different host, it assumes the instance was evacuated
 while Nova compute was down, and deletes it on the hypervisor.
 However,
 Gary found that you trigger this when starting up a backup HA node
 which
 has a different `host` config setting. i.e. You fail over, and the
 first
 thing it does is delete all your instances.

 Gary and I both agree on a couple of things:

 1. Deleting all your instances is bad
 2. HA nova compute is highly desirable for some drivers

 We disagree on the approach to fixing it, though. Gary posted this:

 https://review.openstack.org/#/c/154029/

 I've already outlined my objections to this approach elsewhere, but to
 summarise I think this fixes 1 symptom of a design problem, and leaves
 the rest untouched. If the value of nova compute's `host` changes,
 then
 the assumption that instances associated with that compute can be
 identified by the value of instance.host becomes invalid. This
 assumption is pervasive, so it breaks a lot of stuff. The worst one is
 _destroy_evacuated_instances(), which Gary found, but if you scan
 nova/compute/manager for the string 'self.host' you'll find lots of
 them. For example, all the periodic tasks are broken, including image
 cache management, and the state of ResourceTracker will be unusual.
 Worse, whenever a new instance is created it will have a different
 value
 of instance.host, so instances running on a single hypervisor will
 become partitioned based on which nova compute was used to create
 them.

 In short, the system may appear to function superficially, but it's
 unsupportable.

 I had an alternative idea. The current assumption is that the `host`
 managing a single hypervisor never changes. If we break that
 assumption,
 we break Nova, so we could assert it at startup and refuse to start if
 it's violated. I posted this VMware-specific POC:

 https://review.openstack.org/#/c/154907/

 However, I think I've had a better idea. Nova creates ComputeNode
 objects for its current configuration at startup which, amongst other
 things, are a map of host:hypervisor_hostname. We could assert when
 creating a ComputeNode that hypervisor_hostname is not already
 associated with a different host, and refuse to start if it is. We
 would
 give an appropriate error message explaining that this is a
 misconfiguration. This would prevent the user from hitting any of the
 associated problems, including the deletion of all their instances.

 I have posted a patch implementing the above for review here:

 https://review.openstack.org/#/c/158269/

 I have to look at what you have posted. I think that this topic is
 something that we should speak about at the summit and this should fall
 under some BP and well defined spec. I really would not like to see
 existing installations being broken if and when this patch lands. It may
 also affect Ironic as it works on the same model.

 This patch will only affect installations configured with multiple
 compute hosts for a single hypervisor. These are already broken, so this
 patch will at least let them know if they haven't already noticed.

 It won't affect Ironic, because they configure all compute hosts to have
 the same 'host' value. An Ironic user would only notice this patch if
 they accidentally misconfigured it, which is the intended behaviour.

 Incidentally, I also support more focus on the design here. Until we
 come up with a better design, though, we need to do our best to prevent
 non-trivial corruption from a trivial misconfiguration. I think we need
 to merge this, or something like it, now and still have a summit
 discussion.

 Matt

 
 Hi Matt,
 
 I already posted a comment on your patch but I'd like to reiterate here
 as well.  Currently the VMware driver is using the cluster name as
 hypervisor_hostname which is a problem because you can have different
 clusters with the same name.  We already have a critical bug filed for
 this:
 
 https://bugs.launchpad.net/nova/+bug/1329261
 
 There was an attempt to fix this by using a combination of vCenter UUID
 + cluster_name but it was rejected because this combination was not
 considered a 'real' hostname.  I think that if we go for a DB schema
 change we can fix both issues by renaming hypervisor_hostname to
 hypervisor_id and make it unique.  What do you think?

Well, I think hypervisor_id makes more sense than hypervisor_hostname.
The latter is confusing. However, I'd prefer

Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute

2015-02-23 Thread Matthew Booth
On 20/02/15 11:48, Matthew Booth wrote:
 Gary Kotton came across a doozy of a bug recently:
 
 https://bugs.launchpad.net/nova/+bug/1419785
 
 In short, when you start a Nova compute, it will query the driver for
 instances and compare that against the expected host of the instance
 according to the DB. If the driver is reporting an instance the DB
 thinks is on a different host, it assumes the instance was evacuated
 while Nova compute was down, and deletes it on the hypervisor. However,
 Gary found that you trigger this when starting up a backup HA node which
 has a different `host` config setting. i.e. You fail over, and the first
 thing it does is delete all your instances.
 
 Gary and I both agree on a couple of things:
 
 1. Deleting all your instances is bad
 2. HA nova compute is highly desirable for some drivers
 
 We disagree on the approach to fixing it, though. Gary posted this:
 
 https://review.openstack.org/#/c/154029/
 
 I've already outlined my objections to this approach elsewhere, but to
 summarise I think this fixes 1 symptom of a design problem, and leaves
 the rest untouched. If the value of nova compute's `host` changes, then
 the assumption that instances associated with that compute can be
 identified by the value of instance.host becomes invalid. This
 assumption is pervasive, so it breaks a lot of stuff. The worst one is
 _destroy_evacuated_instances(), which Gary found, but if you scan
 nova/compute/manager for the string 'self.host' you'll find lots of
 them. For example, all the periodic tasks are broken, including image
 cache management, and the state of ResourceTracker will be unusual.
 Worse, whenever a new instance is created it will have a different value
 of instance.host, so instances running on a single hypervisor will
 become partitioned based on which nova compute was used to create them.
 
 In short, the system may appear to function superficially, but it's
 unsupportable.
 
 I had an alternative idea. The current assumption is that the `host`
 managing a single hypervisor never changes. If we break that assumption,
 we break Nova, so we could assert it at startup and refuse to start if
 it's violated. I posted this VMware-specific POC:
 
 https://review.openstack.org/#/c/154907/
 
 However, I think I've had a better idea. Nova creates ComputeNode
 objects for its current configuration at startup which, amongst other
 things, are a map of host:hypervisor_hostname. We could assert when
 creating a ComputeNode that hypervisor_hostname is not already
 associated with a different host, and refuse to start if it is. We would
 give an appropriate error message explaining that this is a
 misconfiguration. This would prevent the user from hitting any of the
 associated problems, including the deletion of all their instances.

I have posted a patch implementing the above for review here:

https://review.openstack.org/#/c/158269/

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute

2015-02-23 Thread Matthew Booth
On 23/02/15 12:13, Gary Kotton wrote:
 
 
 On 2/23/15, 2:05 PM, Matthew Booth mbo...@redhat.com wrote:
 
 On 20/02/15 11:48, Matthew Booth wrote:
 Gary Kotton came across a doozy of a bug recently:

 https://bugs.launchpad.net/nova/+bug/1419785

 In short, when you start a Nova compute, it will query the driver for
 instances and compare that against the expected host of the instance
 according to the DB. If the driver is reporting an instance the DB
 thinks is on a different host, it assumes the instance was evacuated
 while Nova compute was down, and deletes it on the hypervisor. However,
 Gary found that you trigger this when starting up a backup HA node which
 has a different `host` config setting. i.e. You fail over, and the first
 thing it does is delete all your instances.

 Gary and I both agree on a couple of things:

 1. Deleting all your instances is bad
 2. HA nova compute is highly desirable for some drivers

 We disagree on the approach to fixing it, though. Gary posted this:

 https://review.openstack.org/#/c/154029/

 I've already outlined my objections to this approach elsewhere, but to
 summarise I think this fixes 1 symptom of a design problem, and leaves
 the rest untouched. If the value of nova compute's `host` changes, then
 the assumption that instances associated with that compute can be
 identified by the value of instance.host becomes invalid. This
 assumption is pervasive, so it breaks a lot of stuff. The worst one is
 _destroy_evacuated_instances(), which Gary found, but if you scan
 nova/compute/manager for the string 'self.host' you'll find lots of
 them. For example, all the periodic tasks are broken, including image
 cache management, and the state of ResourceTracker will be unusual.
 Worse, whenever a new instance is created it will have a different value
 of instance.host, so instances running on a single hypervisor will
 become partitioned based on which nova compute was used to create them.

 In short, the system may appear to function superficially, but it's
 unsupportable.

 I had an alternative idea. The current assumption is that the `host`
 managing a single hypervisor never changes. If we break that assumption,
 we break Nova, so we could assert it at startup and refuse to start if
 it's violated. I posted this VMware-specific POC:

 https://review.openstack.org/#/c/154907/

 However, I think I've had a better idea. Nova creates ComputeNode
 objects for its current configuration at startup which, amongst other
 things, are a map of host:hypervisor_hostname. We could assert when
 creating a ComputeNode that hypervisor_hostname is not already
 associated with a different host, and refuse to start if it is. We would
 give an appropriate error message explaining that this is a
 misconfiguration. This would prevent the user from hitting any of the
 associated problems, including the deletion of all their instances.

 I have posted a patch implementing the above for review here:

 https://review.openstack.org/#/c/158269/
 
 I have to look at what you have posted. I think that this topic is
 something that we should speak about at the summit and this should fall
 under some BP and well defined spec. I really would not like to see
 existing installations being broken if and when this patch lands. It may
 also affect Ironic as it works on the same model.

This patch will only affect installations configured with multiple
compute hosts for a single hypervisor. These are already broken, so this
patch will at least let them know if they haven't already noticed.

It won't affect Ironic, because they configure all compute hosts to have
the same 'host' value. An Ironic user would only notice this patch if
they accidentally misconfigured it, which is the intended behaviour.

Incidentally, I also support more focus on the design here. Until we
come up with a better design, though, we need to do our best to prevent
non-trivial corruption from a trivial misconfiguration. I think we need
to merge this, or something like it, now and still have a summit discussion.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] The strange case of osapi_compute_unique_server_name_scope

2015-02-23 Thread Matthew Booth
On 20/02/15 20:15, Andrew Bogott wrote:
 On 2/20/15 9:06 AM, Mike Dorman wrote:
 I can report that we do use this option (‘global' setting.)  We have to
 enforce name uniqueness for instances’ integration with some external
 systems (namely AD and Spacewalk) which require unique naming.

 However, we also do some external name validation which I think
 effectively enforces uniqueness as well.  So if this were deprecated, I
 don’t know if it would directly affect us for our specific situation.

 Other operators, anyone else using
 osapi_compute_unique_server_name_scope?
 I use it!  And, in fact, added it in the first place :(
 
 I have no real recall of what we concluded when originally discussing
 the associated race.  The feature is useful to me and I'd love it if it
 could be moved into the db to fix the race, but I concede that it's a
 pretty big can o' worms, so if no one else cries out in pain I can live
 with it being deprecated.

Ok, with 2 identified users I'm going to abandon my patch to deprecate
with no replacement. My current plan is to leave the race there with a
comment in the code and the config documentation. Perhaps we can fix
this properly at some point when we get the online schema changes, but
for the moment it seems like a lot of complication for a relatively
small problem.

Do you use the global or project scope, btw?

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute

2015-02-20 Thread Matthew Booth
Gary Kotton came across a doozy of a bug recently:

https://bugs.launchpad.net/nova/+bug/1419785

In short, when you start a Nova compute, it will query the driver for
instances and compare that against the expected host of the instance
according to the DB. If the driver is reporting an instance the DB
thinks is on a different host, it assumes the instance was evacuated
while Nova compute was down, and deletes it on the hypervisor. However,
Gary found that you trigger this when starting up a backup HA node which
has a different `host` config setting. i.e. You fail over, and the first
thing it does is delete all your instances.

Gary and I both agree on a couple of things:

1. Deleting all your instances is bad
2. HA nova compute is highly desirable for some drivers

We disagree on the approach to fixing it, though. Gary posted this:

https://review.openstack.org/#/c/154029/

I've already outlined my objections to this approach elsewhere, but to
summarise I think this fixes 1 symptom of a design problem, and leaves
the rest untouched. If the value of nova compute's `host` changes, then
the assumption that instances associated with that compute can be
identified by the value of instance.host becomes invalid. This
assumption is pervasive, so it breaks a lot of stuff. The worst one is
_destroy_evacuated_instances(), which Gary found, but if you scan
nova/compute/manager for the string 'self.host' you'll find lots of
them. For example, all the periodic tasks are broken, including image
cache management, and the state of ResourceTracker will be unusual.
Worse, whenever a new instance is created it will have a different value
of instance.host, so instances running on a single hypervisor will
become partitioned based on which nova compute was used to create them.

In short, the system may appear to function superficially, but it's
unsupportable.

I had an alternative idea. The current assumption is that the `host`
managing a single hypervisor never changes. If we break that assumption,
we break Nova, so we could assert it at startup and refuse to start if
it's violated. I posted this VMware-specific POC:

https://review.openstack.org/#/c/154907/

However, I think I've had a better idea. Nova creates ComputeNode
objects for its current configuration at startup which, amongst other
things, are a map of host:hypervisor_hostname. We could assert when
creating a ComputeNode that hypervisor_hostname is not already
associated with a different host, and refuse to start if it is. We would
give an appropriate error message explaining that this is a
misconfiguration. This would prevent the user from hitting any of the
associated problems, including the deletion of all their instances.
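
Roughly speaking, the check would look something like this sketch (names
and object access simplified, not the actual Nova code):

    def assert_hypervisor_not_claimed(existing_node, my_host):
        """Refuse to start if another host already manages this hypervisor.

        existing_node is whatever ComputeNode record (if any) is already
        stored for this hypervisor_hostname; my_host is CONF.host.
        """
        if existing_node is not None and existing_node['host'] != my_host:
            raise RuntimeError(
                "Hypervisor %s is already managed by host %s; refusing to "
                "start with host=%s. If this is an active/passive HA pair, "
                "configure both nodes with the same 'host' value."
                % (existing_node['hypervisor_hostname'],
                   existing_node['host'], my_host))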

We can still do active/passive HA!

If we configure both nodes in the active/passive cluster identically,
including with the same value of `host`, I don't see why this shouldn't
work today. I don't even think the configuration is onerous. All we
would be doing is preventing the user from accidentally running a
misconfigured HA which leads to inconsistent state, and will eventually
require manual cleanup.

We would still have to be careful that we don't bring up both nova
computes simultaneously. The VMware driver, at least, has hardcoded
assumptions that it is the only writer in certain circumstances. That
problem would have to be handled separately, perhaps at the messaging layer.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] The strange case of osapi_compute_unique_server_name_scope

2015-02-20 Thread Matthew Booth
On 19/02/15 18:57, Jay Pipes wrote:
 On 02/19/2015 05:18 AM, Matthew Booth wrote:
 Nova contains a config variable osapi_compute_unique_server_name_scope.
 Its help text describes it pretty well:

 When set, compute API will consider duplicate hostnames invalid within
 the specified scope, regardless of case. Should be empty, project or
 global.

 So, by default hostnames are not unique, but depending on this setting
 they could be unique either globally or in the scope of a project.

 Ideally a unique constraint would be enforced by the database but,
 presumably because this is a config variable, that isn't the case here.
 Instead it is enforced in code, but the code which does this predictably
 races. My first attempt to fix this using the obvious SQL solution
 appeared to work, but actually fails in MySQL as it doesn't support that
 query structure[1][2]. SQLite and PostgreSQL do support it, but they
 don't support the query structure which MySQL supports. Note that this
 isn't just a syntactic thing. It looks like it's still possible to do
 this if we compound the workaround with a second workaround, but I'm
 starting to wonder if we'd be better fixing the design.

 First off, do we need this config variable? Is anybody actually using
 it? I suspect the answer's going to be yes, but it would be extremely
 convenient if it's not.

 Assuming this configurability is required, is there any way we can
 instead use it to control a unique constraint in the db at service
 startup? This would be something akin to a db migration. How do we
 manage those?

 Related to the above, I'm not 100% clear on which services run this
 code. Is it possible for different services to have a different
 configuration of this variable, and does that make sense? If so, that
 would preclude a unique constraint in the db.
 
 Is there a bug that you are attempting to fix here? If not, I'd suggest
 just leaving this code as it is...

The bug is the race. If a user sets
osapi_compute_unique_server_name_scope they're presumably expecting the
associated uniqueness guarantees, but we're not providing them.

John suggested I deprecate it and see who complains:

https://review.openstack.org/157731

In the meantime, I'm starting to think the most prudent course of
action would be to not fix the race: it's not the most important race in
that function, it's been broken for a long time, and I suspect few
people are using it. We could document that it's broken... In fact, I
might add that to the deprecation notice.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] The strange case of osapi_compute_unique_server_name_scope

2015-02-19 Thread Matthew Booth
Nova contains a config variable osapi_compute_unique_server_name_scope.
Its help text describes it pretty well:

When set, compute API will consider duplicate hostnames invalid within
the specified scope, regardless of case. Should be empty, project or
global.

So, by default hostnames are not unique, but depending on this setting
they could be unique either globally or in the scope of a project.

Ideally a unique constraint would be enforced by the database but,
presumably because this is a config variable, that isn't the case here.
Instead it is enforced in code, but the code which does this predictably
races. My first attempt to fix this using the obvious SQL solution
appeared to work, but actually fails in MySQL as it doesn't support that
query structure[1][2]. SQLite and PostgreSQL do support it, but they
don't support the query structure which MySQL supports. Note that this
isn't just a syntactic thing. It looks like it's still possible to do
this if we compound the workaround with a second workaround, but I'm
starting to wonder if we'd be better fixing the design.
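
To make the race concrete, this is the shape of the check-then-insert
pattern in question (illustrative table and DB-API cursor, not the real
Nova code):

    def create_instance_checked(cursor, project_id, hostname):
        # With no unique constraint in the database, two concurrent
        # requests can both pass this check before either INSERT commits.
        cursor.execute("SELECT count(*) FROM instances "
                       "WHERE project_id = %s AND hostname = %s",
                       (project_id, hostname))
        (existing,) = cursor.fetchone()
        if existing:
            raise ValueError("duplicate hostname: %s" % hostname)
        # Nothing stops a concurrent request doing the same INSERT here.
        cursor.execute("INSERT INTO instances (project_id, hostname) "
                       "VALUES (%s, %s)", (project_id, hostname))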

First off, do we need this config variable? Is anybody actually using
it? I suspect the answer's going to be yes, but it would be extremely
convenient if it's not.

Assuming this configurability is required, is there any way we can
instead use it to control a unique constraint in the db at service
startup? This would be something akin to a db migration. How do we
manage those?

Related to the above, I'm not 100% clear on which services run this
code. Is it possible for different services to have a different
configuration of this variable, and does that make sense? If so, that
would preclude a unique constraint in the db.

Thanks,

Matt

[1] Which has prompted me to get the test_db_api tests running on MySQL.
See this series if you're interested:
https://review.openstack.org/#/c/156299/

[2] For specifics, see my ramblings here:
https://review.openstack.org/#/c/141115/7/nova/db/sqlalchemy/api.py,cm
line 2547
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Libguestfs: possibility not to use it, even when installed ?

2015-02-19 Thread Matthew Booth
On 18/02/15 18:23, Raphael Glon wrote:
 Hi,
 
 This is about review:
 https://review.openstack.org/#/c/156633/
 
 1 line, can be controversial
 
 Its purpose is to add the possibility not to use libguestfs for data
 injection in nova, even when installed.
 
 Not discussing about the fact that libguestfs should be preferred over
 fuse mounts for data injection as much as possible because mounts are
 more subject to causing security issues (and already have in the past
 nova releases).
 
 However, there are a lot of potential cases when libguestfs won't be
 usable for data injection
 
 This was the case here (fixed):
 https://bugzilla.redhat.com/show_bug.cgi?id=984409
 
 I entcountered a similar case more recently on powerkvm 2.1.0 (defect
 with the libguestfs)
 
 So just saying it could be good adding a simple config flag (set to True
 by default, to keep the current behaviour untouched) to force nova not
 using libguestfs without having to uninstall it and thus prevent other
 users on the host from using it.

A big -1 to this. You seem to be saying that there were bugs in a
library, which were fixed. News at 11. This isn't a realistic way to
manage a large software stack.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][vmware] HA and image caching

2015-02-19 Thread Matthew Booth
I was discussing the problems of configuring active/passive HA earlier
where the active and passive nodes have a different `host` configured.
There are lots of problems, as I've mentioned before, but Gary pointed
out that it probably affects image cache management, too. This had the
potential for unsubtle breakage, so I decided to check it out in a bit
more detail. The bug is real, but fortunately doesn't delete anything
(which is itself a bug):

  storage_users.register_storage_use(CONF.instances_path, CONF.host)
  nodes = storage_users.get_storage_users(CONF.instances_path)

This bit of code assumes shared storage between different nova nodes
which access the same images. We don't have that, so `nodes` above is
going to contain only the current node.

  filters = {'deleted': False,
             'soft_deleted': True,
             'host': nodes}
  filtered_instances = objects.InstanceList.get_by_filters(context,
      filters, expected_attrs=[], use_slave=True)

We're getting a list of instances which were created by the current node
only.

  self.driver.manage_image_cache(context, filtered_instances)

Which we're passing to the driver.

So, the driver is only going to consider images attached to instances
which were created by the current node. The effect of this is that just
after a switchover, cache management will not consider any instances,
and therefore nothing will be expired. This is still bad, but it's not
going to delete users' stuff.

Incidentally, I remain convinced that we can safely configure
active/passive HA if we configure all nodes with the same `host`. That
way we wouldn't have to worry about all these edge cases, and we
wouldn't need to eventually create a cleanup job to consolidate all
instances running on a single hypervisor to have the same 'host'.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vmware][nova] Prevent HA configuration with different hostnames

2015-02-11 Thread Matthew Booth
On 11/02/15 15:49, Gary Kotton wrote:
 Hi,
 I do not think that that is a healthy solution. That effectively would
 render a cluster down if the compute node goes down. That would be a real
 disaster. The ugly work around is setting the host names to be the same
 value.

I don't think that's an ugly work around. I think that's the only
currently viable solution.

 This is something that we should discuss at the next summit and I would
 hope to propose a topic to talk about.

Sounds like a good plan. However, given that the bug is marked Critical
I was assuming we wanted a more expedient fix, which is what I've proposed.

Matt

 Thanks
 Gary
 
 On 2/11/15, 5:31 PM, Matthew Booth mbo...@redhat.com wrote:
 
 I just posted this:

 https://review.openstack.org/#/c/154907/

 as an alternative fix for critical bug:

 https://bugs.launchpad.net/nova/+bug/1419785

 I've just knocked this up quickly for illustration: it obviously needs
 plenty of cleanup. I have confirmed that it works, though.

 Before I take it any further, though, I'd like to get some feedback on
 the approach. I prefer this to the alternative, because the underlying
 problem is deeper than supporting evacuate. I'd prefer to be honest with
 the user and just say it ain't gonna work. The alternative would leave
 Nova running in a broken state, leaving inconsistent state in its wake
 as it runs.

 Matt

 -- 
 Matthew Booth
 Red Hat Engineering, Virtualisation Team

 Phone: +442070094448 (UK)
 GPG ID:  D33C3490
 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vmware][nova] Prevent HA configuration with different hostnames

2015-02-11 Thread Matthew Booth
On 11/02/15 16:40, Gary Kotton wrote:
 
 
 On 2/11/15, 6:35 PM, Sylvain Bauza sba...@redhat.com wrote:
 

 On 11/02/2015 17:04, Gary Kotton wrote:
 I posted a fix that does not break things and supports HA.
 https://review.openstack.org/154029


 Just let's be clear, HA is *not* supported by Nova now.
 
 That is not correct. It is actually support if the host_ip is the same. If
 the host_ip is not the same then there is an issue when one of the compute
 nodes restarts - it will delete all instances that do not have its host_ip.

FWIW, I suspect you're correct. My patch enforces that.

So, check out the init code from compute manager:

    def init_host(self):
        """Initialization for a standalone compute service."""
        self.driver.init_host(host=self.host)

The driver is being configured with the Compute service's 'host' config
variable. Let's scan through and look at what else uses that:


    def _update_resource_tracker(self, context, instance):
        """Let the resource tracker know that an instance has changed
        state.
        """
        if (instance['host'] == self.host and
                self.driver.node_is_available(instance['node'])):

Instances with the old host value will be ignored by
_update_resource_tracker. That doesn't sound good.

There's _destroy_evacuated_instances(), which you already found :)

There's this in _validate_instance_group_policy:

    group_hosts = group.get_hosts(context, exclude=[instance.uuid])
    if self.host in group_hosts:
        msg = _("Anti-affinity instance group policy was violated.")

You've changed group affinity.

There's _check_instance_build_time:

    filters = {'vm_state': vm_states.BUILDING,
               'host': self.host}

Nova won't check instances created by the old host to see if they're stuck.

There's this in _heal_instance_info_cache:

    db_instances = objects.InstanceList.get_by_host(
        context, self.host, expected_attrs=[], use_slave=True)

This cleanup job won't find instances created by the old host.

There's this in _poll_rebooting_instances:

    filters = {'task_state':
                   [task_states.REBOOTING,
                    task_states.REBOOT_STARTED,
                    task_states.REBOOT_PENDING],
               'host': self.host}

Nova won't poll instances created by the old host.

This is just a cursory flick through. I'm fairly sure this is going to
be a lot of work to fix. My patch just ensures that Nova refuses to
start instead of letting bad things happen. If you ensure that
'self.host' in the above code is the same for all HA nodes I don't see
why it shouldn't work, though. My patch won't prevent that.

Matt

 

 The main reason is that compute *nodes* are considered given by the
 hypervisor (ie. the virt driver ran by the compute manager worker), so
 if 2 or more hypervisors on two distinct machines are getting the same
 list of nodes, then you would have duplicates.
 
 No. There are no duplicates.
 

 -Sylvain

 On 2/11/15, 5:55 PM, Matthew Booth mbo...@redhat.com wrote:

 On 11/02/15 15:49, Gary Kotton wrote:
 Hi,
 I do not think that that is a healthy solution. That effectively would
 render a cluster down if the compute node goes down. That would be a
 real
 disaster. The ugly work around is setting the host names to be the
 same
 value.
 I don't think that's an ugly work around. I think that's the only
 currently viable solution.

 This is something that we should discuss at the next summit and I
 would
 hope to propose a topic to talk about.
 Sounds like a good plan. However, given that the bug is marked Critical
 I was assuming we wanted a more expedient fix, which is what I've
 proposed.

 Matt

 Thanks
 Gary

 On 2/11/15, 5:31 PM, Matthew Booth mbo...@redhat.com wrote:

 I just posted this:

 https://review.openstack.org/#/c/154907/

 as an alternative fix for critical bug:

 https://bugs.launchpad.net/nova/+bug/1419785

 I've just knocked this up quickly for illustration: it obviously
 needs
 plenty of cleanup. I have confirmed that it works, though.

 Before I take it any further, though, I'd like to get some feedback
 on
 the approach. I prefer this to the alternative, because the
 underlying
 problem is deeper than supporting evacuate. I'd prefer to be honest
 with
 the user and just say it ain't gonna work. The alternative would
 leave
 Nova running in a broken state, leaving inconsistent state in its
 wake
 as it runs.

 Matt

 -- 
 Matthew Booth
 Red Hat Engineering, Virtualisation Team

 Phone: +442070094448 (UK)
 GPG ID:  D33C3490
 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe:
 openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

[openstack-dev] [vmware][nova] Prevent HA configuration with different hostnames

2015-02-11 Thread Matthew Booth
I just posted this:

https://review.openstack.org/#/c/154907/

as an alternative fix for critical bug:

https://bugs.launchpad.net/nova/+bug/1419785

I've just knocked this up quickly for illustration: it obviously needs
plenty of cleanup. I have confirmed that it works, though.

Before I take it any further, though, I'd like to get some feedback on
the approach. I prefer this to the alternative, because the underlying
problem is deeper than supporting evacuate. I'd prefer to be honest with
the user and just say it ain't gonna work. The alternative would leave
Nova running in a broken state, leaving inconsistent state in its wake
as it runs.

Matt

-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-11 Thread Matthew Booth
On 10/02/15 18:29, Jay Pipes wrote:
 On 02/10/2015 09:47 AM, Matthew Booth wrote:
 On 09/02/15 18:15, Jay Pipes wrote:
 On 02/09/2015 01:02 PM, Attila Fazekas wrote:
 I do not see why not to use `FOR UPDATE` even with multi-writer or
 Is the retry/swap way really solves anything here.
 snip
 Am I missed something ?

 Yes. Galera does not replicate the (internal to InnnoDB) row-level locks
 that are needed to support SELECT FOR UPDATE statements across multiple
 cluster nodes.

 https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ


 Is that the right link, Jay? I'm taking your word on the write-intent
 locks not being replicated, but that link seems to say the opposite.
 
 This link is better:
 
 http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/
 
 
 Specifically the line:
 
 The local record lock held by the started transation on pxc1 didn’t
 play any part in replication or certification (replication happens at
 commit time, there was no commit there yet).

Thanks, Jay, that's a great article.

Based on that, I think I may have misunderstood what you were saying
before. I currently understand that the behaviour of select ... for
update is correct on Galera, it's just not very efficient. Correct in
this case meaning it aborts the transaction due to a correctly detected
lock conflict.

FWIW, that was pretty much my original understanding, but without the
detail.

To expand: Galera doesn't replicate write intent locks, but it turns out
it doesn't have to for correctness. The reason is that the conflict
between a local write intent lock and a remote write, which is
replicated, will always be detected during or before local certification.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-10 Thread Matthew Booth
On 09/02/15 18:15, Jay Pipes wrote:
 On 02/09/2015 01:02 PM, Attila Fazekas wrote:
 I do not see why not to use `FOR UPDATE` even with multi-writer or
 Is the retry/swap way really solves anything here.
 snip
 Am I missed something ?
 
 Yes. Galera does not replicate the (internal to InnnoDB) row-level locks
 that are needed to support SELECT FOR UPDATE statements across multiple
 cluster nodes.
 
 https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ

Is that the right link, Jay? I'm taking your word on the write-intent
locks not being replicated, but that link seems to say the opposite.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all] Theory of Testing Cross Project Spec

2015-02-09 Thread Matthew Booth
On 09/02/15 13:59, Sean Dague wrote:
 Comments are welcomed, please keep typos / grammar as '0' comments to
 separate them from -1 comments. Also, I ask any -1s to be extremely
 clear about the core concern of the -1 so we can figure out how to make
 progress.

Incidentally, I think these are good guidelines for comments on all reviews.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [all][oslo.db] Repeatable Read considered harmful

2015-02-06 Thread Matthew Booth
I was surprised recently to discover that MySQL uses repeatable read for
transactions by default. Postgres uses read committed by default, and
SQLite uses serializable. We don't set the isolation level explicitly
anywhere, so our applications are running under different isolation
levels depending on backend. This doesn't sound like a good idea to me.
It's one thing to support multiple sql syntaxes, but different isolation
levels have different semantics. Supporting that is much harder, and
currently we're not even trying.

I'm aware that the same isolation level on different databases will
still have subtly different semantics, but at least they should agree on
the big things. I think we should pick one, and it should be read committed.
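
For what it's worth, if we did standardise, SQLAlchemy lets us pin the
level when the engine is created, something like this (connection URL
illustrative):

    from sqlalchemy import create_engine

    engine = create_engine('mysql://nova:secret@dbhost/nova',
                           isolation_level='READ COMMITTED')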

Also note that 'repeatable read' on both MySQL and Postgres is actually
snapshot isolation, which isn't quite the same thing. For example, it
doesn't get phantom reads.

The most important reason I think we need read committed is recovery
from concurrent changes within the scope of a single transaction. To
date, in Nova at least, this hasn't been an issue as transactions have
had an extremely small scope. However, we're trying to expand that scope
with the new enginefacade in oslo.db:
https://review.openstack.org/#/c/138215/ . With this expanded scope,
transaction failure in a library function can't simply be replayed
because the transaction scope is larger than the function.

So, 3 concrete examples of how repeatable read will make Nova worse:

* https://review.openstack.org/#/c/140622/

This was committed to Nova recently. Note how it involves a retry in the
case of concurrent change. This works fine, because the retry creates
a new transaction. However, if the transaction was larger than the scope
of this function this would not work, because each iteration would
continue to read the old data. The solution to this is to create a new
transaction. However, because the transaction is outside of the scope of
this function, the only thing we can do locally is fail. The caller then
has to re-execute the whole transaction, or fail itself.

This is a local concurrency problem which can be very easily handled
locally, but not if we're using repeatable read.
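
For illustration, this is the shape of the compare-and-swap retry being
discussed (illustrative table, column and DB-API cursor, not the real
code). Under read committed the re-read on each pass sees whatever other
transactions have committed, so the loop converges without leaving the
enclosing transaction; under repeatable read the re-read keeps returning
the original snapshot, so retrying locally gets us nowhere.

    def bump_generation(cursor, resource_id, retries=5):
        for _ in range(retries):
            cursor.execute("SELECT generation FROM resources WHERE id = %s",
                           (resource_id,))
            (current,) = cursor.fetchone()
            # Compare-and-swap: only update if nobody else got there first.
            cursor.execute("UPDATE resources SET generation = %s "
                           "WHERE id = %s AND generation = %s",
                           (current + 1, resource_id, current))
            if cursor.rowcount:
                return current + 1
        raise RuntimeError("too much contention on resource %s" % resource_id)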

*
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L4749

Nova has multiple functions of this type which attempt to update a
key/value metadata table. I'd expect to find multiple concurrency issues
with this if I stopped to give it enough thought, but concentrating just
on what's there, notice how the retry loop starts a new transaction. If
we want to get to a place where we don't do that, with repeatable read
we're left failing the whole transaction.

* https://review.openstack.org/#/c/136409/

This one isn't upstream, yet. It's broken, and I can't currently think
of a solution if we're using repeatable read.

The issue is atomic creation of a shared resource. We want to handle a
creation race safely. This patch:

* Attempts to read the default (it will normally exist)
* Creates a new one if it doesn't exist
* Goes back to the start if creation failed due to a duplicate

Seems fine, but it will fail because the re-read will continue to not
return the new value under repeatable read (no phantom reads). The only
way to see the new row is a new transaction. As this will no longer be
in the scope of this function, the only solution will be to fail. Read
committed could continue without failing.
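
The shape of that, for reference (illustrative schema, assuming a unique
constraint on (project_id, name) and a MySQL DB-API connection; in MySQL a
duplicate-key error aborts only the statement, not the transaction):

    import MySQLdb

    def get_or_create_default(cursor, project_id):
        while True:
            cursor.execute("SELECT id FROM security_groups "
                           "WHERE project_id = %s AND name = 'default'",
                           (project_id,))
            row = cursor.fetchone()
            if row is not None:
                return row[0]
            try:
                cursor.execute(
                    "INSERT INTO security_groups (project_id, name) "
                    "VALUES (%s, 'default')", (project_id,))
            except MySQLdb.IntegrityError:
                continue  # lost the creation race; go back and re-read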

Incidentally, this currently works by using multiple transactions, which
we are trying to avoid. It has also been suggested that in this specific
instance the default security group could be created with the project.
However, that would both be more complicated, because it would require
putting a hook into another piece of code, and less robust, because it
wouldn't recover if somebody deleted the default security group.


To summarise, with repeatable read we're forced to abort the current
transaction to deal with certain relatively common classes of
concurrency issue, whereas with read committed we can safely recover. If
we want to reduce the number of transactions we're using, which we do,
the impact of this is going to dramatically increase. We should
standardise on read committed.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 04/02/15 19:04, Jay Pipes wrote:
 On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:
 On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote:
 I've spent a few hours today reading about Galera, a clustering solution
 for MySQL. Galera provides multi-master 'virtually synchronous'
 replication between multiple mysql nodes. i.e. I can create a cluster of
 3 mysql dbs and read and write from any of them with certain consistency
 guarantees.

 I am no expert[1], but this is a TL;DR of a couple of things which I
 didn't know, but feel I should have done. The semantics are important to
 application design, which is why we should all be aware of them.


 * Commit will fail if there is a replication conflict

 foo is a table with a single field, which is its primary key.

 A: start transaction;
 B: start transaction;
 A: insert into foo values(1);
 B: insert into foo values(1); -- 'regular' DB would block here, and
report an error on A's commit
 A: commit; -- success
 B: commit; -- KABOOM

 Confusingly, Galera will report a 'deadlock' to node B, despite this not
 being a deadlock by any definition I'm familiar with.
 
 It is a failure to certify the writeset, which bubbles up as an InnoDB
 deadlock error. See my article here:
 
 http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
 
 
 Which explains this.
 
 Yes ! and if I can add more information and I hope I do not make
 mistake I think it's a know issue which comes from MySQL, that is why
 we have a decorator to do a retry and so handle this case here:

   
 http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

 
 It's not an issue with MySQL. It's an issue with any database code that
 is highly contentious.
 
 Almost all highly distributed or concurrent applications need to handle
 deadlock issues, and the most common way to handle deadlock issues on
 database records is using a retry technique. There's nothing new about
 that with Galera.
 
 The issue with our use of the @_retry_on_deadlock decorator is *not*
 that the retry decorator is not needed, but rather it is used too
 frequently. The compare-and-swap technique I describe in the article
 above dramatically* reduces the number of deadlocks that occur (and need
 to be handled by the @_retry_on_deadlock decorator) and dramatically
 reduces the contention over critical database sections.

I'm still confused as to how this code got there, though. We shouldn't
be hitting Galera lock contention (reported as deadlocks) if we're using
a single master, which I thought we were. Does this mean either:

A. There are deployments using multi-master?
B. These are really deadlocks?

If A, is this something we need to continue to support?

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 05/02/15 11:11, Sahid Orentino Ferdjaoui wrote:
 I'm still confused as to how this code got there, though. We shouldn't
 be hitting Galera lock contention (reported as deadlocks) if we're using
 a single master, which I thought we were. Does this mean either:
 
 I guess we can hit a lock contention even in single master.

I don't think so, but you can certainly still have real deadlocks.
They're bugs, though.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] stuck patches at the nova IRC meeting

2015-02-05 Thread Matthew Booth
On 05/02/15 15:02, Alexis Lee wrote:
 Sean Dague said on Wed, Feb 04, 2015 at 09:51:30AM -0500:
 As there has been a bunch of concern around patches getting lost or
 stuck, I wanted to re-announce the fact that we've got a dedicated slot
 at the weekly Nova meeting for just those sorts of things.
 
 The slot turned into everyone talking at once fairly quickly this week.
 That led to several patches not getting a real discussion.
 
 May I suggest stricter moderation? EG a short phase to propose items,
 then work through them 1 by 1. Or, we take items one by one according to
 who shouts fastest but ask people not to interrupt.

+1

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 05/02/15 04:30, Mike Bayer wrote:
 Galera doesn't change anything here. I'm really not sure what the
 fuss is about, frankly.
 
 because we’re trying to get Galera to actually work as a load
 balanced cluster to some degree, at least for reads.

Yeah, the use case of concern here is consecutive RPC calls from a
single remote client, which can't reasonably share a single
transaction. This affects semantics visible to the end-user.

In Nova, they might do:

$ nova aggregate-create ...
$ nova aggregate-details ...

Should they expect that the second command might fail if they don't
pause long enough between the 2? Should they retry until it succeeds?
This example is a toy, but I would expect to find many other more subtle
examples.

 Otherwise I’m not really sure why we have to bother with Galera at
 all.  If we just want a single MySQL server that has a warm standby
 for failover, why aren’t we just using that capability straight from
 MySQL.  Then we get “SELECT FOR UPDATE” and everything else back.

Actually I think this is a misconception. If I have understood
correctly[1], Galera *does* work with select for update. Use of select
for update on a single node will work exactly as normal with blocking
behaviour. Use of select for update across 2 nodes will not block, but
fail on commit if there was lock contention.
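
i.e. roughly this (a sketch, untested against a real cluster; the DSNs
and the table are made up):

import sqlalchemy as sa
from sqlalchemy import text

node_a = sa.create_engine("mysql+pymysql://user:pw@node-a/test")
node_b = sa.create_engine("mysql+pymysql://user:pw@node-b/test")

conn_a, conn_b = node_a.connect(), node_b.connect()
txn_a, txn_b = conn_a.begin(), conn_b.begin()

# Row locks are only taken locally, so neither of these blocks:
conn_a.execute(text("SELECT * FROM foo WHERE id = 1 FOR UPDATE"))
conn_b.execute(text("SELECT * FROM foo WHERE id = 1 FOR UPDATE"))

conn_a.execute(text("UPDATE foo SET val = 'a' WHERE id = 1"))
conn_b.execute(text("UPDATE foo SET val = 'b' WHERE id = 1"))

txn_a.commit()        # first to certify: succeeds
try:
    txn_b.commit()    # loses certification: reported as a 'deadlock'
except sa.exc.DBAPIError:
    pass              # fail the operation, or re-execute the lot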

 Galera’s “multi master” capability is already in the trash for us,
 and it seems like “multi-slave” is only marginally useful either, the
 vast majority of openstack has to be 100% pointed at just one node to
 work correctly.

It's not necessarily in the trash, but given that the semantics are
different (fail on commit rather than block) we'd need to do more work
to support them. It sounds to me that we want to defer that rather than
try to fix it now. i.e. multi-master is currently unsupport(ed|able).

We could add an additional decorator to enginefacade which would
re-execute a @writer block if it detected Galera lock contention.
However, given that we'd have to audit that code for other side-effects,
for the moment it sounds like it's safer to fail.

Matt

[1] Standard caveats apply.
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 05/02/15 11:01, Attila Fazekas wrote:
 I have a question related to deadlock handling as well.
 
 Why is the DBDeadlock exception not caught generally for all api/rpc requests?
 
 The MySQL recommendation regarding deadlocks [1]:
 Normally, you must write your applications so that they are always 
  prepared to re-issue a transaction if it gets rolled back because of a 
 deadlock.

This is evil imho, although it may well be pragmatic. A deadlock (a real
deadlock, that is) occurs because of a preventable bug in code. It
occurs because 2 transactions have attempted to take multiple locks in a
different order. Getting this right is hard, but it is achievable. The
solution to real deadlocks is to fix the bugs.

Galera 'deadlocks' on the other hand are not deadlocks, despite being
reported as such (sounds as though this is due to an implementation
quirk?). They don't involve 2 transactions holding mutual locks, and
there is never any doubt about how to proceed. They involve 2
transactions holding the same lock, and 1 of them committed first. In a
real deadlock they wouldn't get as far as commit. This isn't any kind of
bug: it's normal behaviour in this environment and you just have to
handle it.
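
For contrast, a genuine deadlock needs mutual locks taken in different
orders, e.g. (a sketch; the DSN and table are made up):

import sqlalchemy as sa
from sqlalchemy import text

engine = sa.create_engine("mysql+pymysql://user:pw@db/test")

a, b = engine.connect(), engine.connect()
txn_a, txn_b = a.begin(), b.begin()

a.execute(text("SELECT * FROM foo WHERE id = 1 FOR UPDATE"))
b.execute(text("SELECT * FROM foo WHERE id = 2 FOR UPDATE"))
a.execute(text("SELECT * FROM foo WHERE id = 2 FOR UPDATE"))  # blocks on b
b.execute(text("SELECT * FROM foo WHERE id = 1 FOR UPDATE"))  # deadlock

Neither transaction can ever proceed, so the database picks a victim
and rolls it back. The fix is to take the locks in a consistent order.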

 Now the services are just handling the DBDeadlock in several places.
 We have some logstash hits for other places even without galera.

I haven't had much success with logstash. Could you post a query which
would return these? This would be extremely interesting.

 Instead of throwing 503 to the end user, the request could be repeated 
 `silently`.
 
 The user would be able to repeat the request himself,
 so the automated repeat should not cause an unexpected new problem.

Good point: we could argue 'no worse than now', even if it's buggy.

 The retry limit might be configurable; the exception needs to be watched
 for before anything is sent to the db on behalf of the transaction or
 request.
 
 Considering every request handler as a potential deadlock thrower seems
 much easier than deciding case by case.

Well this happens at the transaction level, and we don't quite have a
1:1 request:transaction relationship. We're moving towards it, but
potentially long running requests will always have to use multiple
transactions.

However, I take your point. I think retry on transaction failure is
something which would benefit from standard handling in a library.
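
Something along these lines, say (a sketch only; the decorator name and
knobs are made up, though DBDeadlock is the real oslo.db exception):

import functools
import time

from oslo_db import exception as db_exc

def retry_on_db_conflict(attempts=5, backoff=0.1):
    # Re-run the decorated callable if the DB reports a deadlock (which
    # on Galera includes certification failures). The callable must own
    # its whole transaction, so everything it did has already been
    # rolled back before we try again.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except db_exc.DBDeadlock:
                    if attempt == attempts:
                        raise
                    time.sleep(backoff * attempt)
        return wrapper
    return decorator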

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 04/02/15 17:05, Sahid Orentino Ferdjaoui wrote:
 * Commit will fail if there is a replication conflict

 foo is a table with a single field, which is its primary key.

 A: start transaction;
 B: start transaction;
 A: insert into foo values(1);
 B: insert into foo values(1); -- 'regular' DB would block here, and
   report an error on A's commit
 A: commit; -- success
 B: commit; -- KABOOM

 Confusingly, Galera will report a 'deadlock' to node B, despite this not
 being a deadlock by any definition I'm familiar with.
 
 Yes! And if I can add more information, and I hope I do not make a
 mistake, I think it's a known issue which comes from MySQL; that is why
 we have a decorator to do a retry and so handle this case here:
 
   
 http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

Right, and that remains a significant source of confusion and
obfuscation in the db api. Our db code is littered with races and
potential actual deadlocks, but only some functions are decorated. Are
they decorated because of real deadlocks, or because of Galera lock
contention? The solutions to those 2 problems are very different! Also,
hunting deadlocks is hard enough work. Adding the possibility that they
might not even be there is just evil.

Incidentally, we're currently looking to replace this stuff with some
new code in oslo.db, which is why I'm looking at it.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Matthew Booth
I've spent a few hours today reading about Galera, a clustering solution
for MySQL. Galera provides multi-master 'virtually synchronous'
replication between multiple mysql nodes. i.e. I can create a cluster of
3 mysql dbs and read and write from any of them with certain consistency
guarantees.

I am no expert[1], but this is a TL;DR of a couple of things which I
didn't know, but feel I should have done. The semantics are important to
application design, which is why we should all be aware of them.


* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- 'regular' DB would block here, and
  report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not
being a deadlock by any definition I'm familiar with.

Essentially, anywhere that a regular DB would block, Galera will not
block transactions on different nodes. Instead, it will cause one of the
transactions to fail on commit. This is still ACID, but the semantics
are quite different.

The impact of this is that code which makes correct use of locking may
still fail with a 'deadlock'. The solution to this is to either fail the
entire operation, or to re-execute the transaction and all its
associated code in the expectation that it won't fail next time.

As I understand it, these can be eliminated by sending all writes to a
single node, although that obviously makes less efficient use of your
cluster.


* Write followed by read on a different node can return stale data

During a commit, Galera replicates a transaction out to all other db
nodes. Due to its design, Galera knows these transactions will be
successfully committed to the remote node eventually[2], but it doesn't
commit them straight away. The remote node will check these outstanding
replication transactions for write conflicts on commit, but not for
read. This means that you can do:

A: start transaction;
A: insert into foo values(1)
A: commit;
B: select * from foo; -- May not contain the value we inserted above[3]

This means that even for 'synchronous' slaves, if a client makes an RPC
call which writes a row to write master A, then another RPC call which
expects to read that row from synchronous slave node B, there's no
default guarantee that it'll be there.

Galera exposes a session variable which will fix this: wsrep_sync_wait
(or wsrep_causal_reads on older mysql). However, this isn't the default.
It presumably has a performance cost, but I don't know what it is, or
how it scales with various workloads.
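
For reference, it's a per-session setting, so a caller which needs
read-your-writes can opt in just for the reads that matter. A minimal
sketch (the DSN and table are made up; the variable takes a bitmask
and 1 covers SELECTs):

import sqlalchemy as sa
from sqlalchemy import text

engine = sa.create_engine("mysql+pymysql://user:pw@node-b/test")

with engine.connect() as conn:
    # Wait until this node has applied everything the cluster had
    # committed when the statement arrived, then read.
    conn.execute(text("SET SESSION wsrep_sync_wait = 1"))
    row = conn.execute(text("SELECT * FROM foo WHERE id = 1")).fetchone()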


Because these are semantic issues, they aren't things which can be
easily guarded with an if statement. We can't say:

if galera:
  try:
commit
  except:
rewind time

If we are to support this DB at all, we have to structure code in the
first place to allow for its semantics.

Matt

[1] No, really: I just read a bunch of docs and blogs today. If anybody
who is an expert would like to validate/correct that would be great.

[2]
http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/

[3]
http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo.db][nova] Use of asynchronous slaves in Nova (was: Deprecating use_slave in Nova)

2015-02-02 Thread Matthew Booth
On 30/01/15 19:06, Mike Bayer wrote:
 
 
 Matthew Booth mbo...@redhat.com wrote:
 
 At some point in the near future, hopefully early in L, we're intending
 to update Nova to use the new database transaction management in
 oslo.db's enginefacade.

 Spec:
 http://git.openstack.org/cgit/openstack/oslo-specs/plain/specs/kilo/make-enginefacade-a-facade.rst

 Implementation:
 https://review.openstack.org/#/c/138215/

 One of the effects of this is that we will always know when we are in a
 read-only transaction, or a transaction which includes writes. We intend
 to use this new contextual information to make greater use of read-only
 slave databases. We are currently proposing that if an admin has
 configured a slave database, we will use the slave for *all* read-only
 transactions. This would make the use_slave parameter passed to some
 Nova apis redundant, as we would always use the slave where the context
 allows.

 However, using a slave database has a potential pitfall when mixed with
 separate write transactions. A caller might currently:

 1. start a write transaction
 2. update the database
 3. commit the transaction
 4. start a read transaction
 5. read from the database

 The client might expect data written in step 2 to be reflected in data
 read in step 5. I can think of 3 cases here:

 1. A short-lived RPC call is using multiple transactions

 This is a bug which the new enginefacade will help us eliminate. We
 should not be using multiple transactions in this case. If the reads are
 in the same transaction as the write: they will be on the master, they
 will be consistent, and there is no problem. As a bonus, lots of these
 will be race conditions, and we'll fix at least some.

 2. A long-lived task is using multiple transactions between long-running
 sub-tasks

 In this case, for example creating a new instance, we genuinely want
 multiple transactions: we don't want to hold a database transaction open
 while we copy images around. However, I can't immediately think of a
 situation where we'd write data, then subsequently want to read it back
 from the db in a read-only transaction. I think we will typically be
 updating state, meaning it's going to be a succession of write transactions.

 3. Separate RPC calls from a remote client

 This seems potentially problematic to me. A client makes an RPC call to
 create a new object. The client subsequently tries to retrieve the
 created object, and gets a 404.

 Summary: 1 is a class of bugs which we should be able to find fairly
 mechanically through unit testing. 2 probably isn't a problem in
 practise? 3 seems like a problem, unless consumers of cloud services are
 supposed to expect that sort of thing.

 I understand that slave databases can occasionally get very behind. How
 behind is this in practise?

 How do we use use_slave currently? Why do we need a use_slave parameter
 passed in via rpc, when it should be apparent to the developer whether a
 particular task is safe for out-of-date data?

 Any chance they have some kind of barrier mechanism? e.g. block until
 the current state contains transaction X.

 General comments on the usefulness of slave databases, and the
 desirability of making maximum use of them?
 
 keep in mind that the big win we get from writer()/ reader() is that
writer() can remain pointing to one node in a Galera cluster, and
reader() can point to the cluster as a whole. reader() by default should
definitely refer to the cluster as a whole, that is, “use slave”.
 
 As for issue #3, galera cluster is synchronous replication. Slaves
don’t get “behind” at all. So to the degree that we need to
transparently support some other kind of master/slave where slaves do
get behind, perhaps there would be a reader(synchronous_required=True)
kind of thing; based on configuration, it would be known that
“synchronous” either means we don’t care (using galera) or that we
should use the writer (an asynchronous replication scheme).

This sounds like the crux of the matter to me. After some (admittedly
cursory) reading, it seems that galera can use both synchronous and
asynchronous replication. Up until Friday I had only ever considered
synchronous replication, which would not be a problem.

I think opportunistically using synchronous slaves whenever possible
could only be a win. Are there any unpleasant practicalities which might
mean this isn't the case?

However, it sounds to me like there is at least some OpenStack
deployment in production using asynchronous slaves, otherwise the issue
of 'getting behind' wouldn't have come up. We need to understand:

* Are people actually using asynchronous slaves?
* If so, why did they choose to do that, and
* what are they using them for?

 
 All of this points to the fact that I really don’t think the
directives / flags should say anything about which specific database to
use; using a “slave” or not due to various concerns is dependent on
backend implementation and configuration. The purpose of reader

[openstack-dev] [oslo.db][nova] Deprecating use_slave in Nova

2015-01-30 Thread Matthew Booth
At some point in the near future, hopefully early in L, we're intending
to update Nova to use the new database transaction management in
oslo.db's enginefacade.

Spec:
http://git.openstack.org/cgit/openstack/oslo-specs/plain/specs/kilo/make-enginefacade-a-facade.rst

Implementation:
https://review.openstack.org/#/c/138215/

One of the effects of this is that we will always know when we are in a
read-only transaction, or a transaction which includes writes. We intend
to use this new contextual information to make greater use of read-only
slave databases. We are currently proposing that if an admin has
configured a slave database, we will use the slave for *all* read-only
transactions. This would make the use_slave parameter passed to some
Nova apis redundant, as we would always use the slave where the context
allows.

However, using a slave database has a potential pitfall when mixed with
separate write transactions. A caller might currently:

1. start a write transaction
2. update the database
3. commit the transaction
4. start a read transaction
5. read from the database

The client might expect data written in step 2 to be reflected in data
read in step 5. I can think of 3 cases here:

1. A short-lived RPC call is using multiple transactions

This is a bug which the new enginefacade will help us eliminate. We
should not be using multiple transactions in this case. If the reads are
in the same transaction as the write: they will be on the master, they
will be consistent, and there is no problem. As a bonus, lots of these
will be race conditions, and we'll fix at least some.

2. A long-lived task is using multiple transactions between long-running
sub-tasks

In this case, for example creating a new instance, we genuinely want
multiple transactions: we don't want to hold a database transaction open
while we copy images around. However, I can't immediately think of a
situation where we'd write data, then subsequently want to read it back
from the db in a read-only transaction. I think we will typically be
updating state, meaning it's going to be a succession of write transactions.

3. Separate RPC calls from a remote client

This seems potentially problematic to me. A client makes an RPC call to
create a new object. The client subsequently tries to retrieve the
created object, and gets a 404.

Summary: 1 is a class of bugs which we should be able to find fairly
mechanically through unit testing. 2 probably isn't a problem in
practise? 3 seems like a problem, unless consumers of cloud services are
supposed to expect that sort of thing.
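
To make case 3 concrete, once we're on enginefacade the code will be
roughly this shape (a sketch; the 'things' table is made up, and I'm
assuming reader transactions get routed to the slave):

from oslo_db.sqlalchemy import enginefacade
from sqlalchemy import text

@enginefacade.writer
def create_thing(context, name):
    # Runs against the master inside a write transaction.
    context.session.execute(
        text("INSERT INTO things (name) VALUES (:name)"), {"name": name})

@enginefacade.reader
def get_thing(context, name):
    # May run against the slave: called straight after create_thing()
    # returns, an asynchronous slave might not have the row yet.
    return context.session.execute(
        text("SELECT * FROM things WHERE name = :name"),
        {"name": name}).fetchone()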

I understand that slave databases can occasionally get very behind. How
behind is this in practise?

How do we use use_slave currently? Why do we need a use_slave parameter
passed in via rpc, when it should be apparent to the developer whether a
particular task is safe for out-of-date data?

Any chance they have some kind of barrier mechanism? e.g. block until
the current state contains transaction X.

General comments on the usefulness of slave databases, and the
desirability of making maximum use of them?

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo] [nova] [all] potential enginefacade adjustment - can everyone live with this?

2015-01-26 Thread Matthew Booth
On 23/01/15 19:47, Mike Bayer wrote:
 
 
 Doug Hellmann d...@doughellmann.com wrote:
 


 On Fri, Jan 23, 2015, at 12:49 PM, Mike Bayer wrote:
 Mike Bayer mba...@redhat.com wrote:

 Ihar Hrachyshka ihrac...@redhat.com wrote:

 On 01/23/2015 05:38 PM, Mike Bayer wrote:
 Doug Hellmann d...@doughellmann.com wrote:

 We put the new base class for RequestContext in its own library because
 both the logging and messaging code wanted to influence it's API. Would
 it make sense to do this database setup there, too?
 whoa, where’s that? is this an oslo-wide RequestContext class ? that 
 would
 solve everything b.c. right now every project seems to implement
 RequestContext themselves.


 so Doug -

 How does this “influence of API” occur, would oslo.db import
 oslo_context.context and patch onto RequestContext at that point? Or the
 other way around? Or… ?

 No, it's a social thing. I didn't want dependencies between
 oslo.messaging and oslo.log, but the API of the context needs to support
 use cases in both places.

 Your case might be different, in that we might need to actually have
 oslo.context depend on oslo.db in order to call some setup code. We'll
 have to think about whether that makes sense and what other dependencies
 it might introduce between the existing users of oslo.context.
 
 hey Doug -
 
 for the moment, I have oslo_db.sqlalchemy.enginefacade applying its 
 descriptors at import time onto oslo_context:
 
 https://review.openstack.org/#/c/138215/30/oslo_db/sqlalchemy/enginefacade.py
 
 https://review.openstack.org/gitweb?p=openstack/oslo.db.git;a=blob;f=oslo_db/sqlalchemy/enginefacade.py;h=3f76678a6c9788f62288c8fa5ef520db8dff2c0a;hb=bc33d20dc6db2f8e5f8cb02b4eb5f97d24dafb7a#l692
 
 https://review.openstack.org/gitweb?p=openstack/oslo.db.git;a=blob;f=oslo_db/sqlalchemy/enginefacade.py;h=3f76678a6c9788f62288c8fa5ef520db8dff2c0a;hb=bc33d20dc6db2f8e5f8cb02b4eb5f97d24dafb7a#l498

May I suggest that we decouple these changes by doing both? Oslo's
RequestContext object can have the enginefacade decorator applied to it,
so any project which uses it doesn't have to apply it themselves.
Meanwhile, the decorator remains part of the public api for projects not
using the oslo RequestContext.

I suspect that we'll probably stay with a decorator within oslo, anyway.
It doesn't lend itself well to a Mixin, as we need a reference to a
specific _TransactionContextManager, and moving code directly into
RequestContext would be a very invasive coupling.
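
i.e. roughly this (a sketch; the decorator name is illustrative and the
class bodies are elided):

from oslo_db.sqlalchemy import enginefacade

# In oslo.context: applied once, so consumers get it for free.
@enginefacade.transaction_context_provider
class RequestContext(object):
    pass

# A project with its own context class applies the same decorator
# itself, so it stays part of the public api.
@enginefacade.transaction_context_provider
class ProjectContext(object):
    pass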

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo.db][nova] NovaObject.save() needs its own DB transaction

2014-11-20 Thread Matthew Booth
 back to the justification for removing that facility
 would help us understand the challenges we face going forward?

I don't know Boris. I can reach out to him if you think we need his
input. However, I think the challenges we have today are fairly clear.

 [1] At a slight tangent, this looks like an artifact of some
 premature generalisation a few years ago. It seems unlikely that
 anybody is going to rewrite the db api using an ORM other than
 sqlalchemy, so we should probably ditch it and promote it to
 db/api.py.
 
 We've had a few people ask about it, in terms of rewriting some or
 all of our DB API to talk to a totally non-SQL backend. Further,
 AFAIK, RAX rewrites a few of the DB API calls to use raw SQL
 queries for performance (or did, at one point).
 
 I'm quite happy to have the implementation of Instance.save() make
 use of primitives to ensure atomicity where appropriate. I don't
 think that's something that needs or deserves generalization at
 this point, and I'm not convinced it needs to be in the save method
 itself. Right now we update several things atomically by passing
 something to db/api that gets turned into properly-related SQLA
 objects. I think we could do the same for any that we're currently
 cascading separately, even if the db/api update method uses a
 transaction to ensure safety.

As I mention above, the problem with this approach is the separation
of concerns between NovaObjects and the db api. E.g. InfoCache
jsonifies its data in save() before sending it to the db. If we move
that logic to db.instance_update_and_get_original(), how do we ensure
the same logic applies when we save the info cache directly? It's
certainly achievable, but it's just adding to the mess. My proposal is
safe, efficient, and simple.

Matt

-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

