[openstack-dev] [nova] Would an api option to create an instance without powering on be useful?
I have a request to do $SUBJECT in relation to a V2V workflow. The use case here is conversion of a VM/physical machine which was previously powered off. We want to move its data, but we don't want to be powering on stuff which wasn't previously on.

This would involve an api change, and a hopefully very small change in drivers to support it. Technically I don't see it as an issue. However, is it a change we'd be willing to accept? Is there any good reason not to do this? Are there any less esoteric workflows which might use this feature?

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
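To make the proposal concrete, here is a sketch of what the request might look like. This is entirely hypothetical: no such field or microversion exists in nova's API today, and the `power_on` name is invented for illustration only.

```python
# Hypothetical sketch only: the "power_on" field does not exist in
# nova's server-create API; it illustrates what an opt-out flag for
# the V2V workflow described above might look like.
import json

def build_server_create_request(name, image_ref, flavor_ref, power_on=True):
    """Build a POST /servers body with a hypothetical power_on flag."""
    server = {
        "name": name,
        "imageRef": image_ref,
        "flavorRef": flavor_ref,
    }
    # Only send the flag when the caller opts out of powering on, so a
    # default request looks identical to one from an older client.
    if not power_on:
        server["power_on"] = False
    return json.dumps({"server": server})

# A V2V import of a machine that was powered off at the source:
body = build_server_create_request("v2v-import", "my-image-id", "m1.small",
                                   power_on=False)
print(body)
```

The driver-side change would then be to skip the final power-on step during spawn when the flag is present.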
Re: [openstack-dev] [nova][cinder] Disabling nova volume-update (aka swap volume; aka cinder live migration)
On Wed, 22 Aug 2018 at 10:47, Gorka Eguileor wrote:
>
> On 20/08, Matthew Booth wrote:
> > For those who aren't familiar with it, nova's volume-update (also called swap volume by nova devs) is the nova part of the implementation of cinder's live migration (also called retype). Volume-update is essentially an internal cinder<->nova api, but as that's not a thing it's also unfortunately exposed to users. Some users have found it and are using it, but because it's essentially an internal cinder<->nova api it breaks pretty easily if you don't treat it like a special snowflake. It looks like we've finally found a way it's broken for non-cinder callers that we can't fix, even with a dirty hack.
> >
> > volume-update essentially does a live copy of the data from the old volume to the new volume, then seamlessly swaps the attachment from the old volume to the new one. The guest OS will not notice anything at all as the hypervisor swaps the storage backing an attached volume underneath it.
> >
> > When called by cinder, as intended, cinder does some post-operation cleanup such that the old volume is deleted and the new volume inherits the same volume_id; that is, the new volume effectively becomes the old one. When called any other way, however, this cleanup doesn't happen, which breaks a bunch of assumptions. One of these is that a disk's serial number is the same as the attached volume_id. Disk serial number, in KVM at least, is immutable, so can't be updated during volume-update. This is fine if we were called via cinder, because the cinder cleanup means the volume_id stays the same. If called any other way, however, they no longer match, at least until a hard reboot when it will be reset to the new volume_id. It turns out this breaks live migration, but probably other things too. We can't think of a workaround.
> >
> > I wondered why users would want to do this anyway.
> > It turns out that sometimes cinder won't let you migrate a volume, but nova volume-update doesn't do those checks (as they're specific to cinder internals, none of nova's business, and duplicating them would be fragile, so we're not adding them!). Specifically we know that cinder won't let you migrate a volume with snapshots. There may be other reasons. If cinder won't let you migrate your volume, you can still move your data by using nova's volume-update, even though you'll end up with a new volume on the destination, and a slightly broken instance. Apparently the former is a trade-off worth making, but the latter has been reported as a bug.
>
> Hi Matt,
>
> As you know, I'm in favor of making this REST API call only authorized for Cinder to avoid messing the cloud.
>
> I know you wanted Cinder to have a solution to do live migrations of volumes with snapshots, and while this is not possible to do in a reasonable fashion, I kept thinking about it given your strong feelings to provide a solution for users that really need this, and I think we may have a "reasonable" compromise.
>
> The solution is conceptually simple. We add a new API microversion in Cinder that adds an optional parameter called "generic_keep_source" (defaults to False) to both migrate and retype operations.
>
> This means that if the driver optimized migration cannot do the migration and the generic migration code is the one doing the migration, then, instead of our final step being to swap the volume id's and deleting the source volume, what we would do is to swap the volume id's and move all the snapshots to reference the new volume. Then we would create a user message with the new ID of the volume.
>
> This way we can preserve the old volume with all its snapshots and do the live migration.
> The implementation is a little bit tricky, as we'll have to add a new "update_migrated_volume" mechanism to support the renaming of both volumes, since the old one wouldn't work with this among other things, but it's doable.
>
> Unfortunately I don't have the time right now to work on this...

Sounds promising, and honestly more than I'd have hoped for.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
[openstack-dev] [nova][cinder] Disabling nova volume-update (aka swap volume; aka cinder live migration)
For those who aren't familiar with it, nova's volume-update (also called swap volume by nova devs) is the nova part of the implementation of cinder's live migration (also called retype). Volume-update is essentially an internal cinder<->nova api, but as that's not a thing it's also unfortunately exposed to users. Some users have found it and are using it, but because it's essentially an internal cinder<->nova api it breaks pretty easily if you don't treat it like a special snowflake. It looks like we've finally found a way it's broken for non-cinder callers that we can't fix, even with a dirty hack.

volume-update essentially does a live copy of the data from the old volume to the new volume, then seamlessly swaps the attachment from the old volume to the new one. The guest OS will not notice anything at all as the hypervisor swaps the storage backing an attached volume underneath it.

When called by cinder, as intended, cinder does some post-operation cleanup such that the old volume is deleted and the new volume inherits the same volume_id; that is, the new volume effectively becomes the old one. When called any other way, however, this cleanup doesn't happen, which breaks a bunch of assumptions. One of these is that a disk's serial number is the same as the attached volume_id. Disk serial number, in KVM at least, is immutable, so can't be updated during volume-update. This is fine if we were called via cinder, because the cinder cleanup means the volume_id stays the same. If called any other way, however, they no longer match, at least until a hard reboot when it will be reset to the new volume_id. It turns out this breaks live migration, but probably other things too. We can't think of a workaround.

I wondered why users would want to do this anyway. It turns out that sometimes cinder won't let you migrate a volume, but nova volume-update doesn't do those checks (as they're specific to cinder internals, none of nova's business, and duplicating them would be fragile, so we're not adding them!).
Specifically we know that cinder won't let you migrate a volume with snapshots. There may be other reasons. If cinder won't let you migrate your volume, you can still move your data by using nova's volume-update, even though you'll end up with a new volume on the destination, and a slightly broken instance. Apparently the former is a trade-off worth making, but the latter has been reported as a bug.

I'd like to make it very clear that nova's volume-update isn't expected to work correctly except when called by cinder. Specifically there was a proposal that we disable volume-update from non-cinder callers in some way, possibly by asserting volume state that can only be set by cinder. However, I'm also very aware that users are calling volume-update because it fills a need, and we don't want to trap data that wasn't previously trapped.

Firstly, is anybody aware of any other reasons to use nova's volume-update directly? Secondly, is there any reason why we shouldn't just document that you have to delete snapshots before doing a volume migration? Hopefully some cinder folks or operators can chime in to let me know how to back them up or somehow make them independent before doing this, at which point the volume itself should be migratable?

If we can establish that there's an acceptable alternative to calling volume-update directly for all use-cases we're aware of, I'm going to propose heading off this class of bug by disabling it for non-cinder callers.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
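The serial-number assumption above is visible in the libvirt domain XML nova generates for an attached volume. The fragment below is illustrative only (the UUID and paths are made up); the relevant point is that the `<serial>` element carries the cinder volume_id, and it cannot be changed on a running guest:

```xml
<!-- Illustrative sketch of a libvirt <disk> element for an attached
     cinder volume. The <serial> is set to the cinder volume_id; after
     a non-cinder volume-update the attached volume's real id no longer
     matches this value until a hard reboot regenerates the XML. -->
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/dev/disk/by-path/volume-11111111-2222-3333-4444-555555555555'/>
  <target dev='vdb' bus='virtio'/>
  <serial>11111111-2222-3333-4444-555555555555</serial>
</disk>
```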
Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?
On Mon, 13 Aug 2018 at 16:56, Chris Friesen wrote: > > On 08/13/2018 08:26 AM, Jay Pipes wrote: > > On 08/13/2018 10:10 AM, Matthew Booth wrote: > > >> I suspect I've misunderstood, but I was arguing this is an anti-goal. > >> There's no reason to do this if the db is working correctly, and it > >> would violate the principal of least surprise in dbs with legacy > >> datasets (being all current dbs). These values have always been mixed > >> case, lets just leave them be and fix the db. > > > > Do you want case-insensitive keys or do you not want case-insensitive keys? > > > > It seems to me that people complain that MySQL is case-insensitive by > > default > > but actually *like* the concept that a metadata key of "abc" should be > > "equal > > to" a metadata key of "ABC". > > How do we behave on PostgreSQL? (I realize it's unsupported, but it still has > users.) It's case-sensitive by default, do we override that? > > Personally, I've worked on case-sensitive systems long enough that I'd > actually > be surprised if "abc" matched "ABC". :) To the best of my knowledge, the hypothetical PostgreSQL db works exactly how you, me, and pretty much any developer would expect :) Honestly, though, SQLite is probably more interesting as it's at least used for testing. SQLite's default collation is binary, which is obviously case sensitive as you'd expect. As a developer I'm heavily biased in favour of implementing the simplest fix with the simplest and most obvious behaviour, which is to change the default collation to do what everybody expected it did in the first place (which is what Rajesh's patch does). As Jay points out, though, I do concede that those pesky users may be impacted by fixing this bug if they've come to rely on accidental buggy behaviour. The question really comes down to how we can determine what the user impact is for each solution. And here I'm talking about all the various forms of metadata, assuming that whatever solution we picked we'd apply to all. 
So:

- What API calls allow a user to query a thing by metadata key?

I believe these API calls would be the only things affected by fixing the collation of metadata keys. If we know what they are we can ask what the impact of changing the behaviour would be. Setting metadata keys isn't subject to regression, as this was previously broken.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?
On Mon, 13 Aug 2018 at 15:27, Jay Pipes wrote: > Do you want case-insensitive keys or do you not want case-insensitive keys? > > It seems to me that people complain that MySQL is case-insensitive by > default but actually *like* the concept that a metadata key of "abc" > should be "equal to" a metadata key of "ABC". > > In other words, it seems to me that users actually expect that: > > > nova aggregate-create agg1 > > nova aggregate-set-metadata agg1 abc=1 > > nova aggregate-set-metadata agg1 ABC=2 > > should result in the original "abc" metadata item getting its value set > to "2". Incidentally, this particular example won't work today: it will just throw an error. I believe the same would apply to user metadata on an instance. IOW this particular example doesn't regress if you fix the bug. The regression would be anything user-facing which queries by metadata key. What does that? Matt -- Matthew Booth Red Hat OpenStack Engineer, Compute DFG Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?
On Mon, 13 Aug 2018 at 15:27, Jay Pipes wrote: > > On 08/13/2018 10:10 AM, Matthew Booth wrote: > > On Mon, 13 Aug 2018 at 14:05, Jay Pipes wrote: > >> > >> On 08/13/2018 06:06 AM, Matthew Booth wrote: > >>> Thanks mriedem for answering my previous question, and also pointing > >>> out the related previous spec around just forcing all metadata to be > >>> lowercase: > >>> > >>> (Spec: approved in Newton) https://review.openstack.org/#/c/311529/ > >>> (Online migration: not merged, abandoned) > >>> https://review.openstack.org/#/c/329737/ > >>> > >>> There are other code patches, but the above is representative. What I > >>> had read was the original bug: > >>> > >>> https://bugs.launchpad.net/nova/+bug/1538011 > >>> > >>> The tl;dr is that the default collation used by MySQL results in a bug > >>> when creating 2 metadata keys which differ only in case. The proposal > >>> was obviously to simply make all metadata keys lower case. However, as > >>> melwitt pointed out in the bug at the time that's a potentially user > >>> hostile change. After some lost IRC discussion it seems that folks > >>> believed at the time that to fix this properly would seriously > >>> compromise the performance of these queries. The agreed way forward > >>> was to allow existing keys to keep their case, but force new keys to > >>> be lower case (so I wonder how the above online migration came > >>> about?). > >>> > >>> Anyway, as Rajesh's patch shows, it's actually very easy just to fix > >>> the MySQL misconfiguration: > >>> > >>> https://review.openstack.org/#/c/504885/ > >>> > >>> So my question is, given that the previous series remains potentially > >>> user hostile, the fix isn't as complex as previously believed, and it > >>> doesn't involve a performance penalty, are there any other reasons why > >>> we might want to resurrect it rather than just go with Rajesh's patch? > >>> Or should we ask Rajesh to expand his patch into a series covering > >>> other metadata? 
> >> > >> Keep in mind this patch is only related to *aggregate* metadata, AFAICT. > > > > Right, but the original bug pointed out that the same problem applies > > equally to a bunch of different metadata stores. I haven't verified, > > but the provenance was good ;) There would have to be other patches > > for the other metadata stores. > > Yes, it is quite unfortunate that OpenStack has about 15 different ways > of storing metadata key/value information. > > >> > >> Any patch series that tries to "fix" this issue needs to include all of > >> the following: > >> > >> * input automatically lower-cased [1] > >> * inline (note: not online, inline) data migration inside the > >> InstanceMeta object's _from_db_object() method for existing > >> non-lowercased keys > > > > I suspect I've misunderstood, but I was arguing this is an anti-goal. > > There's no reason to do this if the db is working correctly, and it > > would violate the principal of least surprise in dbs with legacy > > datasets (being all current dbs). These values have always been mixed > > case, lets just leave them be and fix the db. > > Do you want case-insensitive keys or do you not want case-insensitive keys? > > It seems to me that people complain that MySQL is case-insensitive by > default but actually *like* the concept that a metadata key of "abc" > should be "equal to" a metadata key of "ABC". > > In other words, it seems to me that users actually expect that: > > > nova aggregate-create agg1 > > nova aggregate-set-metadata agg1 abc=1 > > nova aggregate-set-metadata agg1 ABC=2 > > should result in the original "abc" metadata item getting its value set > to "2". > > If that isn't the case -- and I have a very different impression of what > users *actually* expect from the CLI/UI -- then let me know. I don't know what users want, tbh, I was simply coming from the POV of not breaking the current behaviour. Although I think you're pointing out that either solution breaks the current behaviour: 1. 
You lower case everything. This breaks users who query user metadata and don't expect keys to be modified. 2. You fix the case sensitivity. This breaks users who add 'Foo' and now expect to query 'foo'. You're saying that although (2) is an artifact of a bug, there could equally be people relying on it. Eurgh. Yeah, that sucks.

Objectively though, I think I still like Rajesh's patch better because:

* It's vastly simpler to implement correctly and verifiably, and therefore also less prone to future bugs.
* It's how it was originally intended to work.
* It's simpler to document.

Of these, the first is by far the most persuasive.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?
On Mon, 13 Aug 2018 at 14:05, Jay Pipes wrote: > > On 08/13/2018 06:06 AM, Matthew Booth wrote: > > Thanks mriedem for answering my previous question, and also pointing > > out the related previous spec around just forcing all metadata to be > > lowercase: > > > > (Spec: approved in Newton) https://review.openstack.org/#/c/311529/ > > (Online migration: not merged, abandoned) > > https://review.openstack.org/#/c/329737/ > > > > There are other code patches, but the above is representative. What I > > had read was the original bug: > > > > https://bugs.launchpad.net/nova/+bug/1538011 > > > > The tl;dr is that the default collation used by MySQL results in a bug > > when creating 2 metadata keys which differ only in case. The proposal > > was obviously to simply make all metadata keys lower case. However, as > > melwitt pointed out in the bug at the time that's a potentially user > > hostile change. After some lost IRC discussion it seems that folks > > believed at the time that to fix this properly would seriously > > compromise the performance of these queries. The agreed way forward > > was to allow existing keys to keep their case, but force new keys to > > be lower case (so I wonder how the above online migration came > > about?). > > > > Anyway, as Rajesh's patch shows, it's actually very easy just to fix > > the MySQL misconfiguration: > > > > https://review.openstack.org/#/c/504885/ > > > > So my question is, given that the previous series remains potentially > > user hostile, the fix isn't as complex as previously believed, and it > > doesn't involve a performance penalty, are there any other reasons why > > we might want to resurrect it rather than just go with Rajesh's patch? > > Or should we ask Rajesh to expand his patch into a series covering > > other metadata? > > Keep in mind this patch is only related to *aggregate* metadata, AFAICT. 
Right, but the original bug pointed out that the same problem applies equally to a bunch of different metadata stores. I haven't verified, but the provenance was good ;) There would have to be other patches for the other metadata stores.

> Any patch series that tries to "fix" this issue needs to include all of the following:
>
> * input automatically lower-cased [1]
> * inline (note: not online, inline) data migration inside the InstanceMeta object's _from_db_object() method for existing non-lowercased keys

I suspect I've misunderstood, but I was arguing this is an anti-goal. There's no reason to do this if the db is working correctly, and it would violate the principle of least surprise in dbs with legacy datasets (being all current dbs). These values have always been mixed case, let's just leave them be and fix the db.

> * change the collation of the aggregate_metadata.key column (note: this will require an entire rebuild of the table, since this column is part of a unique constraint [3]

Rajesh's patch changes the collation of the table, which I would assume applies to its columns? I assume this is going to be a moderately expensive, but one-off, operation similar in cost to adding a new unique constraint.

> * online data migration for migrating non-lowercased keys to their lowercased counterparts (essentially doing `UPDATE key = LOWER(key) WHERE LOWER(key) != key` once the collation has been changed)
>
> None of the above touches the API layer. I suppose some might argue that the REST API should be microversion-bumped since the expected behaviour of the API will change (data will be transparently changed in one version of the API and not another). I don't personally think that's something I would require a microversion for, but who knows what others may say.

Again, I was considering this an anti-goal. As I understand, Rajesh's patch removes the requirement to make this api change. What did I miss?
Thanks,
Matt

> Best,
> -jay
>
> [1] https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L295
> and https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L331
> and https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L356
>
> [2] https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/objects/aggregate.py#L248
>
> [3] https://github.com/openstack/nova/blob/16f89fd093217d22530570e8277b561ea79f46ff/nova/db/sqlalchemy/api_models.py#L64
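The online data migration Jay suggests above can be modelled in miniature. This is a standalone sketch using sqlite3 purely for illustration (nova would do this as an online migration over its real schema, and MySQL would need the column name quoted):

```python
# Toy model of the proposed online migration: lower-case any existing
# mixed-case metadata keys once the collation is case-insensitive.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aggregate_metadata (key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO aggregate_metadata (key, value) VALUES (?, ?)",
    [("ssd", "true"), ("Rack", "r1"), ("ZONE", "az2")],
)

# The migration itself: only rows whose key is not already lower case
# are touched, so re-running it is a no-op.
conn.execute(
    "UPDATE aggregate_metadata SET key = LOWER(key) WHERE LOWER(key) != key"
)

keys = sorted(k for (k,) in conn.execute("SELECT key FROM aggregate_metadata"))
print(keys)  # ['rack', 'ssd', 'zone']
```

Note this is exactly the user-visible change under discussion: after the migration, a user who stored 'Rack' now sees 'rack'.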
[openstack-dev] [nova] Do we still want to lowercase metadata keys?
Thanks mriedem for answering my previous question, and also pointing out the related previous spec around just forcing all metadata to be lowercase:

(Spec: approved in Newton) https://review.openstack.org/#/c/311529/
(Online migration: not merged, abandoned) https://review.openstack.org/#/c/329737/

There are other code patches, but the above is representative. What I had read was the original bug:

https://bugs.launchpad.net/nova/+bug/1538011

The tl;dr is that the default collation used by MySQL results in a bug when creating 2 metadata keys which differ only in case. The proposal was obviously to simply make all metadata keys lower case. However, as melwitt pointed out in the bug at the time that's a potentially user hostile change. After some lost IRC discussion it seems that folks believed at the time that to fix this properly would seriously compromise the performance of these queries. The agreed way forward was to allow existing keys to keep their case, but force new keys to be lower case (so I wonder how the above online migration came about?).

Anyway, as Rajesh's patch shows, it's actually very easy just to fix the MySQL misconfiguration:

https://review.openstack.org/#/c/504885/

So my question is, given that the previous series remains potentially user hostile, the fix isn't as complex as previously believed, and it doesn't involve a performance penalty, are there any other reasons why we might want to resurrect it rather than just go with Rajesh's patch? Or should we ask Rajesh to expand his patch into a series covering other metadata?

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
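The class of bug described above can be reproduced in miniature. The sketch below uses sqlite3's NOCASE collation as a stand-in for MySQL's case-insensitive default collation: with a unique constraint on the key column, two keys differing only in case collide.

```python
# Miniature reproduction of the collation bug: under a case-insensitive
# collation, inserting 'abc' and then 'ABC' into a unique key column
# fails, because the two keys compare equal. NOCASE here stands in for
# MySQL's case-insensitive default collation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (key TEXT COLLATE NOCASE, value TEXT, UNIQUE (key))"
)
conn.execute("INSERT INTO metadata VALUES ('abc', '1')")
try:
    conn.execute("INSERT INTO metadata VALUES ('ABC', '2')")
    outcome = "inserted"
except sqlite3.IntegrityError:
    # 'ABC' == 'abc' under the case-insensitive collation.
    outcome = "duplicate key"
print(outcome)  # duplicate key
```

With a binary (case-sensitive) collation, which is what Rajesh's patch effectively restores, both inserts succeed and the two keys coexist.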
[openstack-dev] [nova] CI job running functional against a mysql DB
I was reviewing https://review.openstack.org/#/c/504885/ . The change looks good to me and I believe the test included exercises the root cause of the problem. However, I'd like to be certain that the test has been executed against MySQL rather than, eg, SQLite. Zuul has voted +1 on the change. Can anybody tell me if any of those jobs ran the included functional test against a MySQL DB?

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed
On 6 June 2018 at 13:55, Jay Pipes wrote: > On 06/06/2018 07:46 AM, Matthew Booth wrote: >> >> TL;DR I think we need to entirely disable swap volume for multiattach >> volumes, and this will be an api breaking change with no immediate >> workaround. >> >> I was looking through tempest and came across >> >> api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach. >> This test does: >> >> Create 2 multiattach volumes >> Create 2 servers >> Attach volume 1 to both servers >> ** Swap volume 1 for volume 2 on server 1 ** >> Check all is attached as expected >> >> The problem with this is that swap volume is a copy operation. > > > Is it, though? The original blueprint and implementation seem to suggest > that the swap_volume operation was nothing more than changing the mountpoint > for a volume to point to a different location (in a safe > manner that didn't lose any reads or writes). > > https://blueprints.launchpad.net/nova/+spec/volume-swap > > Nothing about the description of swap_volume() in the virt driver interface > mentions swap_volume() being a "copy operation": > > https://github.com/openstack/nova/blob/76ec078d3781fb55c96d7aaca4fb73a74ce94d96/nova/virt/driver.py#L476 > >> We don't just replace one volume with another, we copy the contents >> from one to the other and then do the swap. We do this with a qemu >> drive mirror operation, which is able to do this copy safely without >> needing to make the source read-only because it can also track writes >> to the source and ensure the target is updated again. Here's a link >> to the libvirt logs showing a drive mirror operation during the swap >> volume of an execution of the above test: > > After checking the source code, the libvirt virt driver is the only virt > driver that implements swap_volume(), so it looks to me like a public HTTP > API method was added that was specific to libvirt's implementation of drive > mirroring. 
Yay, more implementation leaking out through the API. > >> >> http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201 >> >> The problem is that when the volume is attached to more than one VM, >> the hypervisor doing the drive mirror *doesn't* know about writes on >> the other attached VMs, so it can't do that copy safely, and the >> result is data corruption. > > > Would it be possible to swap the volume by doing what Vish originally > described in the blueprint: pause the VM, swap the volume mountpoints > (potentially after migrating the underlying volume), start the VM? > >> > Note that swap volume isn't visible to the >> >> guest os, so this can't be addressed by the user. This is a data >> corrupter, and we shouldn't allow it. However, it is in released code >> and users might be doing it already, so disabling it would be a >> user-visible api change with no immediate workaround. > > > I'd love to know who is actually using the swap_volume() functionality, > actually. I'd especially like to know who is using swap_volume() with > multiattach. > >> However, I think we're attempting to do the wrong thing here anyway, >> and the above tempest test is explicit testing behaviour that we don't >> want. The use case for swap volume is that a user needs to move volume >> data for attached volumes, e.g. to new faster/supported/maintained >> hardware. > > > Is that the use case? > > As was typical, there's no mention of a use case on the original blueprint. > It just says "This feature allows a user or administrator to transparently > swap out a cinder volume that connected to an instance." Which is hardly a > use case since it uses the feature name in a description of the feature > itself. :( > > The commit message (there was only a single commit for this functionality > [1]) mentions overwriting data on the new volume: > > Adds support for transparently swapping an attached volume with > another volume. 
Note that this overwrites all data on the new volume > with data from the old volume. > > Yes, that is the commit message in its entirety. Of course, the commit had > no documentation at all in it, so there's no ability to understand what the > original use case really was here. > > https://review.openstack.org/#/c/28995/ > > If the use case was really "that a user needs to move volume data for > attached volumes", why not just pause the VM, detach the volume, do a > openstack volume migrate to the new destination, reattach the volume and > start the VM? Th
[openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed
TL;DR I think we need to entirely disable swap volume for multiattach volumes, and this will be an api breaking change with no immediate workaround.

I was looking through tempest and came across api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach. This test does:

Create 2 multiattach volumes
Create 2 servers
Attach volume 1 to both servers
** Swap volume 1 for volume 2 on server 1 **
Check all is attached as expected

The problem with this is that swap volume is a copy operation. We don't just replace one volume with another, we copy the contents from one to the other and then do the swap. We do this with a qemu drive mirror operation, which is able to do this copy safely without needing to make the source read-only because it can also track writes to the source and ensure the target is updated again. Here's a link to the libvirt logs showing a drive mirror operation during the swap volume of an execution of the above test:

http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201

The problem is that when the volume is attached to more than one VM, the hypervisor doing the drive mirror *doesn't* know about writes on the other attached VMs, so it can't do that copy safely, and the result is data corruption. Note that swap volume isn't visible to the guest os, so this can't be addressed by the user. This is a data corrupter, and we shouldn't allow it. However, it is in released code and users might be doing it already, so disabling it would be a user-visible api change with no immediate workaround.

However, I think we're attempting to do the wrong thing here anyway, and the above tempest test is explicitly testing behaviour that we don't want. The use case for swap volume is that a user needs to move volume data for attached volumes, e.g. to new faster/supported/maintained hardware. With single attach that's exactly what they get: the end user should never notice.
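The corruption argument can be sketched as a toy model (nothing to do with qemu's actual implementation): the mirroring host replays only the writes it observes onto the target, so writes from a second attachment reach the source but never the target.

```python
# Toy model of mirroring a multi-attached volume. Volumes are dicts of
# block offset -> data. The mirror only sees writes from the host doing
# the drive mirror, so a second writer silently diverges the copy.
def mirror(source, writes_seen_by_mirror, all_writes):
    # All writers hit the authoritative source volume...
    for offset, data in all_writes:
        source[offset] = data
    # ...but the mirror target only receives the writes the mirroring
    # host observed.
    target = {}
    for offset, data in writes_seen_by_mirror:
        target[offset] = data
    return target

source = {}
writes_vm1 = [(0, "a"), (1, "b")]  # the VM whose host does the mirror
writes_vm2 = [(1, "X")]            # second attachment, invisible to the mirror
target = mirror(source, writes_vm1, writes_vm1 + writes_vm2)
print(target == source)  # False: block 1 is "X" on source but "b" on target
```

In the single-attach case `all_writes` and `writes_seen_by_mirror` are the same list, and the copy is exact, which is why swap volume is safe there.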
With multi-attach they don't get that. We're basically forking the shared volume at a point in time, with the instance which did the swap writing to the new location while all others continue writing to the old location. Except that even the fork is broken, because they'll get a corrupt, inconsistent copy rather than a point-in-time one. I can't think of a use case for this behaviour, and it certainly doesn't meet the original design intent.

What they really want is for the multi-attached volume to be copied from location A to location B and for all attachments to be updated. Unfortunately I don't think we're going to be in a position to do that any time soon, but I also think users will be unhappy if they're no longer able to move data at all just because it's multi-attach. We can compromise, though, if we allow a multiattach volume to be moved as long as it only has a single attachment. This means the operator can't move this data without disruption to users, but at least it's not fundamentally immovable. This would require some cooperation with cinder to achieve, as we need to be able to temporarily prevent cinder from allowing new attachments. A natural way to achieve this would be to allow a multiattach volume with only a single attachment to be redesignated not multiattach, but there might be others. The flow would then be:

* Detach volume from server 2
* Set multiattach=False on volume
* Migrate volume on server 1
* Set multiattach=True on volume
* Attach volume to server 2

Combined with a patch to nova to disallow swap_volume on any multiattach volume, this would then be possible, if inconvenient.

Regardless of any other changes, though, I think it's urgent that we disable the ability to swap_volume a multiattach volume, because we don't want users to start using this relatively new, but broken, feature.
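[Editorial note: a minimal sketch of the kind of nova-side guard proposed above, refusing swap_volume when the old volume is multiattach. The exception and function names are invented for illustration and are not actual nova code; the 'multiattach' boolean is a real field on cinder volume dicts.]

```python
class MultiattachSwapVolumeNotSupported(Exception):
    """Swap volume was requested for a multiattach volume."""


def check_swap_allowed(old_volume):
    """Refuse to swap a multiattach volume.

    old_volume is a volume dict as returned by the cinder API;
    'multiattach' is a boolean field on cinder volumes.
    """
    if old_volume.get('multiattach'):
        raise MultiattachSwapVolumeNotSupported(
            "Swapping multiattach volume %s would corrupt data on its "
            "other attachments" % old_volume.get('id'))


# Single-attach volumes pass through; multiattach volumes are refused.
check_swap_allowed({'id': 'vol-1', 'multiattach': False})
```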
Matt

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] Heads up for out-of-tree drivers: supports_recreate -> supports_evacuate
On 19 April 2018 at 16:46, Chris Friesen <chris.frie...@windriver.com> wrote:
> On 04/19/2018 08:33 AM, Jay Pipes wrote:
>> On 04/19/2018 09:15 AM, Matthew Booth wrote:
>>> We've had inconsistent naming of recreate/evacuate in Nova for a long
>>> time, and it will persist in a couple of places for a while more.
>>> However, I've proposed the following to rename 'recreate' to
>>> 'evacuate' everywhere with no rpc/api impact here:
>>>
>>> https://review.openstack.org/560900
>>>
>>> One of the things which is renamed is the driver 'supports_recreate'
>>> capability, which I've renamed to 'supports_evacuate'. The above
>>> change updates this for in-tree drivers, but as noted in review this
>>> would impact out-of-tree drivers. If this might affect you, please
>>> follow the above in case it merges.
>>
>> I have to admit, Matt, I'm a bit confused by this. I was under the
>> impression that we were trying to *remove* uses of the term "evacuate"
>> as much as possible because that term is not adequately descriptive of
>> the operation and terms like "recreate" were more descriptive?
>
> This is a good point.
>
> Personally I'd prefer to see it go the other way and convert everything
> to the "recreate" terminology, including the external API.
>
> From the CLI perspective, it makes no sense that "nova evacuate" operates
> after a host is already down, but "nova evacuate-live" operates on a
> running host.

A bit OT, but evacuate-live probably shouldn't exist at all for a variety of reasons: the implementation is shonky, it's doing orchestration in the CLI, and the name is misleading, as you say.

Matt
Re: [openstack-dev] [nova] Heads up for out-of-tree drivers: supports_recreate -> supports_evacuate
On 19 April 2018 at 15:33, Jay Pipes <jaypi...@gmail.com> wrote:
> On 04/19/2018 09:15 AM, Matthew Booth wrote:
>> We've had inconsistent naming of recreate/evacuate in Nova for a long
>> time, and it will persist in a couple of places for a while more.
>> However, I've proposed the following to rename 'recreate' to
>> 'evacuate' everywhere with no rpc/api impact here:
>>
>> https://review.openstack.org/560900
>>
>> One of the things which is renamed is the driver 'supports_recreate'
>> capability, which I've renamed to 'supports_evacuate'. The above
>> change updates this for in-tree drivers, but as noted in review this
>> would impact out-of-tree drivers. If this might affect you, please
>> follow the above in case it merges.
>
> I have to admit, Matt, I'm a bit confused by this. I was under the
> impression that we were trying to *remove* uses of the term "evacuate" as
> much as possible because that term is not adequately descriptive of the
> operation and terms like "recreate" were more descriptive?

I'm ambivalent, tbh, but I think it's better to pick one. I thought we'd picked 'evacuate' based on the TODOs from Matt R:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n2985
http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n3093

Incidentally, this isn't at all core to what I'm working on, but I'm about to start poking it and thought I'd tidy up as I go (as is my wont). If there's discussion to be had I don't mind dropping this and moving on.

Matt
[openstack-dev] [nova] Heads up for out-of-tree drivers: supports_recreate -> supports_evacuate
We've had inconsistent naming of recreate/evacuate in Nova for a long time, and it will persist in a couple of places for a while more. However, I've proposed the following to rename 'recreate' to 'evacuate' everywhere with no rpc/api impact here:

https://review.openstack.org/560900

One of the things which is renamed is the driver 'supports_recreate' capability, which I've renamed to 'supports_evacuate'. The above change updates this for in-tree drivers, but as noted in review this would impact out-of-tree drivers. If this might affect you, please follow the above in case it merges.

Matt
Re: [openstack-dev] [Nova] z/VM introducing a new config driveformat
On 18 April 2018 at 15:10, Dan Smith <d...@danplanet.com> wrote: >> Thanks for the concern and fully under it , the major reason is >> cloud-init doesn't have a hook or plugin before it start to read >> config drive (ISO disk) z/VM is an old hypervisor and no way to do >> something like libvirt to define a ISO format disk in xml definition, >> instead, it can define disks in the defintion of virtual machine and >> let VM to decide its format. >> >> so we need a way to tell cloud-init where to find ISO file before >> cloud-init start but without AE, we can't handle that...some update on >> the spec here for further information >> https://review.openstack.org/#/c/562154/ > > The ISO format does not come from telling libvirt something about > it. The host creates and formats the image, adds the data, and then > attaches it to the instance. The latter part is the only step that > involves configuring libvirt to attach the image to the instance. The > rest is just stuff done by nova-compute (and the virt driver) on the > linux system it's running on. That's the same arrangement as your > driver, AFAICT. > > You're asking the system to hypervisor (or something running on it) to > grab the image from glance, pre-filled with data. This is no different, > except that the configdrive image comes from the system running the > compute service. I don't see how it's any different in actual hypervisor > mechanics, and thus feel like there _has_ to be a way to do this without > the AE magic agent. Having briefly read the cloud-init snippet which was linked earlier in this thread, the requirement seems to be that the guest exposes the device as /dev/srX or /dev/cdX. So I guess in order to make this work: * You need to tell z/VM to expose the virtual disk as an optical disk * The z/VM kernel needs to call optical disks /dev/srX or /dev/cdX > I agree with Mikal that needing more agent behavior than cloud-init does > a disservice to the users. 
> I feel like we get a lot of "but no, my hypervisor is special!"
> reasoning when people go to add a driver to nova. So far, I think
> they're a lot more similar than people think. Ironic is the weirdest one
> we have (IMHO and no offense to the ironic folks) and it can support
> configdrive properly.

I was going to ask this. Even if the contents of the disk can't be transferred in advance... how does ironic do this? There must be a way.

Matt
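[Editorial note: a toy illustration of the device-naming constraint discussed above — per the thread, cloud-init's config-drive detection only considers CD-ROM-style block devices such as /dev/srX or /dev/cdX. The regex and helper below are assumptions for illustration, not cloud-init's actual code.]

```python
import re

# Only names like sr0 or cdb count as CD-ROM-style devices under the
# behaviour described in the thread.
CDROM_NAME = re.compile(r'^(sr\d+|cd[a-z])$')


def candidate_configdrive_devices(devices):
    """Filter /dev paths down to CD-ROM-style names."""
    return [d for d in devices if CDROM_NAME.match(d.rsplit('/', 1)[-1])]


# A z/VM DASD exposed under some other device name would never match,
# which is why the disk must be presented as an optical device.
candidate_configdrive_devices(['/dev/sr0', '/dev/vda', '/dev/dasda'])
```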
Re: [openstack-dev] [cinder][nova] about re-image the volume
On 6 April 2018 at 09:31, Gorka Eguileor <gegui...@redhat.com> wrote: > On 05/04, Matt Riedemann wrote: >> On 4/5/2018 3:15 AM, Gorka Eguileor wrote: >> > But just to be clear, Nova will have to initialize the connection with >> > the re-imagined volume and attach it again to the node, as in all cases >> > (except when defaulting to downloading the image and dd-ing it to the >> > volume) the result will be a new volume in the backend. >> >> Yeah I think I pointed this out earlier in this thread on what I thought the >> steps would be on the nova side with respect to creating a new empty >> attachment to keep the volume 'reserved' while we delete the old attachment, >> re-image the volume, and then update the volume attachment for the new >> connection. I think that would be similar to how shelve and unshelve works >> in nova. >> >> Would this really require a swap volume call from Cinder? I'd hope not since >> swap volume in itself is a pretty gross operation on the nova side. >> >> -- >> >> Thanks, >> >> Matt >> > > Hi Matt, > > Yes, it will require a volume swap, with the worst case scenario > exception where we dd the image into the volume. I think you're talking at cross purposes here: this won't require a swap volume. Apart from anything else, swap volume only works on an attached volume, and as previously discussed Nova will detach and re-attach. Gorka, the Nova api Matt is referring to is called volume update externally. It's the operation required for live migrating an attached volume between backends. It's called swap volume internally in Nova. Matt > > In the same way that anyone would expect a re-imaging preserving the > volume id, one would also expect it to behave like creating a new volume > from the same image: be as fast and take up as much space on the > backend. > > And to do so we have to use existing optimized mechanisms that will only > work when creating a new volume. 
> The alternative would be to have the worst case scenario as the default
> (attach and dd the image) and make *ALL* Cinder drivers implement the
> optimized mechanism where they can efficiently re-imagine a volume. I
> can't talk for the Cinder team, but I for one would oppose this
> alternative.
>
> Cheers,
> Gorka.
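[Editorial note: the nova-side sequence Matt R describes earlier in this thread — reserve the volume with a new empty attachment, delete the old attachment, re-image, then update the attachment with the host connector — can be sketched as below. Every method name on the client is an assumption for illustration; this is not the real cinderclient API.]

```python
def reimage_attached_volume(cinder, volume_id, old_attachment_id,
                            server_id, connector):
    # 1. Create a new empty attachment so the volume stays 'reserved'.
    reserve = cinder.attachment_create(volume_id, server_id)
    # 2. Delete the old attachment; the guest is detached at this point.
    cinder.attachment_delete(old_attachment_id)
    # 3. Ask cinder to re-image the volume. The backend may efficiently
    #    swap in a newly created volume under the same volume id.
    cinder.reimage(volume_id)
    # 4. Update the reserved attachment with the host connector to get
    #    connection info for the (possibly new) backend volume.
    return cinder.attachment_update(reserve['id'], connector)


class FakeCinder:
    """Stand-in client that just records the call order."""
    def __init__(self):
        self.calls = []

    def attachment_create(self, volume_id, server_id):
        self.calls.append('attachment_create')
        return {'id': 'attachment-new'}

    def attachment_delete(self, attachment_id):
        self.calls.append('attachment_delete')

    def reimage(self, volume_id):
        self.calls.append('reimage')

    def attachment_update(self, attachment_id, connector):
        self.calls.append('attachment_update')
        return {'connection_info': {}}
```

The ordering is the point: the reserving attachment is created before the old one is deleted, so the volume is never left unreserved mid-operation.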
Re: [openstack-dev] [nova] about rebuild instance booted from volume
On 14 March 2018 at 13:46, Matt Riedemann <mriede...@gmail.com> wrote: > On 3/14/2018 3:42 AM, 李杰 wrote: > >> >> This is the spec about rebuild a instance booted from >> volume.In the spec,there is a >>question about if we should delete the old root_volume.Anyone who >> is interested in >>booted from volume can help to review this. Any suggestion is >> welcome.Thank you! >>The link is here. >>Re:the rebuild spec:https://review.openstack.org/#/c/532407/ >> > > Copying the operators list and giving some more context. > > This spec is proposing to add support for rebuild with a new image for > volume-backed servers, which today is just a 400 failure in the API since > the compute doesn't support that scenario. > > With the proposed solution, the backing root volume would be deleted and a > new volume would be created from the new image, similar to how boot from > volume works. > > The question raised in the spec is whether or not nova should delete the > root volume even if its delete_on_termination flag is set to False. The > semantics get a bit weird here since that flag was not meant for this > scenario, it's meant to be used when deleting the server to which the > volume is attached. Rebuilding a server is not deleting it, but we would > need to replace the root volume, so what do we do with the volume we're > replacing? > > Do we say that delete_on_termination only applies to deleting a server and > not rebuild and therefore nova can delete the root volume during a rebuild? > > If we don't delete the volume during rebuild, we could end up leaving a > lot of volumes lying around that the user then has to clean up, otherwise > they'll eventually go over quota. > > We need user (and operator) feedback on this issue and what they would > expect to happen. > My 2c was to overwrite, not delete the volume[1]. I believe this preserves both sets of semantics: the server is rebuilt, and the volume is not deleted. 
The user will still lose their data, of course, but that's implied by the rebuild they explicitly requested. The volume id will remain the same.

[1] I suspect this would require new functionality in cinder to re-initialize from image.

Matt
Re: [openstack-dev] [Nova] [Cyborg] Tracking multiple functions
On 2 March 2018 at 14:31, Jay Pipes <jaypi...@gmail.com> wrote: > On 03/02/2018 02:00 PM, Nadathur, Sundar wrote: > >> Hello Nova team, >> >> During the Cyborg discussion at Rocky PTG, we proposed a flow for >> FPGAs wherein the request spec asks for a device type as a resource class, >> and optionally a function (such as encryption) in the extra specs. This >> does not seem to work well for the usage model that I’ll describe below. >> >> An FPGA device may implement more than one function. For example, it may >> implement both compression and encryption. Say a cluster has 10 devices of >> device type X, and each of them is programmed to offer 2 instances of >> function A and 4 instances of function B. More specifically, the device may >> implement 6 PCI functions, with 2 of them tied to function A, and the other >> 4 tied to function B. So, we could have 6 separate instances accessing >> functions on the same device. >> > Does this imply that Cyborg can't reprogram the FPGA at all? > >> In the current flow, the device type X is modeled as a resource class, so >> Placement will count how many of them are in use. A flavor for ‘RC >> device-type-X + function A’ will consume one instance of the RC >> device-type-X. But this is not right because this precludes other >> functions on the same device instance from getting used. >> >> One way to solve this is to declare functions A and B as resource classes >> themselves and have the flavor request the function RC. Placement will then >> correctly count the function instances. However, there is still a problem: >> if the requested function A is not available, Placement will return an >> empty list of RPs, but we need some way to reprogram some device to create >> an instance of function A. >> > > Clearly, nova is not going to be reprogramming devices with an instance of > a particular function. 
> Cyborg might need to have a separate agent that listens to the nova
> notifications queue and upon seeing an event that indicates a failed build
> due to lack of resources, then Cyborg can try and reprogram a device and
> then try rebuilding the original request.

It was my understanding from that discussion that we intend to insert Cyborg into the spawn workflow for device configuration in the same way that we currently insert resources provided by Cinder and Neutron. So while Nova won't be reprogramming a device, it will be calling out to Cyborg to reprogram a device, and waiting while that happens.

My understanding is (and I concede some areas are a little hazy):

* The flavor says device type X with function Y
* Placement tells us everywhere with device type X
* A weigher orders these by devices which already have an available function Y (where is this metadata stored?)
* Nova schedules to host Z
* Nova host Z asks cyborg for a local function Y and blocks
* Cyborg hopefully returns function Y which is already available
* If not, Cyborg reprograms a function Y, then returns it

Can anybody correct me/fill in the gaps?

Matt
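[Editorial note: the weigher step in the flow above can be sketched as follows — hosts whose devices already expose free instances of the requested function sort first, so Cyborg wouldn't need to reprogram anything. The data shapes are invented for illustration; this is not the real nova weigher API.]

```python
def weigh_hosts(hosts, requested_function):
    """Order hosts, most free instances of the requested function first."""
    def free_instances(host):
        return sum(dev['functions'].get(requested_function, 0)
                   for dev in host['devices'])
    return sorted(hosts, key=free_instances, reverse=True)


hosts = [
    # Device programmed with 0 free instances of function A, 4 of B.
    {'name': 'hostA', 'devices': [{'functions': {'A': 0, 'B': 4}}]},
    # Device programmed with 2 free instances of function A, 2 of B.
    {'name': 'hostB', 'devices': [{'functions': {'A': 2, 'B': 2}}]},
]

# For function A, hostB wins: it already has instances available.
ordered = weigh_hosts(hosts, 'A')
```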
Re: [openstack-dev] [cinder][nova] Update attachments on replication failover
A couple of thoughts:

* Sounds like the work Nova will have to do is identical to volume update (swap volume), i.e. change where a disk's backing store is without actually changing the disk.

* Multi-attach! There might be more than 1 instance per volume, and we can't currently support volume update for multi-attached volumes.

Matt

On 27 February 2018 at 09:45, Matt Riedemann <mriede...@gmail.com> wrote:
> On 2/26/2018 9:52 PM, John Griffith wrote:
>> Yeah, it seems like this would be pretty handy with what's there. So
>> are folks good with that? Wanted to make sure there's nothing contentious
>> there before I propose a spec on the Nova and Cinder sides. If you think
>> it seems at least worth proposing I'll work on it and get something ready
>> as a welcome home from Dublin gift for everyone :)
>
> I'll put it on the nova/cinder PTG etherpad agenda for Thursday morning.
> This seems like simple plumbing on the nova side, so not any major
> problems from me.
>
> --
>
> Thanks,
>
> Matt
Re: [openstack-dev] [nova] Requesting eyes on fix for bug 1686703
On 31 January 2018 at 16:32, Matt Riedemann <mriede...@gmail.com> wrote:
> On 1/31/2018 7:30 AM, Matthew Booth wrote:
>> Could I please have some eyes on this bugfix:
>> https://review.openstack.org/#/c/462521/ . I addressed an issue raised
>> in August 2017, and it's had no negative feedback since. It would be
>> good to get this one finished.
>
> First, I'd like to avoid setting a precedent of asking for reviews in the
> ML. So please don't do this.

I don't generally do this, but I think a polite request after 6 months or so is reasonable when something has fallen through the cracks.

> Second, this is a latent issue, and we're less than two weeks to RC1, so
> I'd prefer that we hold this off until Rocky opens up in case it
> introduces any regressions so we at least have time to deal with those
> when we're not in stop-ship mode.

That's fine. Looks like I have new feedback to address in the meantime anyway.

Matt
[openstack-dev] [nova] Requesting eyes on fix for bug 1686703
Could I please have some eyes on this bugfix: https://review.openstack.org/#/c/462521/ . I addressed an issue raised in August 2017, and it's had no negative feedback since. It would be good to get this one finished.

Thanks,
Matt
Re: [openstack-dev] [nova]Nova rescue inject pasword failed
On 29 January 2018 at 09:27, 李杰 <li...@unitedstack.com> wrote:
> Hi all:
> I want to access my instance in the rescue state using the temporary
> password which nova rescue gave me, but this password doesn't work. Can
> I ask how this password is injected into the instance? I can't find any
> specification of how it is done. I saw the code about rescue, but it
> indicates the password has been injected. I use libvirt as the virt
> driver. The web said to set "[libvirt]inject_password=true", but it
> didn't work. Is it a bug? Can you give me some advice? Help in
> troubleshooting this issue will be appreciated.

Ideally your rescue image will support cloud-init and you would use a config disk. For password injection to work you need inject_password=True, inject_partition=-1 (*NOT* -2, which is the default), and for libguestfs to be correctly installed on your compute hosts. But to reiterate, ideally your rescue image would support cloud-init and you would use a config disk.

Matt
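[Editorial note: concretely, the settings described above would look like this in nova.conf on the compute hosts, with libguestfs also installed there. Values are taken directly from the message.]

```ini
[libvirt]
# Enable admin password injection into the guest's disk.
inject_password = true
# -1 means inspect the image to find the root partition.
# The default, -2, disables injection entirely.
inject_partition = -1
```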
Re: [openstack-dev] [nova] Native QEMU LUKS decryption review overview ahead of FF
On 24 January 2018 at 22:57, Matt Riedemann <mriede...@gmail.com> wrote:
> On 1/22/2018 8:22 AM, Lee Yarwood wrote:
>> Hello,
>>
>> With M3 and FF rapidly approaching this week I wanted to post a brief
>> overview of the QEMU native LUKS series.
>>
>> The full series is available on the following topic, I'll go into more
>> detail on each of the changes below:
>>
>> https://review.openstack.org/#/q/topic:bp/libvirt-qemu-native-luks+status:open
>>
>> libvirt: Collocate encryptor and volume driver calls
>> https://review.openstack.org/#/c/460243/ (Missing final +2 and +W)
>>
>> This refactor of the Libvirt driver connect and disconnect volume code
>> has the added benefit of also correcting a number of bugs around the
>> attaching and detaching of os-brick encryptors. IMHO this would be
>> useful in Queens even if the rest of the series doesn't land.
>>
>> libvirt: Introduce disk encryption config classes
>> https://review.openstack.org/#/c/464008/ (Missing final +2 and +W)
>>
>> This is the most straightforward change of the series and simply
>> introduces the required config classes to wire up native LUKS
>> decryption within the domain XML of an instance. Hopefully nothing
>> controversial.
>>
>> libvirt: QEMU native LUKS decryption for encrypted volumes
>> https://review.openstack.org/#/c/523958/ (Missing both +2s and +W)
>>
>> This change carries the bulk of the implementation, wiring up encrypted
>> volumes during their initial attachment. The commit message has a
>> detailed rundown of the various upgrade and LM corner cases we attempt
>> to handle here, such as LM from a P to Q compute, detaching a
>> P-attached encrypted volume after upgrading to Q, etc.
>> Upgrade and LM testing is enabled by the following changes:
>>
>> fixed_key: Use a single hardcoded key across devstack deployments
>> https://review.openstack.org/#/c/536343/
>>
>> compute: Introduce an encrypted volume LM test
>> https://review.openstack.org/#/c/536177/
>>
>> This is being tested by tempest-dsvm-multinode-live-migration and
>> grenade-dsvm-neutron-multinode-live-migration in the following DNM Nova
>> change, enabling volume backed LM tests:
>>
>> DNM: Test LM with encrypted volumes
>> https://review.openstack.org/#/c/536350/
>>
>> Hopefully that covers everything but please feel free to ping if you
>> would like more detail, background etc. Thanks in advance,
>>
>> Lee
>
> The patch is already approved, and I asked melwitt to write a release
> note, at which point it was noted that swap volume will not work with
> native luks encrypted volumes. That's a regression.

It's only a regression since swap_volume with encrypted volumes was fixed in https://review.openstack.org/#/c/460243/, which landed on Monday as part of this series. Prior to Monday, swap_volume with encrypted volumes would result in the raw encrypted volume being presented to the guest after the swap.

> We need to at least report a nova bug for this so we can work on some
> kind of fallback to the non-native decryption until there is a
> libvirt/qemu fix upstream and we can put version conditionals in place
> for when we can support swap volume with a native luks-encrypted volume.

In the context of the above, I don't think this is a priority as clearly nobody is currently doing it. There's already a bug to track the problem in libvirt, which is linked in a code comment.
Admittedly that BZ is unnecessarily private, which I noted in review, but we've reached out to the author to ask them to open it up as there's nothing sensitive going on in there.

In general, anything qemu can do natively makes Nova both simpler and more robust, because we don't have to modify the host configuration. This eliminates a whole class of error states and race conditions, because when we kill qemu there's nothing left to clean up.

Matt
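[Editorial note: for reference, the "wired up" domain XML carries a LUKS encryption element referencing a libvirt secret, roughly as below. This is an illustrative fragment, not output from the series: the device path and secret uuid are placeholders, and element placement follows the libvirt schema in which <encryption> may appear under <source>.]

```xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdb'>
    <!-- qemu decrypts natively; the host never sees plaintext setup -->
    <encryption format='luks'>
      <secret type='passphrase' uuid='0e6b9a91-0000-0000-0000-000000000000'/>
    </encryption>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
```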
Re: [openstack-dev] [nova] Local disk serial numbers series for reviewers: update 9th Jan
On 9 January 2018 at 15:28, Matthew Booth <mbo...@redhat.com> wrote:
> In summary, the patch series is here:
>
> https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/local-disk-serial-numbers
>
> The bottom 3 patches, which add BDM.uuid, have landed. The next 3
> currently have a single +2. Since I last posted I have found and fixed a
> problem in swap_volume, which added 2 more patches to the series. There
> are currently 13 outstanding patches in the series.
>
> The following 6 patches are the 'crux' patches. The others in the series
> are related fixes/cleanups (mostly renaming things and fixing tests)
> which I've moved into separate patches to reduce noise.
>
> Add DriverLocalImageBlockDevice:
> https://review.openstack.org/#/c/526347/6

This now has a +2!

> Add local_root to block_device_info:
> https://review.openstack.org/#/c/529029/6
>
> Pass DriverBlockDevice to driver.attach_volume
> https://review.openstack.org/#/c/528363/
>
> Expose volume host type and path independent of libvirt config
> https://review.openstack.org/#/c/530786/
>
> Don't generate fake disk_info in swap_volume
> https://review.openstack.org/#/c/530787/
>
> Local disk serial numbers for the libvirt driver
> https://review.openstack.org/#/c/529380/

These remain the crux patches. Some of the simpler cleanup patches have also attracted +2s. As I've had to rebase the series due to merge conflicts a couple of times, I've moved these to the bottom so they can potentially go straight in.

Cleanup patches with a single +2 already:

https://review.openstack.org/#/c/531179/
https://review.openstack.org/#/c/526346/
https://review.openstack.org/#/c/528362/

Thanks,
Matt
[openstack-dev] [nova] Local disk serial numbers series for reviewers: update 9th Jan
In summary, the patch series is here:

https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/local-disk-serial-numbers

The bottom 3 patches, which add BDM.uuid, have landed. The next 3 currently have a single +2. Since I last posted I have found and fixed a problem in swap_volume, which added 2 more patches to the series. There are currently 13 outstanding patches in the series.

The following 6 patches are the 'crux' patches. The others in the series are related fixes/cleanups (mostly renaming things and fixing tests) which I've moved into separate patches to reduce noise.

Add DriverLocalImageBlockDevice:
https://review.openstack.org/#/c/526347/6

Add local_root to block_device_info:
https://review.openstack.org/#/c/529029/6

Pass DriverBlockDevice to driver.attach_volume
https://review.openstack.org/#/c/528363/

Expose volume host type and path independent of libvirt config
https://review.openstack.org/#/c/530786/

Don't generate fake disk_info in swap_volume
https://review.openstack.org/#/c/530787/

Local disk serial numbers for the libvirt driver
https://review.openstack.org/#/c/529380/

Thanks,
Matt
[openstack-dev] [nova] Overview of local disk serial numbers series for reviewers
This series is ready for review. As normal I've done my best to break it up into logically separate changes, so it's currently 15 patches long. I've also done my best to write commit messages with the reviewer in mind. If something looks weird, please check if I called it out in the commit message. This email is an overview of the series to help see all 15 patches in context.

The spec is here:
https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/local-disk-serial-numbers.html

The patch series is here:
https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/local-disk-serial-numbers

* Add a uuid to BlockDeviceMapping

https://review.openstack.org/242602 Add uuid column to BlockDeviceMapping
https://review.openstack.org/242603 Make BlockDeviceMapping object support uuid
https://review.openstack.org/525599 Add an online migration for BDM.uuid

The first 2 were actually originally Dan Smith's patches from 2015, resurrected for the second time. I've cribbed heavily from the original patches and also the recent migration uuid patches. I've added a bunch of robustification to the potentially racy bits, though, so they differ in some details.

We want the libvirt driver to use this uuid as a serial number when creating a guest. We now need to expose it to the libvirt driver for all disks.

* Expose BlockDeviceMapping.uuid to drivers

https://review.openstack.org/524167 DriverBlockDevice: make subclasses inherit _proxy_as_attr
https://review.openstack.org/529037 Expose BDM uuid to drivers

Drivers don't consume BlockDeviceMapping directly. Instead, BlockDeviceMapping objects are translated to DriverBlockDevice objects and passed in the block_device_info struct. These 2 changes simply add uuid to existing DriverBlockDevice objects.
* Expose the local root disk BDM to drivers

https://review.openstack.org/526346 Give volume DriverBlockDevice classes a common prefix
https://review.openstack.org/526347 Add DriverLocalImageBlockDevice
https://review.openstack.org/529028 Rename block_device_info_get_root
https://review.openstack.org/529029 Add local_root to block_device_info

Unfortunately, for reasons I'm still not entirely clear on, block_device_info has never contained a DriverBlockDevice object representing a local root disk. These patches create a new DriverBlockDevice class for a local root disk, and add it to block_device_info. We do this in such a way that drivers which don't expect it to be there will never notice it. At this point in the series, drivers have access to BDM.uuid for all of an instance's disks.

* Fix an array of incorrect uses of DriverBlockDevice and block_device_info

https://review.openstack.org/528362 Expose driver_block_device fields as attributes
https://review.openstack.org/528363 Pass DriverBlockDevice to driver.attach_volume
https://review.openstack.org/527916 Use real block_device_info data in libvirt tests
https://review.openstack.org/529328 Fix libvirt volume tests passing invalid disk_info

We're going to make changes which require BDM.uuid to be available. Unfortunately we hardcode BDM data all over the place. That's all going to break, even when the change isn't relevant to the test in question, which it isn't 95% of the time. This mini-series focuses on using real data everywhere we can, which reduces a ton of noise later. The specific driver for the second patch (to driver.attach_volume) is that the libvirt driver needs a DriverBlockDevice there, and was creating a fake one in non-test code.

At this point in the series all relevant test and non-test code is actually using the new data. This means we can make changes in the libvirt driver which require this new data without breaking everything.
* Implement local disk serial numbers for the libvirt driver

https://review.openstack.org/529329 Pass disk_info dict to libvirt_info
https://review.openstack.org/529380 Local disk serial numbers for the libvirt driver

Here we finally make the libvirt driver-specific changes to expose BDM.uuid as a serial number for local disks.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] Boston Forum session recap - searchlight integration
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
[openstack-dev] [nova] Multicell paginated instance lists
Following the forum session yesterday I've thrown up this pseudocode for discussion: https://review.openstack.org/463618

I want to stress that I consider nova proxying to searchlight to be a premature optimisation rather than an intrinsically terrible idea. That said, the searchlight solution looks very complex for both developers and operators, and very fragile. I think we'd be better off going with a relatively simple solution like this one first, and only going a couple of orders of magnitude more complex if it turns out to be absolutely essential.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
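For context on why a simple approach can work here: listing instances across cells is essentially a k-way merge of per-cell result sets that each arrive already sorted by the requested key. A rough sketch of the core operation, assuming each cell returns its page pre-sorted (the function name and record shape are illustrative, not the proposed API):

```python
import heapq

def paginated_list(cell_results, sort_key, limit, marker=None):
    """Merge per-cell sorted instance lists into one page.

    cell_results: iterable of lists of dicts, each list already sorted
                  by sort_key.
    marker: the 'name' of the last instance on the previous page, or None.
    """
    # heapq.merge lazily interleaves the sorted streams without
    # materialising the full cross-cell result set.
    merged = heapq.merge(*cell_results, key=lambda inst: inst[sort_key])
    page = []
    seen_marker = marker is None
    for inst in merged:
        if not seen_marker:
            # Skip everything up to and including the marker.
            if inst['name'] == marker:
                seen_marker = True
            continue
        page.append(inst)
        if len(page) == limit:
            break
    return page
```

The complexity the searchlight proposal avoids is re-querying every cell on every page; the trade-off above is that each cell must be asked for up to `limit` rows per page, which is the kind of cost that only matters at very large cell counts.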
[openstack-dev] [nova] User data loss bug on error during resize (migration datamodel issue)
(Sorry: resent properly tagged)

We've encountered a bug in resize which resulted in data loss. The gist is that a user was resizing a qcow2 instance whose image had been deleted from glance. In driver.finish_migration on the destination host, an error occurred attempting to copy the image from the source host's image cache, putting the instance into an error state. Note that instance.host has been set to the destination host before finish_migration runs. When the image cache cleanup ran on the source host, the instance was no longer in the list of expected instances on that host because instance.host == dest. The image cache manager expired the image from the cache, and there was no other copy of the image.

Let's ignore the root cause of the side-loading error, because that's the type of transient error which can always occur. I'm looking for a way to avoid deleting the image from the image cache in future until the resize operation has completed. The obvious way to do this is to update the instance list generated in ComputeManager._run_image_cache_manager_pass to consider not only instances where instance.host is in the node list, but also any instance with a migration record whose source/dest is in the node list.

The problem with this is that the data model doesn't seem to allow us to fetch the currently active migration. Following the error above, the errors_out_migration decorator on finish_resize has set the migration to an error state. AFAICT this is never deleted, so the presence of a migration in an error state only means that a migration involving this instance has occurred in the past. It doesn't mean that it's currently relevant, so it's basically meaningless.

Firstly, have I missed any semantics of the migration record which might allow me to unambiguously identify currently relevant migrations, whether in an error state or otherwise? That would be ideal, and I'd just go with that. If not, how about adding an active migration field to instance? I don't think it would ever make sense to have more than one current migration for a given instance. It would be set back to NULL when the migration was complete, and we'd at least have an opportunity to do something explicit with migrations in an error state.

In the meantime I'm going to look for more backportable avenues to fix this. Perhaps not updating instance.host until after finish_migration.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
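The proposed fix amounts to widening the "instances on this host" filter that the image cache manager uses. A sketch of just the filtering logic (a hypothetical free function, not the actual ComputeManager code; attribute names follow nova's Migration object loosely):

```python
def instances_to_keep(instances, migrations, my_nodes):
    """Return the set of instance uuids whose cached images must not expire.

    instances:  objects with .uuid and .host
    migrations: currently-relevant migration records with .instance_uuid,
                .source_compute and .dest_compute
    my_nodes:   set of compute hostnames this cache manager pass covers
    """
    keep = {i.uuid for i in instances if i.host in my_nodes}
    # An instance mid-resize may already have instance.host == dest while
    # its image cache entry still lives on the source, so consider both
    # ends of any in-flight migration.
    for m in migrations:
        if m.source_compute in my_nodes or m.dest_compute in my_nodes:
            keep.add(m.instance_uuid)
    return keep
```

The hard part, as the email says, is not this set arithmetic but producing a trustworthy `migrations` input: without a way to distinguish a live migration record from a stale errored one, this filter would pin images forever.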
Re: [openstack-dev] [nova][ceilometer][postgresql][gate][telemetry] PostgreSQL gate failure (again)
On Thu, Feb 2, 2017 at 4:42 PM, Sean Dague <s...@dague.net> wrote:

> On 02/02/2017 10:33 AM, Mike Bayer wrote:
> > On 02/01/2017 10:22 AM, Monty Taylor wrote:
> >> I personally continue to be of the opinion that without an explicit vocal and well-staffed champion, supporting postgres is more trouble than it is worth. The vast majority of OpenStack deployments are on MySQL - and what's more, the code is written with MySQL in mind. Postgres and MySQL have different trade offs, different things each are good at and different places in which each has weakness. By attempting to support Postgres AND MySQL, we prevent ourselves from focusing adequate attention on making sure that our support for one of them is top-notch and in keeping with best practices for that database.
> >>
> >> So let me state my opinion slightly differently. I think we should support one and only one RDBMS backend for OpenStack, and we should open ourselves up to use advanced techniques for that backend. I don't actually care whether that DB is MySQL or Postgres - but the corpus of existing deployments on MySQL and the existing gate jobs I think make the choice one way or the other simple.
> >
> > well, let me blow your mind and agree, but noting that this means, *we drop SQLite also*. IMO every openstack developer should have MySQL/MariaDB running on their machine and that is part of what runs if you expect to run database-related unit tests. Targeting just one database is very handy but if you really want to use the features without roadblocks, you need to go all the way.
>
> That's all fine and good, we just need to rewrite about 100,000 unit tests to do that. I'm totally cool with someone taking that task on, but making a decision about postgresql shouldn't be filibustered on rewriting all the unit tests in OpenStack because of the ways we use sqlite.
I wrote a patch series to optionally run all our unit tests using MySQL instead of sqlite a couple of years ago, and it wasn't that hard at the time. The biggest issue I recall was fixing up tests which assumed sub-second timestamp granularity, which MySQL did not support at the time (but may now). IIRC the series died because we killed the fixture I was using in oslo.db without replacement before my series finished landing. Fundamentally it wasn't that hard, though.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
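The sub-second granularity issue is easy to guard against regardless of backend: truncate timestamps to whole seconds before comparing, rather than assuming the database round-trips microseconds. A minimal helper along those lines (an illustrative workaround, not the oslo.db fixture mentioned above; MySQL only gained fractional-second storage for DATETIME in 5.6.4):

```python
from datetime import datetime


def to_db_granularity(ts):
    """Truncate a datetime to whole seconds, matching what an older
    MySQL DATETIME column would round-trip."""
    return ts.replace(microsecond=0)


def timestamps_equal(a, b):
    """Compare two timestamps at the granularity the DB can store,
    so a test passes whether the backend kept microseconds or not."""
    return to_db_granularity(a) == to_db_granularity(b)
```

A test that writes `datetime.utcnow()` to the DB and reads it back would then compare with `timestamps_equal()` instead of `==`, which is the class of fixup the patch series above had to make all over the tree.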
[openstack-dev] [nova] Status of imagebackend refactor patches
I'm continuing to work through updating the imagebackend refactor patches. A bunch of them at the top of the queue already have +2 from jaypipes, who expressed an interest in getting some of these merged. I think the following could be ready to go (in order):

* https://review.openstack.org/#/c/337159/ libvirt: Rewrite _test_finish_migration [1]
* https://review.openstack.org/#/c/338993/ libvirt: Test disk creation in test_hard_reboot
* https://review.openstack.org/#/c/339114/ libvirt: Cleanup test_create_configdrive
* https://review.openstack.org/#/c/333272/ libvirt: Rename Backend snapshot and image [2]
* https://review.openstack.org/#/c/331115/ libvirt: Never copy a swap disk during cold migration
* https://review.openstack.org/#/c/331118/ libvirt: Don't re-resize disks in finish_migration() [3]

[1] This has a -1 from melwitt, but I don't personally think this issue is worth a respin. However, she mentioned on IRC that there may be more, so that could be moot. Anyway, being the top of the queue this is obviously a blocker.
[2] This doesn't have +2 from jaypipes, although it's uncontroversial and uncomplicated. Hopefully just an oversight?
[3] This patch removes the function which melwitt identified in [1] as having lost some test coverage, which is why I'm hoping not to respin for that.

I'll leave it there for the moment, because I need to talk through the next patch with ftersin. It looks like it will require a release note at least, if not a change.

If anybody would like to slog through some of the above and add a second +2 I'd be very grateful. There's plenty more in the queue after those!

Thanks,
Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
[openstack-dev] [nova] Follow up on BCN review cadence discussions
I'd like to follow up on the discussions we had in Barcelona around review cadence. I took a lot away from these discussions, and I thought they were extremely productive. I think the summary of the concerns was:

* Nova is a complex beast; very few people know even most of it well.
* There are areas of Nova where mistakes are costly and hard to rectify later.
* A large amount of good code does not merge quickly.
* The pool of active core reviewers is very much smaller than the total pool of contributors.
* The barrier to entry for Nova core is very high.

There was more, but I think most people agreed on the above.

Just before summit I pitched a subsystem maintainer model here: http://lists.openstack.org/pipermail/openstack-dev/2016-September/104093.html

Reading over this again, I still think this is worth trialling: it pushes the burden of initial detailed review further out, whilst ensuring that a core will still check that no wider issues have been missed. Bearing in mind that the primary purpose is to reduce the burden on core reviewers of any individual patch, I think this will work best if we aggressively push initial review down as far as possible.

I'm still not sure what the best description of 'domain' is, though. Other projects seem to have defined this as: the reviewer should, in good faith, know when they're competent to give a +2 and when they're not. I suspect this is too loose for us, but I'm sure we could come up with something.

I'd expect the review burden of a maintainer of an active area of code to be as heavy as a core reviewer's, so this probably shouldn't be taken lightly. If we were to trial it, I'd also recommend starting with a small number of maintainers (about 3, perhaps), preferably technical leaders of active sub-teams. Any trial should be the topic of a retrospective at the PTG.

I'd like to re-iterate that what I'd personally like to achieve is: "Good code should merge quickly."
There are likely other ways to achieve this, and if any of them are better than this one then I'm in favour. However, I think we can try this one with limited risk and initial up-front effort.

Thanks,
Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
[openstack-dev] [nova] Moving storage placement out of the driver
For context, this is speculative: should we, shouldn't we?

The VMware driver currently allows the user to specify what storage their instance will use, I believe using a flavor extra-spec. We've also got significant interest in adding the same to the libvirt driver, but we've got some additional plumbing work to do first to allow us to persist this metadata per-instance, independent of the host configuration. It was my intention to add this logic to the libvirt driver along similar lines.

At the same time, we're adding resource providers. It's my layman's understanding that a storage pool (in the abstract, not referring to a particular implementation) will be modelled as a resource provider. A host can, theoretically at least, have arbitrarily many of these, and one may be attached to multiple hosts.

As it stands today, ComputeManager specifies what disks an instance should have, and how large they should be. Where to put them is down to the driver. If we're modelling this outside the driver and at least 2 drivers are implementing it, I wonder if we shouldn't be implementing storage policy at a higher level than the driver.

Thoughts?

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] [libvirt] Debugging blockRebase() - "active block copy not ready for pivot"
On Mon, Oct 10, 2016 at 5:54 PM, Matt Riedemann <mrie...@linux.vnet.ibm.com> wrote:

> On 10/10/2016 6:26 AM, Kashyap Chamarthy wrote:
>> Also, maybe you have noticed, given Eric's reply on this thread (and the upstream libvir-list), it is agreed that virDomainGetBlockJobInfo() should be fixed "to quit reporting cur==end when ready:false". This allows the existing Nova code work correctly without any changes.
>>
>> Discussion on Eric's reply in this thread:
>> http://lists.openstack.org/pipermail/openstack-dev/2016-October/105194.html
>>
>> And upstream libvir-list:
>> https://www.redhat.com/archives/libvir-list/2016-October/msg00347.html
>
> Yeah, that's goodness long-term, but OpenStack CI won't benefit from that for probably at least a couple of years.

The hypervisor is a (the?) critical component of any cloud deployment. Objectively, it's bizarre that we expect people to deploy our brand new code to work around things that were fixed in the hypervisor 2 years ago. Given that's where we are, though, I agree.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
Re: [openstack-dev] [nova][libvirt] Lets make libvirt's domain XML canonical
On Fri, Sep 30, 2016 at 4:38 PM, Murray, Paul (HP Cloud) <pmur...@hpe.com> wrote:

> On 27/09/2016, 18:12, "Daniel P. Berrange" <berra...@redhat.com> wrote:
>
> >On Tue, Sep 27, 2016 at 10:40:34AM -0600, Chris Friesen wrote:
> >> On 09/27/2016 10:17 AM, Matthew Booth wrote:
> >> > I think we should be able to create a domain, but once created we should never redefine a domain. We can do adding and removing devices dynamically using libvirt's apis, secure in the knowledge that libvirt will persist this for us. When we upgrade the host, libvirt can ensure we don't break guests which are on it. Evacuate should be pretty much the only reason to start again.
> >>
> >> Sounds interesting. How would you handle live migration?
> >>
> >> Currently we regenerate the XML file on the destination from the nova DB. I guess in your proposal we'd need some way of copying the XML file from the source to the dest, and then modifying the appropriate segments to adjust things like CPU/NUMA pinning?
> >
> >Use the flag VIR_MIGRATE_PERSIST_XML and libvirt will write out the new persistent XML on the target host automatically.
>
> We have a problem migrating rescued instances that has a fix in progress based on regenerating the xml on unrescue, see:
>
> https://blueprints.launchpad.net/nova/+spec/live-migrate-rescued-instances
>
> That might be another case for generating the xml.

I thought it was a counter-argument (unless I've misunderstood). If you migrate the instance as-is without modification, you don't need to worry about whether it's currently a rescue instance. This problem goes away. The major complication I can think of is things which genuinely must change during a migration, for example cpu pinning.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
Re: [openstack-dev] [nova][cinder] reason(s) for not booting instances from (bootable) volumes
Nothing architectural that I can think of; we just don't support it. Rebuild currently takes an image, but I guess it could reasonably take most of the arguments to spawn. We'd have to think hard about what it meant for arguments to be present/omitted, i.e. are we specifying new disks, or redefining old ones, but if the intent could be represented in the api it could certainly be implemented.

Matt

On Fri, Sep 30, 2016 at 12:32 PM, Tikkanen, Viktor (Nokia - FI/Espoo) <viktor.tikka...@nokia.com> wrote:

> Hi!
>
> I refer here to my question sent earlier to AskOpenstack:
>
> https://ask.openstack.org/en/question/97425/why-it-is-not-possible-to-rebuild-an-instance-from-a-bootable-volume/
>
> Are there some (architectural, performance, security etc.) reasons why rebuilding instances from (bootable) volumes is not implemented/supported?
>
> -Viktor

--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
[openstack-dev] [nova][libvirt] Lets make libvirt's domain XML canonical
Currently the libvirt driver (mostly) considers the nova db canonical. That is, we can throw away libvirt's domain XML at any time and recreate it from Nova. Anywhere that doesn't assume this is a bug, because whatever direction we choose we don't need 2 different sources of truth. The thinking behind this is that we should always know what we told libvirt, and if we lose that information then that's a bug. This is true to a degree, and it's the reason I proposed the persistent instance storage metadata spec: we lose track of how we configured an instance's storage.

I realised recently that this isn't the whole story, though. Libvirt also automatically creates a bunch of state for us which we didn't specify explicitly. We lose this every time we drop it and recreate. For example, consider device addressing and ordering:

$ nova boot ...

We tell libvirt to give us a root disk, config disk, and a memballoon device (amongst other things). Libvirt assigns pci addresses to all of these things.

$ nova volume-attach ...

We tell libvirt to create a new disk attached to the given volume. Libvirt assigns it a pci address.

$ nova reboot

We throw away libvirt's domain xml and create a new one from scratch. Libvirt assigns new addresses for all of these devices. Before reboot, the device order was: root disk, config disk, memballoon, volume. After reboot the device order is: root disk, volume, config disk, memballoon. Not only have all our devices changed address, which makes Windows sad and paranoid about its licensing, and causes it to offline volumes under certain circumstances, but our disks have been reordered.

This isn't all we've thrown away, though. Libvirt also gave us a default machine type. When we create a new domain we'll get a new default machine type. If libvirt has been upgraded, eg during host maintenance, this isn't necessarily what it was before. Again, this can make our guests sad.
Same goes for CPU model, default devices, and probably many more things I haven't thought of. Also... we lost the storage configuration of the guest: the information I propose to persist in persistent instance storage metadata.

We could store all of this information in Nova, but with the possible exception of storage metadata it really isn't at the level of 'management': it's the minutia of the hypervisor. In order to persist all of these things in Nova we'd have to implement them explicitly, and when libvirt/kvm grows more stuff we'll have to do that too. We'll need to mirror the functionality of libvirt in Nova, feature for feature. This is a red flag for me, and I think it means we should switch to libvirt being canonical.

I think we should be able to create a domain, but once created we should never redefine a domain. We can do adding and removing devices dynamically using libvirt's apis, secure in the knowledge that libvirt will persist this for us. When we upgrade the host, libvirt can ensure we don't break guests which are on it. Evacuate should be pretty much the only reason to start again.

This would potentially obsolete my persistent instance metadata spec, and the libvirt stable rescue spec, as well as this one: https://review.openstack.org/#/c/347161/

I raised this in the live migration sub-team meeting, and the immediate response was understandably conservative. I think this solves more problems than it creates, though, and it would result in Nova's libvirt driver getting a bit smaller and a bit simpler. That's a big win in my book.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
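To make the device-address point concrete: when Nova defines a domain without explicit `<address>` elements, libvirt fills them in, and that assignment lives only in libvirt's persistent XML, not in the nova db. The snippet below pulls those assignments out of a trimmed, hypothetical domain XML (the slot values are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Trimmed example of libvirt-assigned state Nova never specified.
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk device='disk'>
      <target dev='vda'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk device='disk'>
      <target dev='vdb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>
  </devices>
</domain>
"""

def disk_addresses(xml):
    """Map target dev -> PCI slot: exactly the state that is thrown
    away when the domain XML is regenerated from the nova db."""
    root = ET.fromstring(xml)
    return {
        disk.find('target').get('dev'): disk.find('address').get('slot')
        for disk in root.findall('.//disk')
    }
```

Regenerating the XML without these elements lets libvirt re-assign them in whatever order the devices are now listed, which is the reordering and relicensing problem described above.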
[openstack-dev] [nova] TL; DR A proposed subsystem maintainer model
* +A is reserved for cores.
* A subsystem maintainer's domain need not be defined by the directory tree: subteams are a good initial model.
* A maintainer should only +2 code in their own domain in good faith, enforced socially, not technically.
* A subsystem maintainer is expected to request specific additional +2s for patches touching hot areas, eg rpc, db, api.
* Hot areas are also good candidate domains for subsystem maintainers.
* Hot area review need not cover the whole patch if it's not required: "I am +2 on the DB change in this patch."

This model means that code with +2 from a maintainer only requires a single +2 from a core. We could implement this incrementally by defining a couple of pilot subsystem maintainer domains.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
Re: [openstack-dev] [NOVA] why it take long time between the unplug_vif and update port host binding while doing live migrate VM
I've been working on a related problem, and have also noticed this delay. These calls happen on separate compute hosts, and lots of things happen in between. Although the goal of my patch isn't to address the issue you describe, I have noticed that this patch significantly reduces it anyway: https://review.openstack.org/#/c/369423/

I'd be interested to know if you see the same in your environment. As an unmerged RPC change, obviously don't go putting this in production!

Matt

On Thu, Sep 8, 2016 at 10:21 AM, Jingting <kangjingt...@sina.com> wrote:

> Hi All:
>
> As [1], [2] shows, at the last stage in VM migration, it will call unplug_vifs(), and then migrate_instance_finish(), in which update_port will be called immediately.
>
> I think the two steps should be executed with no disturbance, but in my test, there is at least 3 seconds between unplug vif and update_port.
>
> As I want to do something with my controller at the two points, I have tested it many times, and the lasting time remains about 3s.
>
> Any comment will be appreciated! Thanks!
>
> [1] http://paste.openstack.org/show/198298/
> [2] http://paste.openstack.org/show/568983/

--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
Re: [openstack-dev] [infra] Reliable way to filter CI in gerrit spam?
On Wed, Aug 31, 2016 at 5:31 PM, Jeremy Stanley <fu...@yuggoth.org> wrote:

> On 2016-08-31 16:13:16 +0200 (+0200), Jordan Pittier wrote:
> > Most (all?) messages from CI have the lines:
> >
> > "Patch Set X:
> > Build (succeeded|failed)."
> >
> > Not super robust, but that's a start.
>
> Also we have naming conventions for third-party CI accounts that suggest they should end in " CI" so you could match on that.

Yeah, all except 'Jenkins' :) This makes me a bit nervous because all mail still comes from 'rev...@openstack.org', with the name component set to the name of the CI. I was nervous of false positives, so I chose to name them all in full.

> Further, we have configuration in Gerrit to not send E-mail for comments from accounts in https://review.openstack.org/#/admin/groups/270,members so if you are seeing E-mail from Gerrit for third-party CI system comments see whether they're in that group already (in which case let the Infra team know because we might have a bug to look into) or ask one of the members of https://review.openstack.org/#/admin/groups/440,members to add the stragglers (or even volunteer to become part of that coordinators group and help maintain the list).

This sounds interesting. All the CIs I get gerrit spam from are on that list except Jenkins. Do I have to enable something specifically to exclude them? I mashed all the links I could find seemingly related to gerrit settings and I couldn't find anything which looked promising.

Thanks again,
Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
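Putting the two heuristics from this thread together, a classifier along these lines is about the best available without a maintained bot list: match the " CI" naming-convention suffix, keep 'Jenkins' as a special case, and fall back to the "Build succeeded/failed" comment pattern. This is a sketch, not a recommendation that the patterns are complete:

```python
import re

# "Patch Set X:" followed (possibly after a blank line) by the build result.
BUILD_RE = re.compile(r'^Patch Set \d+:.*\s+Build (succeeded|failed)',
                      re.MULTILINE)

def is_ci_mail(from_name, body):
    """Heuristically classify a gerrit notification as CI-generated.

    from_name: the display-name component of the From header
    body: the plain-text message body
    """
    if from_name == 'Jenkins' or from_name.endswith(' CI'):
        return True
    return BUILD_RE.search(body) is not None
```

A procmail or sieve rule built from the same two patterns would do the job equally well; the weakness either way is exactly the one discussed above, since nothing forces a third-party CI to follow the naming convention or the comment format.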
[openstack-dev] [infra] Reliable way to filter CI in gerrit spam?
I've just (re-)written an email filter which splits gerrit emails into those from CI and those from real people. In general I'm almost never interested in botmail, but they comprise about 80% of gerrit email. Having looked carefully through some gerrit emails from real people and CIs, I unfortunately can't find any features which distinguish the CI. Consequently my filter is just a big enumeration of current known CIs. This is kinda ugly, and will obviously get out of date. Is there anything I missed? Or is it possible to unsubscribe from gerrit mail from bots? Or is there any other good way to achieve what I'm looking for which doesn't involve maintaining my own bot list? If not, would it be feasible to add something? Thanks, Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
[openstack-dev] [nova] Overview of the libvirt instance storage series
Patches here, in order: * libvirt: Improve mocking of imagebackend disks https://review.openstack.org/#/c/333242/ This provides a single, common way for driver tests to mock imagebackend. There were a few existing approaches to this which had various deficiencies. This new one replaces them all, and allows easy inspection by driver tests of what went on. The series relies on this interface a lot, which is why I highlight it here. * libvirt: Add create_from_image and create_from_func to Backend https://review.openstack.org/#/c/333244/ This is the requested 'shim' layer. It updates the driver to call the new interfaces, and implements these new interfaces in terms of the old interfaces. Note that we don't make any significant changes to existing tests in this change, which validates that the shim layer continues to call the old interface in the same way as before. The immediately following change updates the tests to use the new interfaces. * libvirt: Add DiskFromImage and DiskFromFunc https://review.openstack.org/#/c/333522/ This is the first part of the change to deconstruct _create_image(). These interfaces allow us to return a 'creatable' disk, which has not actually been created. The caller can then decide whether the disk should be created. Over the next few changes we create some helper methods which return iterators over creatable disks. * libvirt: Don't call _create_image from finish_migration https://review.openstack.org/#/c/337160/ This removes the problematic caller of _create_image. finish_migration doesn't create disks, so supporting it with _create_image and Image.cache() was tortured. The new method is explicit about what needs to happen, and to which disks. * libvirt: Replace _create_images_and_backing in pre_live_migration https://review.openstack.org/#/c/342224/ This is a big change to pre_live_migration. The purpose is to ensure that disks created for live migration are created identically to how they were originally created.
* libvirt: Add create_from_image & create_from_func for Qcow2 https://review.openstack.org/#/c/320610/ The native implementation of the new backend interfaces for the Qcow2 backend. I'd like to apologise in advance for dropping this immediately before heading out on holiday for a week. I will be back at work on Monday 1st August, and I obviously won't be able to respond to review feedback before then. In my absence, Diana Clarke may be able to respond. This is a big series, though, so I'd greatly appreciate eyes on all of it, even if some of the earlier patches receive -1s. If we could knock off even some of the earlier test patches, the patch bombs I keep dropping on gerrit will get a bit smaller ;) Thanks, Matt [1] https://blueprints.launchpad.net/nova/+spec/libvirt-instance-storage [2] Flat, Qcow2, Lvm, Rbd, Ploop [3] For recent examples see stable libvirt rescue, and device tagging. -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] libvirt driver: who should create the libvirt.xml file?
On Mon, Jun 20, 2016 at 4:47 PM, Markus Zoeller <mzoel...@linux.vnet.ibm.com > wrote: > While working on the change series to implement the virtlogd feature I > got feedback [1] to move code which creates parts of the libvirt.xml > file from the "driver" module into the "guest" module. I'm a bit > hesitant to do so as the responsibility of creating a valid libvirt.xml > file is then spread across 3 modules: > * driver.py > * guest.py > * designer.py I'm only looking for a guideline here (The "driver" module is humongous > and I think it would be a good thing to have the "libvirt.xml" creation > code outside of it). Thoughts? > I'm also not a huge fan of the division of code for generating libvirt info for different backends in imagebackend. I'm not familiar enough with the xml generation code to have a firm opinion on how it should be. My concern looking at [1] is that to the greatest possible extent libvirt should be a data sink, not a data source. That is: Nova is the canonical source of truth. In terms of configuration, there should be nothing that libvirt knows that Nova didn't tell it first, so it should never be necessary to ask for it. The only reason this might not be true would be legacy instances, and we should be fixing them up. So, eg, guest.get_path() shouldn't exist. Matt > > References: > [1] > https://review.openstack.org/#/c/323761/2/nova/virt/libvirt/driver.py@4190 > > -- > Regards, Markus Zoeller (markus_z) > -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] A primer on data structures used by Nova to represent block devices
On Thu, Jun 16, 2016 at 4:20 PM, Kashyap Chamarthy <kcham...@redhat.com> wrote: [...] > > BlockDeviceMapping > > === > > > > The 'top level' data structure is the block device mapping object. It is > a > > NovaObject, persisted in the db. Current code creates a BDM object for > > every disk associated with an instance, whether it is a volume or not. I > > can't confirm (or deny) that this has always been the case, though, so > > there may be instances which still exist which have some BDMs missing. > > > > The BDM object describes properties of each disk as specified by the > user. > > It is initially created by the user and passed to compute api. Compute > api > > transforms and consolidates all BDMs to ensure that all disks, explicit > or > > implicit, have a BDM, then persists them. > > What could be an example of an "implicit disk"? > If the flavor defines an ephemeral disk which the user did not specify explicitly, it will be added. Possibly others, I'm not looking at that code right now. > > > Look in nova.objects.block_device > > for all BDM fields, but in essence they contain information like > > (source_type='image', destination_type='local', image_id='), > > or equivalents describing ephemeral disks, swap disks or volumes, and > some > > associated data. > > > > Reader note: BDM objects are typically stored in variables called 'bdm' > > with lists in 'bdms', although this is obviously not guaranteed (and > > unfortunately not always true: bdm in libvirt.block_device is usually a > > DriverBlockDevice object). This is a useful reading aid (except when it's > > proactively confounding), as there is also something else typically > called > > 'block_device_mapping' which is not a BlockDeviceMapping object. > > [...] > > > instance_disk_info > > = > > > > The driver api defines a method get_instance_disk_info, which returns a > > json blob. The compute manager calls this and passes the data over rpc > > between calls without ever looking at it. 
This is driver-specific opaque > > data. It is also only used by the libvirt driver, despite being part of > the > > api for all drivers. Other drivers do not return any data. The most > > interesting aspect of instance_disk_info is that it is generated from the > > libvirt XML, not from nova's state. > > > > Reader beware: instance_disk_info is often named 'disk_info' in code, > which > > is unfortunate as this clashes with the normal naming of the next > > structure. Occasionally the two are used in the same block of code. > > > > instance_disk_info is a list of dicts for some of an instance's disks. > > The above sentence reads a little awkward (maybe it's just me), might > want to rephrase it if you're submitting it as a Gerrit change. > Yeah. I think that's a case of re-editing followed by inadequate proof reading. > While reading this section, among other places, I was looking at: > _get_instance_disk_info() ("Get the non-volume disk information from the > domain xml") from nova/virt/libvirt/driver.py. > non-volume or Rbd ;) I've become a bit cautious about such docstrings: they aren't always correct :/ > > > Reader beware: Rbd disks (including non-volume disks) and cinder volumes > > are not included in instance_disk_info. > > > > The dicts are: > > > > { > > 'type': libvirt's notion of the disk's type > > 'path': libvirt's notion of the disk's path > > 'virt_disk_size': The disk's virtual size in bytes (the size the > guest > > OS sees) > > 'backing_file': libvirt's notion of the backing file path > > 'disk_size': The file size of path, in bytes. > > 'over_committed_disk_size': As-yet-unallocated disk size, in bytes. > > } > > > > disk_info > > === > > > > Reader beware: as opposed to instance_disk_info, which is frequently > called > > disk_info. > > > > This data structure is actually described pretty well in the comment > block > > at the top of libvirt/blockinfo.py. It is internal to the libvirt driver. 
> > It contains: > > > > { > > 'disk_bus': the default bus used by disks > > 'cdrom_bus': the default bus used by cdrom drives > > 'mapping': defined below > > } > > > > 'mapping' is a dict which maps disk names to a dict describing how that > > disk should be passed to libvirt. This mapping contains every disk > > connected to the instance, both local and volumes. > > Worth updating the existing definition of 'mapping' in > nova/virt/libvirt/blockinfo.py with your clearer description > above? > Indubitably. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
[openstack-dev] [nova] A primer on data structures used by Nova to represent block devices
Reader beware: instance_disk_info is often named 'disk_info' in code, which is unfortunate as this clashes with the normal naming of the next structure. Occasionally the two are used in the same block of code. instance_disk_info is a list of dicts, each describing one of an instance's disks; not all disks are included. Reader beware: Rbd disks (including non-volume disks) and cinder volumes are not included in instance_disk_info. The dicts are: { 'type': libvirt's notion of the disk's type 'path': libvirt's notion of the disk's path 'virt_disk_size': The disk's virtual size in bytes (the size the guest OS sees) 'backing_file': libvirt's notion of the backing file path 'disk_size': The file size of path, in bytes. 'over_committed_disk_size': As-yet-unallocated disk size, in bytes. } disk_info === Reader beware: as opposed to instance_disk_info, which is frequently called disk_info. This data structure is actually described pretty well in the comment block at the top of libvirt/blockinfo.py. It is internal to the libvirt driver. It contains: { 'disk_bus': the default bus used by disks 'cdrom_bus': the default bus used by cdrom drives 'mapping': defined below } 'mapping' is a dict which maps disk names to a dict describing how that disk should be passed to libvirt. This mapping contains every disk connected to the instance, both local and volumes. First, a note on disk naming. Local disk names used by the libvirt driver are well defined. They are: 'disk': the root disk 'disk.local': the flavor-defined ephemeral disk 'disk.ephX': where X is a zero-based index for bdm-defined ephemeral disks. 'disk.swap': the swap disk 'disk.config': the config disk These names are hardcoded, reliable, and used in lots of places. In disk_info, volumes are keyed by device name, eg 'vda', 'vdb'. Different busses will be named differently, approximately according to legacy Linux device naming. Additionally, disk_info will contain a mapping for 'root', which is the root disk.
This will duplicate one of the other entries, either 'disk' or a volume mapping. The information for each disk is: { 'bus': the bus for this disk 'dev': the device name for this disk as known to libvirt 'type': A type from the BlockDeviceType enum ('disk', 'cdrom', 'floppy', 'fs', or 'lun') === keys below are optional, and may not be present 'format': Used to format swap/ephemeral disks before passing to instance (e.g. 'swap', 'ext4') 'boot_index': the 1-based boot index of the disk. } Reader beware: BlockDeviceMapping and DriverBlockDevice store boot index zero-based. However, libvirt's boot index is 1-based, so the value stored here is 1-based. Cleanups === https://review.openstack.org/329366 rename libvirt has_default_ephemeral https://review.openstack.org/329927 Remove virt.block_device._NoLegacy exception https://review.openstack.org/329928 Remove unused context argument to _default_block_device_names() https://review.openstack.org/329930 Rename compute manager _check_dev_name to _add_missing_dev_names -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
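To make the shape of disk_info concrete, here is a hypothetical example for an instance booted from a local image with a flavor ephemeral disk and one attached volume. Every value below is an illustrative assumption, not output from a real deployment:

```python
# A hypothetical disk_info for an instance that boots from a local
# image ('disk'), has a flavor ephemeral disk ('disk.local'), and one
# attached volume (keyed by its device name, 'vdc'). Values are
# made up for illustration.
disk_info = {
    'disk_bus': 'virtio',   # default bus for disks
    'cdrom_bus': 'ide',     # default bus for cdrom drives
    'mapping': {
        'disk':       {'bus': 'virtio', 'dev': 'vda', 'type': 'disk',
                       'boot_index': '1'},
        'disk.local': {'bus': 'virtio', 'dev': 'vdb', 'type': 'disk',
                       'format': 'ext4'},
        'vdc':        {'bus': 'virtio', 'dev': 'vdc', 'type': 'disk'},
        # 'root' duplicates the root disk's entry; here that is 'disk'
        # because the instance boots from a local image rather than a
        # volume.
        'root':       {'bus': 'virtio', 'dev': 'vda', 'type': 'disk',
                       'boot_index': '1'},
    },
}
```

Note how 'root' duplicates the 'disk' entry, as described above; for a boot-from-volume instance it would instead duplicate a volume mapping.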
Re: [openstack-dev] [nova] theoretical race between live migration and resource audit?
Yes, this is a race. However, it's my understanding that this is 'ok'. The resource tracker doesn't claim to be 100% accurate at all times, right? Otherwise, why would it update itself in a periodic task in the first place? It's my understanding that the resource tracker is basically a best effort cache, and that scheduling decisions can still fail at the host. The resource tracker will fix itself next time it runs via its periodic task. Matt (not a scheduler person) On Thu, Jun 9, 2016 at 10:41 PM, Chris Friesen <chris.frie...@windriver.com> wrote: > Hi, > > I'm wondering if we might have a race between live migration and the > resource audit. I've included a few people on the receiver list that have > worked directly with this code in the past. > > In _update_available_resource() we have code that looks like this: > > instances = objects.InstanceList.get_by_host_and_node() > self._update_usage_from_instances() > migrations = objects.MigrationList.get_in_progress_by_host_and_node() > self._update_usage_from_migrations() > > > In post_live_migration_at_destination() we do this (updating the host and > node as well as the task state): > instance.host = self.host > instance.task_state = None > instance.node = node_name > instance.save(expected_task_state=task_states.MIGRATING) > > > And in _post_live_migration() we update the migration status to > "completed": > if migrate_data and migrate_data.get('migration'): > migrate_data['migration'].status = 'completed' > migrate_data['migration'].save() > > > Both of the latter routines are not serialized by the > COMPUTE_RESOURCE_SEMAPHORE, so they can race relative to the code in > _update_available_resource(). > > > I'm wondering if we can have a situation like this: > > 1) migration in progress > 2) We start running _update_available_resource() on destination, and we > call instances = objects.InstanceList.get_by_host_and_node(). This will > not return the migration, because it is not yet on the destination host.
> 3) The migration completes and we call > post_live_migration_at_destination(), which sets the host/node/task_state > on the instance. > 4) In _update_available_resource() on destination, we call migrations = > objects.MigrationList.get_in_progress_by_host_and_node(). This will return > the migration for the instance in question, but when we run > self._update_usage_from_migrations() the uuid will not be in "instances" > and so we will use the instance from the newly-queried migration. We will > then ignore the instance because it is not in a "migrating" state. > > Am I imagining things, or is there a race here? If so, the negative > effects would be that the resources of the migrating instance would be > "lost", allowing a newly-scheduled instance to claim the same resources > (PCI devices, pinned CPUs, etc.) > > Chris > -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
Re: [openstack-dev] [Openstack-operators] [nova] Is verification of images in the image cache necessary?
On Tue, May 24, 2016 at 1:15 PM, Fichter, Dane G. <dane.fich...@jhuapl.edu> wrote: > Hi John and Matt, > > I actually have a spec and patch up for review addressing some of what > you’re referring to below. > > https://review.openstack.org/#/c/314222/ > https://review.openstack.org/#/c/312210/ > > I think you’re quite right that the existing ImageCacheManager code serves > little purpose. What I propose here is a cryptographically stronger > verification meant to protect against both deliberate modification by an > adversary, as well as accidental sources of disk corruption. If you like, I > can deprecate the checksum-based verification code in the image cache as a > part of this change. Feel free to email me back or ping me on IRC > (dane-fichter) in order to discuss more. > Thanks Dane, reviewed. I don't think the details are right yet, but I do think this is the way to go. I also think we need to entirely divorce this functionality from the image cache. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
Re: [openstack-dev] [Openstack-operators] [nova] Is verification of images in the image cache necessary?
On Tue, May 24, 2016 at 11:06 AM, John Garbutt <j...@johngarbutt.com> wrote: > On 24 May 2016 at 10:16, Matthew Booth <mbo...@redhat.com> wrote: > > During its periodic task, ImageCacheManager does a checksum of every > image > > in the cache. It verifies this checksum against a previously stored > value, > > or creates that value if it doesn't already exist.[1] Based on this > > information it generates a log message if the image is corrupt, but > > otherwise takes no action. Going by git, this has been the case since > 2012. > > > > The commit which added it was associated with 'blueprint > > nova-image-cache-management phase 1'. I can't find this blueprint, but I > did > > find this page: > https://wiki.openstack.org/wiki/Nova-image-cache-management > > . This talks about 'detecting images which are corrupt'. It doesn't > explain > > why we would want to do that, though. It also doesn't seem to have been > > followed through in the last 4 years, suggesting that nobody's really > that > > bothered. > > > > I understand that corruption of bits on disks is a thing, but it's a > thing > > for more than just the image cache. I feel that this is a problem much > > better solved at other layers, prime candidates being the block and > > filesystem layers. There are existing robust solutions to bitrot at both > of > > these layers which would cover all aspects of data corruption, not just > this > > randomly selected slice. > > +1 > > That might mean improved docs on the need to configure such a thing. > > > As it stands, I think this code is regularly running a pretty expensive > task > > looking for something which will very rarely happen, only to generate a > log > > message which nobody is looking for. And it could be solved better in > other > > ways. Would anybody be sad if I deleted it? 
> > For completeness, we need to deprecate it using the usual cycles: > > https://governance.openstack.org/reference/tags/assert_follows-standard-deprecation.html I guess I'm arguing that it isn't a feature, and never has been: it really doesn't do anything at all except generate a log message. Are log messages part of the deprecation contract? If operators are genuinely finding corrupt images to be a problem and find this log message useful that would be extremely useful to know. > I like the idea of checking the md5 matches before each boot, as it > mirrors the check we do after downloading from glance. It's possible > that's very unlikely to spot anything that shouldn't already be worried > about by something else. It may just be my love of symmetry that makes > me like that idea? > It just feels arbitrary to me for a few reasons. Firstly, it's only relevant to storage schemes which use the file in the image cache as a backing file. In the libvirt driver, this is just the qcow2 backend. While this is the default, most users are actually using ceph. Assuming it isn't cloning it directly from ceph-backed glance, the Rbd backend imports from the image cache during spawn, and has nothing to do with it thereafter. So for Rbd we'd want to check during spawn. Same for the Flat, Lvm and Ploop backends. Except that it's still arbitrary because we're not checking the Qcow overlay on each boot. Or ephemeral or swap disks. Or Lvm, Flat or Rbd disks at all. Or the operating system. And it's still expensive, and better done by the block or filesystem layer. I'm not personally convinced there's all that much point checking during download either, but given that we're loading all the bits anyway that check is essentially free. However, even if we decided we needed to defend the system against bitrot above the block/filesystem layer (and I'm not at all convinced of that) we'd want a coordinated design for it.
Without one, we risk implementing a bunch of disconnected/incomplete stuff that doesn't meet anybody's needs, but burns resources anyway. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
[openstack-dev] [nova] Is verification of images in the image cache necessary?
During its periodic task, ImageCacheManager does a checksum of every image in the cache. It verifies this checksum against a previously stored value, or creates that value if it doesn't already exist.[1] Based on this information it generates a log message if the image is corrupt, but otherwise takes no action. Going by git, this has been the case since 2012. The commit which added it was associated with 'blueprint nova-image-cache-management phase 1'. I can't find this blueprint, but I did find this page: https://wiki.openstack.org/wiki/Nova-image-cache-management . This talks about 'detecting images which are corrupt'. It doesn't explain why we would want to do that, though. It also doesn't seem to have been followed through in the last 4 years, suggesting that nobody's really that bothered. I understand that corruption of bits on disks is a thing, but it's a thing for more than just the image cache. I feel that this is a problem much better solved at other layers, prime candidates being the block and filesystem layers. There are existing robust solutions to bitrot at both of these layers which would cover all aspects of data corruption, not just this randomly selected slice. As it stands, I think this code is regularly running a pretty expensive task looking for something which will very rarely happen, only to generate a log message which nobody is looking for. And it could be solved better in other ways. Would anybody be sad if I deleted it? Matt [1] Incidentally, there also seems to be a bug in this implementation, in that it doesn't hold the lock on the image itself at any point during the hashing process, meaning that it cannot guarantee that the image has finished downloading yet. 
-- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
[openstack-dev] [nova][live migration] Request to merge initial storage pool patches
I mentioned in the meeting last Tuesday that there are now 2 of us working on the persistent storage metadata patches: myself and Diana Clarke. I've also been talking to Paul Carlton today trying to work out how he can get moving with the subsequent libvirt storage pools work without a huge amount of conflict. We're going to work that out, but something which would help us enormously while we work together is to knock off some semi-dependent patches at the beginning of the series. Specifically there are 5 patches which precede the series. None of these 5 patches are really part of the series, but the series depends on all 5 to a greater or lesser degree. In order, they are: Only attempt to inject files if the injection disk exists https://review.openstack.org/#/c/250872/ Remove fake_imagebackend.Raw and cleanup dependent tests https://review.openstack.org/#/c/267661/ Rename Image.check_image_exists to Image.exists() https://review.openstack.org/#/c/270998/ Rename Raw backend to NoBacking https://review.openstack.org/#/c/279626/ Remove deprecated option libvirt.remove_unused_kernels https://review.openstack.org/#/c/265886/ These have all been reviewed previously and attracted +1s. I just rebased them, so they may have been lost, but I'm pretty sure they're ready for prime time. These patches aren't really what the series is about, but they are currently the principal source of merge conflicts. If we could get these out of the way it would make it much easier for the 3 of us working on the substantive series. Any chance of some core reviewer attention to get these moving? As for the status of the rest of the series, there are 2 more which I don't expect to change: Add a lock() context manager to image backend https://review.openstack.org/#/c/279625/ Introduce ImageCacheLocalPool https://review.openstack.org/#/c/279669/ Please review these too. However, these patches aren't ready to be merged because they don't add users of the interfaces they introduce. 
Everything after that is definitely changing. Diana and I are currently working on cranking through these backend by backend. I'll provide a weekly progress update in the live migration meeting. TL;DR Core reviewers: please review the first 5 patches listed above. There will be cake. Thanks, Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
Re: [openstack-dev] [nova] Should be instance_dir in all nova compute node same ?
On Fri, Apr 29, 2016 at 2:47 AM, Eli Qiao <liyong.q...@intel.com> wrote: > hi team, > > Is there any requirement that all compute nodes' instance_dir should be the same? > Yes. This is assumed in many places, certainly in cold migration/resize. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
[openstack-dev] [nova][live-migration] Libvirt storage pools and persistent disk metadata specs
I've just submitted a new spec for the imagebackend work I was doing in Mitaka, which is a prerequisite for the libvirt storage pools work: https://review.openstack.org/#/c/302117/ I was just about to resubmit libvirt storage pools with my spec as a dependency, but re-reading it perhaps we should discuss it briefly first. I'm aware that the driving goal for this move is to enable data transfer over the libvirt pipe. However, while we're messing with the on-disk layout it might be worth thinking about the image cache. The current implementation of the image cache is certainly problematic. It has expanded far beyond its original brief, and it shows. The code has no coherent structure, which makes following it very difficult even to those familiar with it. Worst of all, though, the behaviour of the image cache is distributed across several different modules, with no obvious links between those modules for the unwary (aka tight coupling). The image cache relies on locking for correct behaviour, but this locking is also distributed across many modules. Verifying the correct behaviour of the image cache's locking is hard enough to be impractical, which shows in the persistent stream of bugs we see relating to backing files going missing. In short, the image cache implementation needs an extreme overhaul anyway in the light of its current usage. We also need to address the problem that the image cache doesn't store any metadata about its images. We currently determine the file format of an image cache entry by inspection. While we sanity check images when writing them to the image cache, this is not a robust defence against format bugs and vulnerabilities. More than this, though, the design of the image cache no longer makes sense. When the image cache was originally implemented, there was only local file storage, but the libvirt driver also now supports LVM and ceph. Over 60% of our users use ceph, so ceph is really important.
The ceph backend is often able to use images directly from glance if they're also in ceph, but when they're not it continues to use this local file store, which makes no sense. When we move to libvirt storage pools (with my dependent change), we open the possibility for users to have multiple local storage pools, and have instance storage allocated between them according to policy defined in the flavor. If we're using a common image cache for all storage pools, this limits the differentiation between those storage pools. So if a user's paying for instance storage on SSD, but the backing file is on spinning rust, that's not great. Logically, the image cache should be a property of the storage backend. So, local file stores should cache images in the local file store. LVM should cache images in LVM, which would also allow it to use writeable snapshots. Ceph should cache images in the same pool its instance disks are in. This means that the image cache is always directly usable by its storage pool, which is its purpose. The image cache also needs a robust design for operating on shared storage. It currently doesn't have one, although the locking means that it's hopefully probably only maybe slightly a little racy, with any luck, perhaps. This change may be too large to do right now, but we should understand that changing it later will require a further migration of some description, and bear that in mind. A local file/lvm/ceph storage pool with an external image cache has a different implementation and layout to the same storage pool with an integrated image cache. Data is stored in different places, and is managed differently. If we don't do it now, it will be slightly harder later. Reading through the existing spec, I also notice that it mentions use of the 'backingStore' element. This needs to come out of the spec, as we MUST NOT use this. The problem is that it's introspected. 
I don't know if libvirt stores this persistently while it's running, but it most certainly recreates it after a restart or pool refresh. Introspecting file formats and backing files when the user can influence it is a severe security hole, so we can't do it. Instead, we need to use the metadata created and defined by the spec I linked at the top.

There will also need to be a lot more subclassing than this spec anticipates. Specifically, I expect most backends to require a subclass. In particular, the libvirt api doesn't provide storage locking, so we will have to implement that for each backend.

I don't want to spend too long on the spec. The only thing worthy of discussion is the image cache, I guess.

Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
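[Editor's sketch] Making the cache a property of the storage backend, as proposed above, could look roughly like this. Class and method names are illustrative only, not nova's actual imagebackend API:

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """A storage pool which owns its own image cache."""

    @abstractmethod
    def cache_image(self, image_id):
        """Ensure image_id is cached in this pool; return its location."""

    def create_disk(self, image_id, name):
        # Instance disks are always backed by an image cached in the
        # same pool, so the backing store is directly usable.
        backing = self.cache_image(image_id)
        return "%s backed by %s" % (name, backing)


class LocalFileBackend(StorageBackend):
    def __init__(self, pool_dir):
        self.pool_dir = pool_dir

    def cache_image(self, image_id):
        # Cache lives inside this pool: SSD pool caches on SSD.
        return "%s/cache/%s" % (self.pool_dir, image_id)


class CephBackend(StorageBackend):
    def __init__(self, pool):
        self.pool = pool

    def cache_image(self, image_id):
        # Ceph caches in the same rbd pool as its instance disks.
        return "rbd:%s/cache_%s" % (self.pool, image_id)


ssd = LocalFileBackend("/var/lib/nova/pools/ssd")
disk = ssd.create_disk("img-1", "instance-0001/disk")
```

The point of the shape is that no backend ever reaches outside its own pool for a backing file, which is the differentiation problem described above.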
[openstack-dev] [nova] Migration shared storage negotiation
Timofey has posted a patch which aims to improve shared storage negotiation a little: https://review.openstack.org/#/c/286745/ I've been thinking for some time about going bigger. It occurs to me that the destination node has enough context to know what should be present. I think it would be cleaner for the destination to check what it has, and send the delta back to the source in migration_data. I think that would remove the need for shared storage negotiation entirely. Of course, we'd need to be careful about cleaning up after ourselves, but that's already true. Thoughts? Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
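[Editor's sketch] The destination-side check suggested above might look roughly like this. The helper is hypothetical; the real delta would travel back to the source in migration_data:

```python
def disks_to_transfer(expected, present):
    """On the destination, compute which disks the source must send:
    everything the instance should have, minus what is already visible
    locally (i.e. what lives on shared storage)."""
    return sorted(set(expected) - set(present))


# On shared storage the destination already sees every disk, so the
# delta is empty and no explicit shared-storage negotiation is needed.
shared = disks_to_transfer(["disk", "disk.config"], ["disk", "disk.config"])

# With only the config drive on shared storage, the delta sent back to
# the source is just 'disk'.
partial = disks_to_transfer(["disk", "disk.config"], ["disk.config"])
```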
Re: [openstack-dev] [all] A proposal to separate the design summit
+1 Matt On Mon, Feb 22, 2016 at 3:31 PM, Monty Taylor <mord...@inaugust.com> wrote: > On 02/22/2016 07:24 AM, Russell Bryant wrote: > >> >> >> On Mon, Feb 22, 2016 at 10:14 AM, Thierry Carrez <thie...@openstack.org >> <mailto:thie...@openstack.org>> wrote: >> >> Hi everyone, >> >> TL;DR: Let's split the events, starting after Barcelona. >> >> >> This proposal sounds fantastic. Thank you very much to those that help >> put it together. >> > > Totally agree. I think it's an excellent way to address the concerns and > balance all of the diverse needs we have. > > Thank you very much! > > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev > -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Nova] sponsor some LVM development
On Tue, Jan 19, 2016 at 8:47 PM, Fox, Kevin Mwrote: > One feature I think we would like to see that could benefit from LVM is > some kind of multidisk support with better fault tolerance > > For example: > Say you have a node, and there are 20 vm's on it, and thats all the disk > io it could take. But say you have spare cpu/ram capacity other then the > diskio being used up. It would be nice to be able to add a second disk, and > be able to launch 20 more vm's, located on the other disk. > > If you combined them together into one file system (linear append or > raid0), you could loose all 40 vm's if something went wrong. That may be > more then you want to risk. If you could keep them as separate file systems > or logical volumes (maybe with contigous lv's?) Each vm could only top out > a spindle, but it would be much more fault tolerant to failures on the > machine. I can see some cases where that tradeoff between individual vm > performance and number of vm's affected by a device failure can lean in > that direction. > > Thoughts? This is simple enough for a compute host. However, the real problem is in the scheduler. The scheduler needs to be aware that there are multiple distinct resource pools on the host. For e.g., if your 2 pools have 20G of disk each, that doesn't mean you can spin up an instance with 30G. This is the same problem the VMware driver has, and to the best of my knowledge there's still no solution to that. Matt __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
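[Editor's sketch] The scheduler problem is that free space in separate pools isn't additive: a disk must fit within a single pool, so a scheduler that only tracks the host's total free disk will accept unplaceable requests. A toy illustration (not scheduler code):

```python
def fits_total(request_gb, pools_gb):
    # Naive view: sum free space across pools. Wrongly accepts requests
    # larger than any single pool.
    return sum(pools_gb) >= request_gb


def fits_single_pool(request_gb, pools_gb):
    # Correct view: the instance disk must be allocated from one pool.
    return any(free >= request_gb for free in pools_gb)


pools = [20, 20]  # two 20G pools on one compute host
```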
Re: [openstack-dev] [Nova] sponsor some LVM development
Hello, Premysl, I'm not working on these features, however I am working in this area of code implementing the libvirt storage pools spec. If anybody does start working on this, please reach out to coordinate as I have a bunch of related patches. My work should also make your features significantly easier to implement. Out of curiosity, can you explain why you want to use LVM specifically over the file-based backends? Matt On Mon, Jan 18, 2016 at 7:49 PM, Premysl Kouril <premysl.kou...@gmail.com> wrote: > Hello everybody, > > we are a Europe based operator and we have a case for LVM based nova > instances in our new cloud infrastructure. We are currently > considering to contribute to OpenStack Nova to implement some features > which are currently not supported for LVM based instances (they are > only supported for raw/qcow2 file based instances). As an example of > such features - nova block live migration or thin provisioning - these > nowadays don't work with LVM based instances (they do work for file > based). > > Before actually diving into development here internally - we wanted to > check on possibility to actually sponsor this development within > existing community. So if there is someone who would be interested in > this work please drop me an email. > > Regards, > Prema > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev > -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [nova][stable] Freeze exception for kilo CVE-2015-7548 backports
The following 3 patches fix CVE-2015-7548, "Unprivileged api user can access host data using instance snapshot": https://review.openstack.org/#/c/264819/ https://review.openstack.org/#/c/264820/ https://review.openstack.org/#/c/264821/ The OSSA is rated critical. The patches have now landed on master and liberty after some delays in the gate. Given the importance of the fix I suspect that most/all downstream distributions will have already patched (certainly Red Hat has), but it would be good to have them in upstream stable. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK)
[openstack-dev] [nova] (Live migration track) Libvirt storage pools initial step
I'm off now for the holiday period, so I'm sharing what I'm sitting on. This represents a bit over a week of staring at code, and a couple of days of actual keyboard banging. It doesn't run. Details are in the commit message, including where I think I'm going with this. Read the commit message like a slightly more detailed spec. https://review.openstack.org/259148 I have a reasonable idea of where I want to go with this, but finding a consumable set of steps to get there is challenging. The current code makes lots of assumptions about storage in lots of places. Consequently, adding to it without regressing is becoming increasingly difficult. Making the wholesale change to libvirt storage pools, supporting multiple layouts in the transition period, would require updating these assumptions everywhere they exist in the code. My initial goal is to centralise these assumptions whilst making minimal/no functional changes. Once all the code which assumes on-disk layout is in one place, we can be more confident in making changes to it, and more easily validate it. We can also more easily see what the current assumptions are. This will also make it easier for people to continue adding storage-related features to the libvirt driver. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] What things do we want to get into a python-novaclient 3.0 release?
I wrote this a while back, which implements 'migrate everything off this compute host' in the most robust manner I could come up with using only the external api: https://gist.github.com/mdbooth/163f5fdf47ab45d7addd It obviously overlaps considerably with host-servers-migrate, which is supposed to do the same thing. Users seem to have been appreciative, so I'd be interested to see it merged in some form. Matt On Thu, Nov 19, 2015 at 6:18 PM, Matt Riedemannwrote: > We've been talking about doing a 3.0 release for novaclient for awhile so > we can make some backward incompatible changes, like: > > 1. Removing the novaclient.v1_1 module > 2. Dropping py26 support (if there is any explicit py26 support in there) > > What else are people aware of? > > Monty was talking about doing a thing with auth: > > https://review.openstack.org/#/c/245200/ > > But it sounds like that is not really needed now? > > I'd say let's target mitaka-2 for a 3.0 release and get these flushed out. > > -- > > Thanks, > > Matt Riedemann > > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev > __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [oslo.messaging] State wrapping in the MessageHandlingServer
On Tue, Nov 10, 2015 at 6:46 PM, Joshua Harlow <harlo...@fastmail.com> wrote: > Matthew Booth wrote: > >> My patch to MessageHandlingServer is currently being reverted because it >> broke Nova tests: >> >> https://review.openstack.org/#/c/235347/ >> >> Specifically it causes a number of tests to take a very long time to >> execute, which ultimately results in the total build time limit being >> exceeded. This is very easy to replicate. The >> test >> nova.tests.functional.test_server_group.ServerGroupTest.test_boot_servers_with_affinity >> is an example test which will always hit this issue. The problem is >> that ServerGroupTest.setUp() does: >> >> self.compute2 = self.start_service('compute', host='host2') >> self.addCleanup(self.compute2.kill) >> >> The problem with this is that start_service() adds a fixture which also >> adds kill as a cleanup method. kill does stop(), wait(). This means that >> the resulting call order is: start, stop, wait, stop, wait. The >> redundant call to kill is obviously a wart, but I feel we should have >> handled it anyway. >> >> The problem is that we decided it should be possible to restart a >> server. There are some unit tests in oslo.messaging that do this. It's >> not clear to me that there are any projects which do this, but after >> this experience I feel like it would be good to check before changing it >> :) >> >> The implication of that is that after wait() the state wraps, and we're >> now waiting on start() again. Consequently, when the second cleanup call >> hangs. >> >> We could fix Nova (at least the usage we have seen) by removing the >> wrapping. After wait() if you want to start a server again you need to >> create a new one. >> >> So, to be specific, lets consider the following 2 call sequences: >> >> 1. start stop wait stop wait >> 2. start stop wait start stop wait >> >> What should they do? The behaviours with and without wrapping are: >> >> 1. 
start stop wait stop wait >> WRAP: start stop wait HANG HANG >> NO WRAP: start stop wait NO-OP NO-OP >> >> 2. start stop wait start stop wait >> WRAP: start stop wait start stop wait >> NO WRAP: start stop wait NO-OP NO-OP NO-OP >> >> I'll refresh my memory on what they did before my change in the morning. >> Perhaps it might be simpler to codify the current behaviour, but iirc I >> proposed this because it was previously undefined due to races. >> > > I personally prefer not allowing restarting, its needless code complexity > imho and a feature that people imho probably aren't using anyway (just > create a new server object if u are doing this), so I'd be fine with doing > the above NO WRAP and turning those into NO-OPs (and for example raising a > runtime error in the case of start stop wait start ... to denote that > restarting isn't recommended/possible). If we have a strong enough reason > to really to start stop wait start ... > > I might be convinced the code complexity is worth it but for now I'm not > convinced... > I agree, and in the hopefully unlikely event that we did break anybody, at least they would get an obvious exception rather than a hang. A lesson from breaking nova was that the log messages were generated and were available in the failed test runs, but nobody noticed them. Incidentally, I think I'd also merge my second patch into the first before resubmitting, which adds timeouts and the option not to log. Matt __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [oslo.messaging] State wrapping in the MessageHandlingServer
My patch to MessageHandlingServer is currently being reverted because it broke Nova tests: https://review.openstack.org/#/c/235347/

Specifically, it causes a number of tests to take a very long time to execute, which ultimately results in the total build time limit being exceeded. This is very easy to replicate. The test nova.tests.functional.test_server_group.ServerGroupTest.test_boot_servers_with_affinity is an example test which will always hit this issue. The problem is that ServerGroupTest.setUp() does:

self.compute2 = self.start_service('compute', host='host2')
self.addCleanup(self.compute2.kill)

The problem with this is that start_service() adds a fixture which also adds kill as a cleanup method. kill does stop(), wait(). This means that the resulting call order is: start, stop, wait, stop, wait. The redundant call to kill is obviously a wart, but I feel we should have handled it anyway.

The problem is that we decided it should be possible to restart a server. There are some unit tests in oslo.messaging that do this. It's not clear to me that there are any projects which do this, but after this experience I feel like it would be good to check before changing it :)

The implication of that is that after wait() the state wraps, and we're now waiting on start() again. Consequently, the second cleanup call hangs.

We could fix Nova (at least the usage we have seen) by removing the wrapping. After wait(), if you want to start a server again you need to create a new one.

So, to be specific, let's consider the following 2 call sequences:

1. start stop wait stop wait
2. start stop wait start stop wait

What should they do? The behaviours with and without wrapping are:

1. start stop wait stop wait
   WRAP:    start stop wait HANG HANG
   NO WRAP: start stop wait NO-OP NO-OP

2. start stop wait start stop wait
   WRAP:    start stop wait start stop wait
   NO WRAP: start stop wait NO-OP NO-OP NO-OP

I'll refresh my memory on what they did before my change in the morning.
Perhaps it might be simpler to codify the current behaviour, but iirc I proposed this because it was previously undefined due to races. Matt __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
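[Editor's sketch] The NO WRAP behaviour under discussion — once a server has fully stopped, further stop/wait calls become no-ops rather than re-arming the state machine — can be modelled with a tiny state machine. Illustrative only, not the oslo.messaging implementation:

```python
class Server:
    """Toy one-shot server: new -> running -> stopped -> dead, no wrap."""

    def __init__(self):
        self.state = 'new'
        self.log = []

    def start(self):
        # Only a fresh server may start; after wait() the server is dead
        # and restarting requires creating a new object (could also raise
        # RuntimeError here to make the misuse obvious).
        if self.state == 'new':
            self.state = 'running'
            self.log.append('start')

    def stop(self):
        if self.state == 'running':
            self.state = 'stopped'
            self.log.append('stop')

    def wait(self):
        if self.state == 'stopped':
            self.state = 'dead'
            self.log.append('wait')


# Sequence 1: start stop wait stop wait -> trailing calls are no-ops,
# so the redundant kill() cleanup in the Nova test no longer hangs.
s = Server()
for op in (s.start, s.stop, s.wait, s.stop, s.wait):
    op()
```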
[openstack-dev] Fwd: [nova][mistral] Automatic evacuation as a long running task
Accidentally sent this privately. -- Forwarded message -- From: Matthew Booth <mbo...@redhat.com> Date: Fri, Oct 9, 2015 at 6:14 PM Subject: Re: [openstack-dev] [nova][mistral] Automatic evacuation as a long running task To: "Deja, Dawid" <dawid.d...@intel.com> On Thu, Oct 8, 2015 at 12:51 PM, Deja, Dawid <dawid.d...@intel.com> wrote: > Hi Matthew, > > Thanks for bringing some light on what problems has nova with evacuation > of an instance. It is very important to have those limitations in mind when > preparing final solution. Or to fix them, as you proposed. > > Nevertheless, I would say that evacuationD does more than what calling > 'nova host-evacuate' do. Let's consider such scenario: > > 1. Call 'nova host evacuate HostX' > 2. Caller dies during call - information that some VMs are still to be > evacuated is lost. > No, it's not lost because the instances still have instance.host == source. This means that you can (and must, in fact) simply run 'nova host-evacuate' again if it didn't complete successfully the first time. Note that an external agent (lets call it pacemaker) must solve exactly the same problem in order to send a message to evacuated. It must assure itself that it successfully 'sent a message' at least once, somewhere. Now replace 'sent a message' with 'ran nova host-evacuate'. > Such thing would not happen with evacuationD, because it prepares one > rabbitMQ message for each VM that needs to be evacuated. Moreover, it deals > with situation, when process that lists VMs crashes. In such case, whole > operation would be continued by another daemon. > > EvacD may also handle another problem that you mentioned: failure of > target host of evacuation. In such scenario, 'evacuate host' message will > be send for a new host and EvacD will try to evacuate all of it's vms - > even those in rebuild state. Of course, evacuation of such instances fails, > but they would eventually enter error state and evacuationD would start > resurrection process. 
This can be speed up by setting instances state to > 'error' (despite these which are in 'active' state) on the beginning of > whole 'evacuate host' process. > Again, this situation is identical to simply running nova host-evacuate. EvacD doesn't do any monitoring, and requires an external agent (we called it pacemaker) to invoke it for the newly failed host. In this scenario, whatever sends 'evacuate host' can instead run 'nova host-evacuate', and the behaviour is identical. > > Finally, another action - called 'Look for VM' - could be added. It would > check if given VM ended up in active state on new hosts; if no, VM could be > rebuild. I hope this would give us as much certainty that VM is alive as > possible. > This would add behaviour over nova host-evacuate. However, it would also be considerably more complex to implement and what's there currently doesn't add any infrastructure which enables it. Remember that nova evacuate is not a heavy operation for the caller. It is literally just a nova api call which returns after kicking off a task in conductor. Running nova host-evacuate does: 1. List all instances 2. For instance 0, tell nova to initiate evac 3 ... Running evacD does: 1. List all instances 2. For instance 0, send ourselves a message to initiate evac ... rabbit ... 3. For instance 0, tell nova to initiate evac In other words, evacD just makes the call chain longer. It adds overhead and additional potential points of failure. Ironically, this means the resulting solution will be less robust. Matt > On Tue, 2015-10-06 at 16:34 +0100, Matthew Booth wrote: > Hi, Roman, > > Evacuated has been on my radar for a while and this post has prodded me to > take a look at the code. I think it's worth starting by explaining the > problems in the current solution. Nova client is currently responsible for > doing this evacuate. It does: > > 1. List all instances on the source host > 2. 
Initiate evacuate for each instance > > Evacuating a single instance does: > > API: > 1. Set instance task state to rebuilding > 2. Create a migration record with source and dest if specified > > Conductor: > 3. Call the scheduler to get a destination host if not specified > 4. Get the migration object from the db > > Compute: > 5. Rebuild the instance on dest > 6. Update instance.host to dest > > Examining single instance evacuation, the first obvious thing to look at > is what if 2 happen simultaneously. Because step 1 is atomic, it should not > be possible to initiate 2 evacuations simultaneously of a single instance. > However, note that this atomic action hasn't updated the instance host, > meaning the source host remains the owner of this instance. If the > evacuation process fails to complete, the source host wil
Re: [openstack-dev] [all] -1 due to line length violation in commit messages
On Fri, Sep 25, 2015 at 3:44 PM, Ihar Hrachyshkawrote: > Hi all, > > releases are approaching, so it’s the right time to start some bike > shedding on the mailing list. > > Recently I got pointed out several times [1][2] that I violate our commit > message requirement [3] for the message lines that says: "Subsequent lines > should be wrapped at 72 characters.” > > I agree that very long commit message lines can be bad, f.e. if they are > 200+ chars. But <= 79 chars?.. Don’t think so. Especially since we have 79 > chars limit for the code. > > We had a check for the line lengths in openstack-dev/hacking before but it > was killed [4] as per openstack-dev@ discussion [5]. > > I believe commit message lines of <=80 chars are absolutely fine and > should not get -1 treatment. I propose to raise the limit for the guideline > on wiki accordingly. > IIUC, the lower limit for commit messages is because git displays them indented by default, which means that lines which are 80 chars long will wrap on a display which is 80 chars wide. I personally use terminal windows which are 80 chars wide, and I do find long lines in commit messages annoying, so I'm personally in favour of retaining the lower limit. Can't say I'd storm any castles if it was changed, but if most people use git the way I do[1] I guess it should stay. Matt [1] I have no idea if this is the case. __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova][mistral] Automatic evacuation as a long running task
Hi, Roman,

Evacuated has been on my radar for a while and this post has prodded me to take a look at the code. I think it's worth starting by explaining the problems in the current solution. Nova client is currently responsible for doing this evacuate. It does:

1. List all instances on the source host
2. Initiate evacuate for each instance

Evacuating a single instance does:

API:
1. Set instance task state to rebuilding
2. Create a migration record with source and dest if specified

Conductor:
3. Call the scheduler to get a destination host if not specified
4. Get the migration object from the db

Compute:
5. Rebuild the instance on dest
6. Update instance.host to dest

Examining single instance evacuation, the first obvious thing to look at is what happens if 2 occur simultaneously. Because step 1 is atomic, it should not be possible to initiate 2 evacuations simultaneously of a single instance. However, note that this atomic action hasn't updated the instance host, meaning the source host remains the owner of this instance. If the evacuation process fails to complete, the source host will automatically delete it if it comes back up because it will find a migration record, but it will not be rebuilt anywhere else. Evacuating it again will fail, because its task state is already rebuilding. Also, let's imagine that the conductor crashes. There is not enough state for any tool, whether internal or external, to be able to know if the rebuild is ongoing somewhere or not, and therefore whether it is safe to retry, even if that retry would succeed, which it wouldn't. Which is to say that we can't currently robustly evacuate one instance!

Looking at the nova client side, there is an obvious race there: there is no guarantee in step 2 that instances returned in step 1 have not already been evacuated by another process. We're protected here, though, because evacuating a single instance twice will fail the second time.
Note that the process isn't idempotent, though, because an evacuation which falls into a hole will never be retried.

Moving on to what evacuated does. Evacuated uses rabbit to distribute jobs reliably. There are 2 jobs in evacuated:

1. Evacuate host:
   1.1 Get list of all instances on the source host from Nova
   1.2 Send an evacuate vm job for each instance

2. Evacuate vm:
   2.1 Tell Nova to start evacuating an instance

Because we're using rabbit as a reliable message bus, the initiator of one of the tasks knows that it will eventually run to completion at least once. Note that there's nothing to prevent the task being executed more than once per call, though. A task may crash before sending an ack, or may just be really slow. However, in both cases, for exactly the same reasons as for the implementation in nova client, running more than once should not race. It is still not idempotent, though, again for exactly the same reasons as nova client.

Also notice that, exactly as in the nova client implementation, we are not asserting that an instance has been evacuated. We are only asserting that we called nova.evacuate, which is to say that we got as far as step 2 in the evacuation sequence above. In other words, in terms of robustness, calling evacuated's evacuate host is identical to asserting that nova client's evacuate host ran to completion at least once, which is quite a lot simpler to do. That's still not very robust, though: we don't recover from failures, and we don't ensure that an instance is evacuated, only that we started an attempt to evacuate at least once.

I'm obviously not satisfied with nova client, however as the implementation is simpler I would favour it over evacuated. I believe we can solve this problem, but I think that without fixing single-instance evacuate we're just pushing the problem around (or creating new places for it to live).
I would base the robustness of my implementation on a single principle: an instance has a single owner, which is exclusively responsible for rebuilding it. In outline, I would redefine the evacuate process to do:

API:
1. Call the scheduler to get a destination for the evacuate if none was given.
2. Atomically update instance.host to this destination, and task state to rebuilding.

Compute:
3. Rebuild the instance.

This would be supported by a periodic task on the compute host which looks for rebuilding instances assigned to this host which aren't currently rebuilding, and kicks off a rebuild for them. This would cover the compute going down during a rebuild, or the api going down before messaging the compute.

Implementing this gives us several things:

1. The list instances, evacuate all instances process becomes idempotent, because as soon as the evacuate is initiated, the instance is removed from the source host.

2. We get automatic recovery of failure of the target compute. Because we atomically moved the instance to the target compute immediately, if the target compute also has to be evacuated, our instance won't fall through the
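[Editor's sketch] A toy model of the atomic ownership move that makes the list-and-evacuate loop idempotent. This is illustrative only, not nova code; a real implementation would do the claim as a compare-and-swap in the database rather than with an in-process lock:

```python
import threading


class Instance:
    """Minimal stand-in for an instance record with atomic claiming."""

    def __init__(self, host):
        self.host = host
        self.task_state = None
        self._lock = threading.Lock()  # stands in for a db CAS

    def claim_for_evacuate(self, dest):
        """Atomically move ownership to dest and mark rebuilding.
        Returns False if the instance was already claimed."""
        with self._lock:
            if self.task_state == 'rebuilding':
                return False
            self.host = dest
            self.task_state = 'rebuilding'
            return True


inst = Instance(host='failed-host')
first = inst.claim_for_evacuate('host2')    # wins: instance leaves source
second = inst.claim_for_evacuate('host3')   # concurrent retry: rejected
```

Because the instance's host changes at claim time, re-running "list instances on failed-host, evacuate each" finds nothing left to do, which is the idempotency property described above.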
[openstack-dev] [nova] Fine-grained error reporting via the external API
I've recently been writing a tool which uses Nova's external API. This is my first time consuming this API, so it has involved a certain amount of discovery. The tool is here for the curious: https://gist.github.com/mdbooth/163f5fdf47ab45d7addd

I have felt hamstrung by the general inability to distinguish between different types of error. For example, if a live migration failed, is it because:

1. The compute driver doesn't support it.
2. This instance requires block storage migration.
3. Something ephemeral.

These 3 errors all require different responses:

1. Quit and don't try again.
2. Try again immediately with the block migration argument.[1]
3. Try again in a bit.

However, all I have is that I made a BadRequest. I could potentially grep the human readable error message, but the text of that message doesn't form part of the API, and it may be translated in any case. As an API consumer, it seems I can't really tell anything other than 'it didn't work'. More than that requires guesswork, heuristics and inference.

I don't think I've missed some source of additional wisdom, but it would obviously be great if I have. Has there ever been any effort to define some contract around more fine-grained error reporting?

Thanks,

Matt

[1] Incidentally, this suggests to me that live migrate should just do this anyway.

-- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
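[Editor's sketch] If errors carried a machine-readable code, the caller's three responses could be written directly instead of grepping translated text. The error codes below are invented for illustration; the real contract would be whatever a structured error document eventually defines:

```python
import json


def next_action(error_body):
    """Map a (hypothetical) machine-readable error code to one of the
    three caller responses described above."""
    code = json.loads(error_body).get('code')
    return {
        'live-migration-unsupported': 'abort',  # 1. quit, don't retry
        'block-migration-required': 'retry-with-block-migration',  # 2.
    }.get(code, 'retry-later')  # 3. ephemeral: try again in a bit


action = next_action('{"code": "block-migration-required"}')
```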
Re: [openstack-dev] [nova] Fine-grained error reporting via the external API
On 11/09/15 12:19, Sean Dague wrote: > On 09/11/2015 05:41 AM, Matthew Booth wrote: >> I've recently been writing a tool which uses Nova's external API. This >> is my first time consuming this API, so it has involved a certain amount >> of discovery. The tool is here for the curious: >> >> https://gist.github.com/mdbooth/163f5fdf47ab45d7addd >> >> I have felt hamstrung by the general inability to distinguish between >> different types of error. For example, if a live migration failed is it >> because: >> >> 1. The compute driver doesn't support support it. >> >> 2. This instance requires block storage migration. >> >> 3. Something ephemeral. >> >> These 3 errors all require different responses: >> >> 1. Quit and don't try again. >> >> 2. Try again immediately with the block migration argument.[1] >> >> 3. Try again in a bit. >> >> However, all I have is that I made a BadRequest. I could potentially >> grep the human readable error message, but the text of that message >> doesn't form part of the API, and it may be translated in any case. As >> an API consumer, it seems I can't really tell anything other than 'it >> didn't work'. More than that requires guesswork, heuristics and inference. >> >> I don't think I've missed some source of additional wisdom, but it would >> obviously be great if I have. Has there ever been any effort to define >> some contract around more fine-grained error reporting? >> >> Thanks, >> >> Matt >> >> [1] Incidentally, this suggests to me that live migrate should just do >> this anyway. > > This is an API working group recommendation evolving here. The crux of > which is going to be a structured json error return document that will > contain more info. https://review.openstack.org/#/c/167793/ Thanks, Sean, that's exactly what I was looking for. I'll continue this discussion in that review. 
Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [all] Script to merge sort log files
I expect there are several existing solutions to this problem, but here's mine (attached). Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 log_merge.sh Description: application/shellscript __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
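[Editor's note] The attached log_merge.sh isn't reproduced in this archive. As a sketch of the same idea (merging several individually-sorted log files into one ordered stream), Python's heapq.merge does a lazy merge sort, assuming each line starts with a timestamp that sorts lexically, as OpenStack service logs typically do:

```python
import heapq

def merge_logs(*streams):
    """Lazily merge-sort already-sorted log streams into one.

    Assumes each line begins with a lexically-sortable timestamp,
    e.g. "2015-02-20 11:48:01.123 ...". Each stream may be an open
    file object or any iterable of lines.
    """
    return heapq.merge(*streams)

# Usage (illustrative file names):
#   for line in merge_logs(open('n-cpu.log'), open('n-api.log')):
#       print(line, end='')
```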
[openstack-dev] [all] Any other downstream developers having problems with pbr?
I wrote this: https://review.openstack.org/#/c/195983/1/tools/de-pbr.py,cm Ideally we'd fix PBR, but this seems to be expected behaviour. Thoughts? Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [nova] Unvalidated user input passed to functions
I was looking at the migrations api, and I noticed that the api passes the request query unchecked to get_migrations, where it ultimately ends up in a db query. I was curious and spent a couple of hours checking this morning. There are a few instances of this. I didn't find any security bugs, however I feel that this is extremely bad practice, and is likely to result in a security bug eventually.

For example, note that os-assisted-volume-snapshots:delete does not validate delete_info before passing it to volume_snapshot_delete. I looked at this quite carefully, and I think we are only protected from a host compromise because:

1. The api requires admin context

2. libvirt's security policy

I could be wrong on that, though, so perhaps somebody else could check?

Passing unvalidated input to a function isn't necessarily bad, for example if it is only used for filtering, but it should be clearly marked as such so it isn't used in an unsafe manner. This marking should follow the data as far as it goes, through any number of function calls. libvirt's _volume_snapshot_delete function is a long way from the originating api call, and it is not at all obvious that the commit_base and commit_top arguments to virt_dom.blockCommit() are unvalidated.

Does python have anything like perl's taint mode? If so, it might be worth investigating its use.

Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
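[Editor's note] Python has no built-in equivalent of Perl's taint mode, but the "marking that follows the data" described above can be approximated by hand with a wrapper type. A minimal sketch -- the class and method names here are illustrative, not anything that exists in Nova:

```python
class Tainted(object):
    """Wrap unvalidated user input so it can't be used by accident.

    The raw value is only released via an explicit validation step,
    so any function receiving a Tainted knows the data is unchecked.
    """

    def __init__(self, value):
        self._value = value

    def untaint(self, validator):
        """Return the raw value, but only if validator approves it."""
        if not validator(self._value):
            raise ValueError("validation failed for %r" % (self._value,))
        return self._value

    def __str__(self):
        # Refuse implicit stringification into queries, commands, etc.
        raise TypeError("tainted value used without validation")
```

Unlike Perl's interpreter-enforced tainting this is only a convention, but it at least makes the unsafe path loud rather than silent.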
Re: [openstack-dev] [Nova] PTL Candidacy
On 02/04/15 07:20, Michael Still wrote: I'd like another term as Nova PTL, if you'll have me. ... I think it's a good idea also to examine briefly some statistics about specs:

Juno: approved but not implemented: 40; implemented: 49
Kilo: approved but not implemented: 30; implemented: 32

Hi, Michael,

It has been my impression over at least the last 2 releases that the most significant barrier to progress in Nova is the lack of core reviewer bandwidth. This affects not only feature development, but also the much less sexy paying down of technical debt. There have been various attempts to redefine roles and create new processes, but no attempt that I have seen to address the underlying fundamental issue: the lack of people who can +2 patches.

There is a discussion currently ongoing on this list, 'The Evolution of core developer to maintainer', which contains a variety of proposals. However, none of these will gain anything close to a consensus. The result of this will be that none of them will be implemented. We will be left by default with the status quo, and the situation will continue not to improve despite the new processes we will invent instead.

The only way we are going to change the status quo is by fiat. We should of course make every effort to get as many people on board as possible. However, when change is required but nobody can agree precisely which change, we need positive leadership from the PTL. Would you like to take a position on how to improve core reviewer throughput in the next cycle?

Thanks,

Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [nova] Nova compute will delete all your instances if you change its hostname
Gary Kotton originally posted this bug against the VMware driver: https://bugs.launchpad.net/nova/+bug/1419785

I posted a proposed patch to fix this here: https://review.openstack.org/#/c/158269/1

However, Dan Smith pointed out that the bug can actually be triggered against any driver in a manner not addressed by the above patch alone. I have confirmed this against a libvirt setup as follows:

1. Create some instances
2. Shutdown n-cpu
3. Change hostname
4. Restart n-cpu

Nova compute will delete all instances in libvirt, but continue to report them as ACTIVE and Running.

There are 2 parts to this issue:

1. _destroy_evacuated_instances() should do a better job of sanity checking before performing such a drastic action.

2. The underlying issue is the definition and use of instance.host, instance.node, compute_node.host and compute_node.hypervisor_hostname.

(1) is belt and braces. It's very important, but I want to focus on (2) here.

Instantly you'll notice some inconsistent naming here, so to clarify:

* instance.host == compute_node.host == Nova compute's 'host' value.
* instance.node == compute_node.hypervisor_hostname == an identifier which represents a hypervisor.

Architecturally, I'd argue that these mean:

* Host: A Nova communication endpoint for a hypervisor.
* Hypervisor: The physical location of a VM.

Note that in the above case the libvirt driver changed the hypervisor identifier despite the fact that the hypervisor had not changed, only its communication endpoint.

I propose the following:

* ComputeNode describes 1 hypervisor.
* ComputeNode maps 1 hypervisor to 1 compute host.
* A ComputeNode is identified by a hypervisor_id.
* hypervisor_id represents the physical location of running VMs, independent of a compute host.

We've renamed compute_node.hypervisor_hostname to compute_node.hypervisor_id. This resolves some confusion, because it asserts that the identity of the hypervisor is tied to the data describing VMs, not the host which is running it. 
In fact, for the VMware and Ironic drivers it has never been a hostname. VMware[1] and Ironic don't require any changes here. Other drivers will need to be modified so that get_available_nodes() returns a persistent value rather than just the hostname. A reasonable default implementation of this would be to write a uuid to a file which lives with VM data and return its contents. If the hypervisor has a native concept of a globally unique identifier, that should be used instead.

ComputeNode.hypervisor_id is unique. The hypervisor is unique (there is physically only 1 of it), so it does not make sense to have multiple representations of it and its associated resources.

An Instance's location is its hypervisor, wherever that may be, so Instance.host could be removed. This isn't strictly necessary, but it is redundant as the communication endpoint is available via ComputeNode. If we wanted to support the possibility of changing a communication endpoint at some point, it would also make that operation trivial. Thinking blue sky, it would also open the future possibility of multiple communication endpoints for a single hypervisor.

There is a data migration issue associated with changing a driver's reported hypervisor id. The bug linked below fudges it, but if we were doing it for all drivers I believe it could be handled efficiently by passing the instance list already collected by ComputeManager.init_host to the driver at startup.

My proposed patch above fixes a potentially severe issue for users of the VMware and Ironic drivers. In conjunction with a move to a persistent hypervisor id for other drivers, it also fixes the related issue described above across the board. I would like to go forward with my proposed fix as it has an immediate benefit, and I'm happy to work on the persistent hypervisor id for other drivers. 
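[Editor's note] The default implementation suggested above -- persist a generated uuid alongside the VM data -- might look something like this. The function name and path layout are illustrative, not Nova code:

```python
import os
import uuid

def get_persistent_hypervisor_id(state_dir):
    """Return a stable hypervisor id, generating one on first use.

    state_dir should live with the VM data (e.g. the instances
    directory), so the id survives a hostname change but stays with
    the hypervisor's storage.
    """
    id_file = os.path.join(state_dir, "hypervisor_id")
    try:
        with open(id_file) as f:
            return f.read().strip()
    except IOError:
        hypervisor_id = str(uuid.uuid4())
        with open(id_file, "w") as f:
            f.write(hypervisor_id)
        return hypervisor_id
```

Because the id is read back from the same storage that holds the VMs, two differently-named computes pointed at the same hypervisor would report the same node, rather than appearing as two.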
Matt [1] Modulo bugs: https://review.openstack.org/#/c/159481/ -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute
On 25/02/15 20:18, Joe Gordon wrote: On Fri, Feb 20, 2015 at 3:48 AM, Matthew Booth mbo...@redhat.com mailto:mbo...@redhat.com wrote: Gary Kotton came across a doozy of a bug recently: https://bugs.launchpad.net/nova/+bug/1419785 In short, when you start a Nova compute, it will query the driver for instances and compare that against the expected host of the the instance according to the DB. If the driver is reporting an instance the DB thinks is on a different host, it assumes the instance was evacuated while Nova compute was down, and deletes it on the hypervisor. However, Gary found that you trigger this when starting up a backup HA node which has a different `host` config setting. i.e. You fail over, and the first thing it does is delete all your instances. Gary and I both agree on a couple of things: 1. Deleting all your instances is bad 2. HA nova compute is highly desirable for some drivers There is a deeper issue here, that we are trying to work around. Nova was never designed to have entire systems running behind a nova-compute. It was designed to have one nova-compute per 'physical box that runs instances' There have been many discussions in the past on how to fix this issue (by adding a new point in nova where clustered systems can plug in), but if I remember correctly the gotcha was no one was willing to step up to do it. There are 2 unrelated concepts of clusters here. The VMware driver has both, which seems to result in some confusion. As it happens, this issue doesn't relate to either of them. Firstly, there's a VMware cluster. This presents itself, and is managed as one, single hypervisor. The only issue Nova has with VMware clusters is in resource tracker, because its resources aren't contiguous. i.e. It's an accounting issue. It would be good to have a solution to this, but it doesn't seem to be causing many real world problems. Secondly there's the concept of 'nodes', whereby 1 nova compute can manage multiple hypervisors. 
On VMware this means managing multiple clusters, because 1 VMware cluster == 1 hypervisor. Both the Ironic and VMware drivers can do this. This issue relates to the co-location of nova compute with the hypervisor. In the case of both Ironic and VMware, it is not possible to co-locate nova compute with the hypervisor. That means that nova compute must exist separately and be pointed at the hypervisor, which raises the possibility that 2 different nova computes might accidentally be pointed at the same hypervisor. As Gary discovered, this makes bad things happen. Note that no clusters of either kind described above are required to trigger this bug. I have a new patch for this here, btw: https://review.openstack.org/#/c/158269/ . I'd be grateful for more eyes on it. Thanks, Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute
On 25/02/15 11:51, Radoslav Gerganov wrote: On 02/23/2015 03:18 PM, Matthew Booth wrote: On 23/02/15 12:13, Gary Kotton wrote: On 2/23/15, 2:05 PM, Matthew Booth mbo...@redhat.com wrote: On 20/02/15 11:48, Matthew Booth wrote: Gary Kotton came across a doozy of a bug recently: https://bugs.launchpad.net/nova/+bug/1419785 In short, when you start a Nova compute, it will query the driver for instances and compare that against the expected host of the the instance according to the DB. If the driver is reporting an instance the DB thinks is on a different host, it assumes the instance was evacuated while Nova compute was down, and deletes it on the hypervisor. However, Gary found that you trigger this when starting up a backup HA node which has a different `host` config setting. i.e. You fail over, and the first thing it does is delete all your instances. Gary and I both agree on a couple of things: 1. Deleting all your instances is bad 2. HA nova compute is highly desirable for some drivers We disagree on the approach to fixing it, though. Gary posted this: https://review.openstack.org/#/c/154029/ I've already outlined my objections to this approach elsewhere, but to summarise I think this fixes 1 symptom of a design problem, and leaves the rest untouched. If the value of nova compute's `host` changes, then the assumption that instances associated with that compute can be identified by the value of instance.host becomes invalid. This assumption is pervasive, so it breaks a lot of stuff. The worst one is _destroy_evacuated_instances(), which Gary found, but if you scan nova/compute/manager for the string 'self.host' you'll find lots of them. For example, all the periodic tasks are broken, including image cache management, and the state of ResourceTracker will be unusual. 
Worse, whenever a new instance is created it will have a different value of instance.host, so instances running on a single hypervisor will become partitioned based on which nova compute was used to create them. In short, the system may appear to function superficially, but it's unsupportable. I had an alternative idea. The current assumption is that the `host` managing a single hypervisor never changes. If we break that assumption, we break Nova, so we could assert it at startup and refuse to start if it's violated. I posted this VMware-specific POC: https://review.openstack.org/#/c/154907/ However, I think I've had a better idea. Nova creates ComputeNode objects for its current configuration at startup which, amongst other things, are a map of host:hypervisor_hostname. We could assert when creating a ComputeNode that hypervisor_hostname is not already associated with a different host, and refuse to start if it is. We would give an appropriate error message explaining that this is a misconfiguration. This would prevent the user from hitting any of the associated problems, including the deletion of all their instances. I have posted a patch implementing the above for review here: https://review.openstack.org/#/c/158269/ I have to look at what you have posted. I think that this topic is something that we should speak about at the summit and this should fall under some BP and well defined spec. I really would not like to see existing installations being broken if and when this patch lands. It may also affect Ironic as it works on the same model. This patch will only affect installations configured with multiple compute hosts for a single hypervisor. These are already broken, so this patch will at least let them know if they haven't already noticed. It won't affect Ironic, because they configure all compute hosts to have the same 'host' value. An Ironic user would only notice this patch if they accidentally misconfigured it, which is the intended behaviour. 
Incidentally, I also support more focus on the design here. Until we come up with a better design, though, we need to do our best to prevent non-trivial corruption from a trivial misconfiguration. I think we need to merge this, or something like it, now and still have a summit discussion. Matt Hi Matt, I already posted a comment on your patch but I'd like to reiterate here as well. Currently the VMware driver is using the cluster name as hypervisor_hostname which is a problem because you can have different clusters with the same name. We already have a critical bug filed for this: https://bugs.launchpad.net/nova/+bug/1329261 There was an attempt to fix this by using a combination of vCenter UUID + cluster_name but it was rejected because this combination was not considered a 'real' hostname. I think that if we go for a DB schema change we can fix both issues by renaming hypervisor_hostname to hypervisor_id and make it unique. What do you think? Well, I think hypervisor_id makes more sense than hypervisor_hostname. The latter is a confusing. However, I'd prefer
Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute
On 20/02/15 11:48, Matthew Booth wrote: Gary Kotton came across a doozy of a bug recently: https://bugs.launchpad.net/nova/+bug/1419785 In short, when you start a Nova compute, it will query the driver for instances and compare that against the expected host of the the instance according to the DB. If the driver is reporting an instance the DB thinks is on a different host, it assumes the instance was evacuated while Nova compute was down, and deletes it on the hypervisor. However, Gary found that you trigger this when starting up a backup HA node which has a different `host` config setting. i.e. You fail over, and the first thing it does is delete all your instances. Gary and I both agree on a couple of things: 1. Deleting all your instances is bad 2. HA nova compute is highly desirable for some drivers We disagree on the approach to fixing it, though. Gary posted this: https://review.openstack.org/#/c/154029/ I've already outlined my objections to this approach elsewhere, but to summarise I think this fixes 1 symptom of a design problem, and leaves the rest untouched. If the value of nova compute's `host` changes, then the assumption that instances associated with that compute can be identified by the value of instance.host becomes invalid. This assumption is pervasive, so it breaks a lot of stuff. The worst one is _destroy_evacuated_instances(), which Gary found, but if you scan nova/compute/manager for the string 'self.host' you'll find lots of them. For example, all the periodic tasks are broken, including image cache management, and the state of ResourceTracker will be unusual. Worse, whenever a new instance is created it will have a different value of instance.host, so instances running on a single hypervisor will become partitioned based on which nova compute was used to create them. In short, the system may appear to function superficially, but it's unsupportable. I had an alternative idea. 
The current assumption is that the `host` managing a single hypervisor never changes. If we break that assumption, we break Nova, so we could assert it at startup and refuse to start if it's violated. I posted this VMware-specific POC: https://review.openstack.org/#/c/154907/ However, I think I've had a better idea. Nova creates ComputeNode objects for its current configuration at startup which, amongst other things, are a map of host:hypervisor_hostname. We could assert when creating a ComputeNode that hypervisor_hostname is not already associated with a different host, and refuse to start if it is. We would give an appropriate error message explaining that this is a misconfiguration. This would prevent the user from hitting any of the associated problems, including the deletion of all their instances. I have posted a patch implementing the above for review here: https://review.openstack.org/#/c/158269/ Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute
On 23/02/15 12:13, Gary Kotton wrote: On 2/23/15, 2:05 PM, Matthew Booth mbo...@redhat.com wrote: On 20/02/15 11:48, Matthew Booth wrote: Gary Kotton came across a doozy of a bug recently: https://bugs.launchpad.net/nova/+bug/1419785 In short, when you start a Nova compute, it will query the driver for instances and compare that against the expected host of the the instance according to the DB. If the driver is reporting an instance the DB thinks is on a different host, it assumes the instance was evacuated while Nova compute was down, and deletes it on the hypervisor. However, Gary found that you trigger this when starting up a backup HA node which has a different `host` config setting. i.e. You fail over, and the first thing it does is delete all your instances. Gary and I both agree on a couple of things: 1. Deleting all your instances is bad 2. HA nova compute is highly desirable for some drivers We disagree on the approach to fixing it, though. Gary posted this: https://review.openstack.org/#/c/154029/ I've already outlined my objections to this approach elsewhere, but to summarise I think this fixes 1 symptom of a design problem, and leaves the rest untouched. If the value of nova compute's `host` changes, then the assumption that instances associated with that compute can be identified by the value of instance.host becomes invalid. This assumption is pervasive, so it breaks a lot of stuff. The worst one is _destroy_evacuated_instances(), which Gary found, but if you scan nova/compute/manager for the string 'self.host' you'll find lots of them. For example, all the periodic tasks are broken, including image cache management, and the state of ResourceTracker will be unusual. Worse, whenever a new instance is created it will have a different value of instance.host, so instances running on a single hypervisor will become partitioned based on which nova compute was used to create them. 
In short, the system may appear to function superficially, but it's unsupportable. I had an alternative idea. The current assumption is that the `host` managing a single hypervisor never changes. If we break that assumption, we break Nova, so we could assert it at startup and refuse to start if it's violated. I posted this VMware-specific POC: https://review.openstack.org/#/c/154907/ However, I think I've had a better idea. Nova creates ComputeNode objects for its current configuration at startup which, amongst other things, are a map of host:hypervisor_hostname. We could assert when creating a ComputeNode that hypervisor_hostname is not already associated with a different host, and refuse to start if it is. We would give an appropriate error message explaining that this is a misconfiguration. This would prevent the user from hitting any of the associated problems, including the deletion of all their instances. I have posted a patch implementing the above for review here: https://review.openstack.org/#/c/158269/ I have to look at what you have posted. I think that this topic is something that we should speak about at the summit and this should fall under some BP and well defined spec. I really would not like to see existing installations being broken if and when this patch lands. It may also affect Ironic as it works on the same model. This patch will only affect installations configured with multiple compute hosts for a single hypervisor. These are already broken, so this patch will at least let them know if they haven't already noticed. It won't affect Ironic, because they configure all compute hosts to have the same 'host' value. An Ironic user would only notice this patch if they accidentally misconfigured it, which is the intended behaviour. Incidentally, I also support more focus on the design here. Until we come up with a better design, though, we need to do our best to prevent non-trivial corruption from a trivial misconfiguration. 
I think we need to merge this, or something like it, now and still have a summit discussion. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] The strange case of osapi_compute_unique_server_name_scope
On 20/02/15 20:15, Andrew Bogott wrote: On 2/20/15 9:06 AM, Mike Dorman wrote: I can report that we do use this option (‘global' setting.) We have to enforce name uniqueness for instances’ integration with some external systems (namely AD and Spacewalk) which require unique naming. However, we also do some external name validation which I think effectively enforces uniqueness as well. So if this were deprecated, I don’t know if it would directly affect us for our specific situation. Other operators, anyone else using osapi_compute_unique_server_name_scope? I use it! And, in fact, added it in the first place :( I have no real recall of what we concluded when originally discussing the associated race. The feature is useful to me and I'd love it if it could be moved into the db to fix the race, but I concede that it's a pretty big can o' worms, so if no one else cries out in pain I can live with it being deprecated. Ok, with 2 identified users I'm going to abandon my patch to deprecate with no replacement. My current plan is to leave the race there with a comment in the code and the config documentation. Perhaps we can fix this properly at some point when we get the online schema changes, but for the moment it seems like a lot of complication for a relatively small problem. Do you use the global or project scope, btw? Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [nova][vmware][ironic] Configuring active/passive HA Nova compute
Gary Kotton came across a doozy of a bug recently: https://bugs.launchpad.net/nova/+bug/1419785

In short, when you start a Nova compute, it will query the driver for instances and compare that against the expected host of the instance according to the DB. If the driver is reporting an instance the DB thinks is on a different host, it assumes the instance was evacuated while Nova compute was down, and deletes it on the hypervisor. However, Gary found that you trigger this when starting up a backup HA node which has a different `host` config setting. i.e. You fail over, and the first thing it does is delete all your instances.

Gary and I both agree on a couple of things:

1. Deleting all your instances is bad
2. HA nova compute is highly desirable for some drivers

We disagree on the approach to fixing it, though. Gary posted this: https://review.openstack.org/#/c/154029/

I've already outlined my objections to this approach elsewhere, but to summarise I think this fixes 1 symptom of a design problem, and leaves the rest untouched. If the value of nova compute's `host` changes, then the assumption that instances associated with that compute can be identified by the value of instance.host becomes invalid. This assumption is pervasive, so it breaks a lot of stuff. The worst one is _destroy_evacuated_instances(), which Gary found, but if you scan nova/compute/manager for the string 'self.host' you'll find lots of them. For example, all the periodic tasks are broken, including image cache management, and the state of ResourceTracker will be unusual.

Worse, whenever a new instance is created it will have a different value of instance.host, so instances running on a single hypervisor will become partitioned based on which nova compute was used to create them. In short, the system may appear to function superficially, but it's unsupportable.

I had an alternative idea. The current assumption is that the `host` managing a single hypervisor never changes. 
If we break that assumption, we break Nova, so we could assert it at startup and refuse to start if it's violated. I posted this VMware-specific POC: https://review.openstack.org/#/c/154907/ However, I think I've had a better idea. Nova creates ComputeNode objects for its current configuration at startup which, amongst other things, are a map of host:hypervisor_hostname. We could assert when creating a ComputeNode that hypervisor_hostname is not already associated with a different host, and refuse to start if it is. We would give an appropriate error message explaining that this is a misconfiguration. This would prevent the user from hitting any of the associated problems, including the deletion of all their instances. We can still do active/passive HA! If we configure both nodes in the active/passive cluster identically, including with the same value of `host`, I don't see why this shouldn't work today. I don't even think the configuration is onerous. All we would be doing is preventing the user from accidentally running a misconfigured HA which leads to inconsistent state, and will eventually require manual cleanup. We would still have to be careful that we don't bring up both nova computes simultaneously. The VMware driver, at least, has hardcoded assumptions that it is the only writer in certain circumstances. That problem would have to be handled separately, perhaps at the messaging layer. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] The strange case of osapi_compute_unique_server_name_scope
On 19/02/15 18:57, Jay Pipes wrote: On 02/19/2015 05:18 AM, Matthew Booth wrote: Nova contains a config variable osapi_compute_unique_server_name_scope. Its help text describes it pretty well: When set, compute API will consider duplicate hostnames invalid within the specified scope, regardless of case. Should be empty, project or global. So, by default hostnames are not unique, but depending on this setting they could be unique either globally or in the scope of a project. Ideally a unique constraint would be enforced by the database but, presumably because this is a config variable, that isn't the case here. Instead it is enforced in code, but the code which does this predictably races. My first attempt to fix this using the obvious SQL solution appeared to work, but actually fails in MySQL as it doesn't support that query structure[1][2]. SQLite and PostgreSQL do support it, but they don't support the query structure which MySQL supports. Note that this isn't just a syntactic thing. It looks like it's still possible to do this if we compound the workaround with a second workaround, but I'm starting to wonder if we'd be better fixing the design. First off, do we need this config variable? Is anybody actually using it? I suspect the answer's going to be yes, but it would be extremely convenient if it's not. Assuming this configurability is required, is there any way we can instead use it to control a unique constraint in the db at service startup? This would be something akin to a db migration. How do we manage those? Related to the above, I'm not 100% clear on which services run this code. Is it possible for different services to have a different configuration of this variable, and does that make sense? If so, that would preclude a unique constraint in the db. Is there a bug that you are attempting to fix here? If not, I'd suggest just leaving this code as it is... The bug is the race. 
If a user sets osapi_compute_unique_server_name_scope they're presumably expecting the associated uniqueness guarantees, but we're not providing them. John suggested I deprecate it and see who complains: https://review.openstack.org/157731 In the meantime, I'm starting to think the most prudent course of action would be to not fix the race: it's not the most important race in that function, it's been broken for a long time, and I suspect few people are using it. We could document that it's broken... In fact, I might add that to the deprecation notice. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
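The race described in this thread is the classic check-then-insert pattern: the application checks for a duplicate hostname and then inserts, leaving a window in which a concurrent request can slip through. A database-level unique constraint closes that window. A minimal sketch with SQLite and an illustrative schema (table and column names are not Nova's; the expression index models the case-insensitive, per-project scope):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, "
             "project_id TEXT, hostname TEXT)")
# A case-insensitive unique index enforces the guarantee in the database
# itself, so two racing requests cannot both slip past an application-level
# "is this hostname taken?" check.
conn.execute("CREATE UNIQUE INDEX uniq_hostname "
             "ON instances (project_id, lower(hostname))")

conn.execute("INSERT INTO instances (project_id, hostname) "
             "VALUES ('p1', 'web01')")
try:
    # Same project, same hostname modulo case: rejected atomically.
    conn.execute("INSERT INTO instances (project_id, hostname) "
                 "VALUES ('p1', 'WEB01')")
    raced = False
except sqlite3.IntegrityError:
    raced = True
```

The catch raised in the thread stands: because the scope is a config option (empty, project or global), the constraint would have to be created or dropped to match configuration, which is effectively a schema migration at service startup.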
[openstack-dev] [nova] The strange case of osapi_compute_unique_server_name_scope
Nova contains a config variable osapi_compute_unique_server_name_scope. Its help text describes it pretty well: When set, compute API will consider duplicate hostnames invalid within the specified scope, regardless of case. Should be empty, project or global. So, by default hostnames are not unique, but depending on this setting they could be unique either globally or in the scope of a project. Ideally a unique constraint would be enforced by the database but, presumably because this is a config variable, that isn't the case here. Instead it is enforced in code, but the code which does this predictably races. My first attempt to fix this using the obvious SQL solution appeared to work, but actually fails in MySQL as it doesn't support that query structure[1][2]. SQLite and PostgreSQL do support it, but they don't support the query structure which MySQL supports. Note that this isn't just a syntactic thing. It looks like it's still possible to do this if we compound the workaround with a second workaround, but I'm starting to wonder if we'd be better fixing the design. First off, do we need this config variable? Is anybody actually using it? I suspect the answer's going to be yes, but it would be extremely convenient if it's not. Assuming this configurability is required, is there any way we can instead use it to control a unique constraint in the db at service startup? This would be something akin to a db migration. How do we manage those? Related to the above, I'm not 100% clear on which services run this code. Is it possible for different services to have a different configuration of this variable, and does that make sense? If so, that would preclude a unique constraint in the db. Thanks, Matt [1] Which has prompted me to get the test_db_api tests running on MySQL. 
See this series if you're interested: https://review.openstack.org/#/c/156299/ [2] For specifics, see my ramblings here: https://review.openstack.org/#/c/141115/7/nova/db/sqlalchemy/api.py,cm line 2547 -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
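The "obvious SQL solution" the post alludes to is a single-statement conditional insert, where the existence check and the insert happen atomically with no window in between. The exact query in the review differed and reportedly failed on MySQL; this sketch just shows the general shape of the pattern on SQLite (illustrative schema, not Nova's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, hostname TEXT)")

def create_if_unique(hostname):
    # The NOT EXISTS subquery and the INSERT are one statement, so there is
    # no check-then-insert window for a racing request to exploit.
    before = conn.total_changes
    conn.execute(
        "INSERT INTO instances (hostname) "
        "SELECT ? WHERE NOT EXISTS "
        "(SELECT 1 FROM instances WHERE lower(hostname) = lower(?))",
        (hostname, hostname))
    # total_changes tells us whether the insert actually happened.
    return conn.total_changes - before == 1

created = create_if_unique("db01")        # True: no conflicting row
duplicate = create_if_unique("DB01")      # False: blocked by existing row
```

As the post notes, support for this kind of structure differs across backends, which is exactly the portability problem being described.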
Re: [openstack-dev] [nova] Libguestfs: possibility not to use it, even when installed ?
On 18/02/15 18:23, Raphael Glon wrote: Hi, This is about review: https://review.openstack.org/#/c/156633/ 1 line, can be controversial Its purpose is to add the possibility not to use libguestfs for data injection in nova, even when installed. Not disputing that libguestfs should be preferred over fuse mounts for data injection as much as possible, because mounts are more subject to causing security issues (and already have in past nova releases). However, there are a lot of potential cases when libguestfs won't be usable for data injection. This was the case here (fixed): https://bugzilla.redhat.com/show_bug.cgi?id=984409 I encountered a similar case more recently on powerkvm 2.1.0 (a defect in libguestfs) So just saying it could be good to add a simple config flag (set to True by default, to keep the current behaviour untouched) to force nova not to use libguestfs without having to uninstall it and thus prevent other users on the host from using it. A big -1 to this. You seem to be saying that there were bugs in a library, which were fixed. News at 11. This isn't a realistic way to manage a large software stack. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
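For context on what the review proposed (and the reply objects to), the mechanics are a simple boolean switch in front of backend selection. This is a toy model of the proposal's shape, not the actual patch; the flag name and backend labels are hypothetical:

```python
# Hypothetical flag name; the review proposed a boolean config option along
# these lines, defaulting to the current behaviour (prefer libguestfs when
# it is importable).
CONF = {"use_libguestfs": True}

def choose_injection_method(libguestfs_importable):
    # With the flag set and the library usable, keep today's behaviour;
    # otherwise fall back to fuse-style mounting.
    if CONF["use_libguestfs"] and libguestfs_importable:
        return "libguestfs"
    return "fuse-mount"

default_choice = choose_injection_method(True)
CONF["use_libguestfs"] = False
forced_fallback = choose_injection_method(True)
```

The objection in the reply is not to the mechanics but to the rationale: working around library bugs with a permanent opt-out flag.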
[openstack-dev] [nova][vmware] HA and image caching
I was discussing the problems of configuring active/passive HA earlier where the active and passive nodes have a different `host` configured. There are lots of problems, as I've mentioned before, but Gary pointed out that it probably affects image cache management, too. This had the potential for unsubtle breakage, so I decided to check it out in a bit more detail. The bug is real, but fortunately doesn't delete anything (which is itself a bug): storage_users.register_storage_use(CONF.instances_path, CONF.host) nodes = storage_users.get_storage_users(CONF.instances_path) This bit of code assumes shared storage between different nova nodes which access the same images. We don't have that, so `nodes` above is going to contain only the current node. filters = {'deleted': False, 'soft_deleted': True, 'host': nodes} filtered_instances = objects.InstanceList.get_by_filters(context, filters, expected_attrs=[], use_slave=True) We're getting a list of instances which were created by the current node only. self.driver.manage_image_cache(context, filtered_instances) Which we're passing to the driver. So, the driver is only going to consider images attached to instances which were created by the current node. The effect of this is that just after a switchover, cache management will not consider any instances, and therefore nothing will be expired. This is still bad, but it's not going to delete users' stuff. Incidentally, I remain convinced that we can safely configure active/passive HA if we configure all nodes with the same `host`. That way we wouldn't have to worry about all these edge cases, and we wouldn't need to eventually create a cleanup job to consolidate all instances running on a single hypervisor to have the same 'host'. 
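The failure mode described above can be modelled in a few lines. This is a toy model, not Nova's real signatures: without shared storage, the storage-users registry effectively returns only the current node, so the instance filter only matches instances created with this node's `host`:

```python
# Toy model of the cache-manager filtering described above (names are
# illustrative, not Nova's actual code).
instances = [
    {"uuid": "i-1", "host": "active-node", "image": "img-a"},
    {"uuid": "i-2", "host": "active-node", "image": "img-b"},
]

def images_considered_for_cache_aging(current_host):
    # get_storage_users() assumes shared storage; without it, 'nodes'
    # contains only ourselves.
    nodes = [current_host]
    relevant = [i for i in instances if i["host"] in nodes]
    return {i["image"] for i in relevant}

# On the node that created the instances, everything is visible:
on_active = images_considered_for_cache_aging("active-node")
# Just after failover to a passive node with a different `host`, nothing
# matches, so no cache entries are considered and nothing is expired.
after_failover = images_considered_for_cache_aging("passive-node")
```

This matches the conclusion above: the bug stalls cache expiry after a switchover rather than deleting anything.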
Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [vmware][nova] Prevent HA configuration with different hostnames
On 11/02/15 15:49, Gary Kotton wrote: Hi, I do not think that that is a healthy solution. That effectively would render a cluster down if the compute node goes down. That would be a real disaster. The ugly work around is setting the host names to be the same value. I don't think that's an ugly work around. I think that's the only currently viable solution. This is something that we should discuss at the next summit and I would hope to propose a topic to talk about. Sounds like a good plan. However, given that the bug is marked Critical I was assuming we wanted a more expedient fix, which is what I've proposed. Matt Thanks Gary On 2/11/15, 5:31 PM, Matthew Booth mbo...@redhat.com wrote: I just posted this: https://review.openstack.org/#/c/154907/ as an alternative fix for critical bug: https://bugs.launchpad.net/nova/+bug/1419785 I've just knocked this up quickly for illustration: it obviously needs plenty of cleanup. I have confirmed that it works, though. Before I take it any further, though, I'd like to get some feedback on the approach. I prefer this to the alternative, because the underlying problem is deeper than supporting evacuate. I'd prefer to be honest with the user and just say it ain't gonna work. The alternative would leave Nova running in a broken state, leaving inconsistent state in its wake as it runs. 
Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [vmware][nova] Prevent HA configuration with different hostnames
On 11/02/15 16:40, Gary Kotton wrote: On 2/11/15, 6:35 PM, Sylvain Bauza sba...@redhat.com wrote: On 11/02/2015 17:04, Gary Kotton wrote: I posted a fix that does not break things and supports HA. https://review.openstack.org/154029 Just let's be clear, HA is *not* supported by Nova now. That is not correct. It is actually supported if the host_ip is the same. If the host_ip is not the same then there is an issue when one of the compute nodes restarts - it will delete all instances that do not have its host_ip. FWIW, I suspect you're correct. My patch enforces that. So, check out the init code from compute manager: def init_host(self): Initialization for a standalone compute service. self.driver.init_host(host=self.host) The driver is being configured with the Compute service's 'host' config variable. Let's scan through and look what else uses that: def _update_resource_tracker(self, context, instance): Let the resource tracker know that an instance has changed state. if (instance['host'] == self.host and self.driver.node_is_available(instance['node'])): Instances with the old host value will be ignored by _update_resource_tracker. That doesn't sound good. There's _destroy_evacuated_instances(), which you already found :) There's this in _validate_instance_group_policy: group_hosts = group.get_hosts(context, exclude=[instance.uuid]) if self.host in group_hosts: msg = _(Anti-affinity instance group policy was violated.) You've changed group affinity. There's _check_instance_build_time: filters = {'vm_state': vm_states.BUILDING, 'host': self.host} Nova won't check instances created by the old host to see if they're stuck. There's this in _heal_instance_info_cache: db_instances = objects.InstanceList.get_by_host( context, self.host, expected_attrs=[], use_slave=True) This cleanup job won't find instances created by the old host.
There's this in _poll_rebooting_instances: filters = {'task_state': [task_states.REBOOTING, task_states.REBOOT_STARTED, task_states.REBOOT_PENDING], 'host': self.host} Nova won't poll instances created by the old host. This is just a cursory flick through. I'm fairly sure this is going to be a lot of work to fix. My patch just ensures that Nova refuses to start instead of letting bad things happen. If you ensure that 'self.host' in the above code is the same for all HA nodes I don't see why it shouldn't work, though. My patch won't prevent that. Matt The main reason is that compute *nodes* are considered given by the hypervisor (ie. the virt driver ran by the compute manager worker), so if 2 or more hypervisors on two distinct machines are getting the same list of nodes, then you would have duplicates. No. There are no duplicates. -Sylvain On 2/11/15, 5:55 PM, Matthew Booth mbo...@redhat.com wrote: On 11/02/15 15:49, Gary Kotton wrote: Hi, I do not think that that is a healthy solution. That effectively would render a cluster down if the compute node goes down. That would be a real disaster. The ugly work around is setting the host names to be the same value. I don't think that's an ugly work around. I think that's the only currently viable solution. This is something that we should discuss at the next summit and I would hope to propose a topic to talk about. Sounds like a good plan. However, given that the bug is marked Critical I was assuming we wanted a more expedient fix, which is what I've proposed. Matt Thanks Gary On 2/11/15, 5:31 PM, Matthew Booth mbo...@redhat.com wrote: I just posted this: https://review.openstack.org/#/c/154907/ as an alternative fix for critical bug: https://bugs.launchpad.net/nova/+bug/1419785 I've just knocked this up quickly for illustration: it obviously needs plenty of cleanup. I have confirmed that it works, though. Before I take it any further, though, I'd like to get some feedback on the approach. 
I prefer this to the alternative, because the underlying problem is deeper than supporting evacuate. I'd prefer to be honest with the user and just say it ain't gonna work. The alternative would leave Nova running in a broken state, leaving inconsistent state in its wake as it runs. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
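The enumeration earlier in this thread (_update_resource_tracker, _check_instance_build_time, _heal_instance_info_cache, _poll_rebooting_instances) all reduces to one recurring pattern, which this toy sketch models (illustrative names, not Nova's signatures): periodic tasks only consider instances whose `host` matches this service's configured host, which is why identical `host` values across HA nodes sidestep the whole class of problems.

```python
# The recurring compute-manager pattern: filter work by this service's
# CONF.host value.
def owned_by_this_service(conf_host, instance):
    return instance["host"] == conf_host

stale = {"uuid": "i-1", "host": "compute-A"}

# A replacement node brought up with a different name silently ignores the
# instance in every periodic task that uses this filter:
ignored_after_rename = not owned_by_this_service("compute-B", stale)

# With both HA nodes configured with the same `host`, the instance stays
# visible to whichever node is active:
visible_with_shared_host = owned_by_this_service(
    "ha-host", {"uuid": "i-2", "host": "ha-host"})
```

This is only a model of the filtering logic, not a claim that shared-`host` HA is otherwise safe; the thread itself flags that as needing wider discussion.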
[openstack-dev] [vmware][nova] Prevent HA configuration with different hostnames
I just posted this: https://review.openstack.org/#/c/154907/ as an alternative fix for critical bug: https://bugs.launchpad.net/nova/+bug/1419785 I've just knocked this up quickly for illustration: it obviously needs plenty of cleanup. I have confirmed that it works, though. Before I take it any further, though, I'd like to get some feedback on the approach. I prefer this to the alternative, because the underlying problem is deeper than supporting evacuate. I'd prefer to be honest with the user and just say it ain't gonna work. The alternative would leave Nova running in a broken state, leaving inconsistent state in its wake as it runs. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 10/02/15 18:29, Jay Pipes wrote: On 02/10/2015 09:47 AM, Matthew Booth wrote: On 09/02/15 18:15, Jay Pipes wrote: On 02/09/2015 01:02 PM, Attila Fazekas wrote: I do not see why not to use `FOR UPDATE` even with multi-writer. Or does the retry/swap way really solve anything here? snip Have I missed something? Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes. https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ Is that the right link, Jay? I'm taking your word on the write-intent locks not being replicated, but that link seems to say the opposite. This link is better: http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/ Specifically the line: The local record lock held by the started transaction on pxc1 didn't play any part in replication or certification (replication happens at commit time, there was no commit there yet). Thanks, Jay, that's a great article. Based on that, I think I may have misunderstood what you were saying before. I currently understand that the behaviour of select ... for update is correct on Galera, it's just not very efficient. Correct in this case meaning it aborts the transaction due to a correctly detected lock conflict. FWIW, that was pretty much my original understanding, but without the detail. To expand: Galera doesn't replicate write intent locks, but it turns out it doesn't have to for correctness. The reason is that the conflict between a local write intent lock and a remote write, which is replicated, will always be detected during or before local certification.
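The certification argument above can be illustrated with a deliberately simplified toy model (nothing like Galera's real implementation): each transaction records the keys it wrote and the point at which it started, and certification fails if any transaction that committed after that point wrote an overlapping key. Local locks never need to be replicated for this check to catch the conflict.

```python
# Toy certification model. committed holds (commit_seqno, written_keys)
# pairs; a transaction certifies only if nothing that committed after its
# start wrote an overlapping key.
committed = []
seqno = 0

def begin():
    return {"start": seqno, "writes": set()}

def write(txn, key):
    txn["writes"].add(key)

def commit(txn):
    global seqno
    for commit_at, keys in committed:
        if commit_at > txn["start"] and keys & txn["writes"]:
            # Certification failure: in Galera this surfaces to the client
            # as a 'deadlock' error at commit time.
            return False
    seqno += 1
    committed.append((seqno, set(txn["writes"])))
    return True

a, b = begin(), begin()
write(a, "row-1")
write(b, "row-1")
first = commit(a)   # A wins the race
second = commit(b)  # B fails certification at commit time
```

Note what the model does not capture: the "not very efficient" part, i.e. that B does useless work until commit instead of blocking early.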
Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 09/02/15 18:15, Jay Pipes wrote: On 02/09/2015 01:02 PM, Attila Fazekas wrote: I do not see why not to use `FOR UPDATE` even with multi-writer. Or does the retry/swap way really solve anything here? snip Have I missed something? Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes. https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ Is that the right link, Jay? I'm taking your word on the write-intent locks not being replicated, but that link seems to say the opposite. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all] Theory of Testing Cross Project Spec
On 09/02/15 13:59, Sean Dague wrote: Comments are welcomed, please keep typos / grammar as '0' comments to separate them from -1 comments. Also, I ask any -1s to be extremely clear about the core concern of the -1 so we can figure out how to make progress. Incidentally, I think these are good guidelines for comments on all reviews. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
[openstack-dev] [all][oslo.db] Repeatable Read considered harmful
I was surprised recently to discover that MySQL uses repeatable read for transactions by default. Postgres uses read committed by default, and SQLite uses serializable. We don't set the isolation level explicitly anywhere, so our applications are running under different isolation levels depending on backend. This doesn't sound like a good idea to me. It's one thing to support multiple sql syntaxes, but different isolation levels have different semantics. Supporting that is much harder, and currently we're not even trying. I'm aware that the same isolation level on different databases will still have subtly different semantics, but at least they should agree on the big things. I think we should pick one, and it should be read committed. Also note that 'repeatable read' on both MySQL and Postgres is actually snapshot isolation, which isn't quite the same thing. For example, it doesn't get phantom reads. The most important reason I think we need read committed is recovery from concurrent changes within the scope of a single transaction. To date, in Nova at least, this hasn't been an issue as transactions have had an extremely small scope. However, we're trying to expand that scope with the new enginefacade in oslo.db: https://review.openstack.org/#/c/138215/ . With this expanded scope, transaction failure in a library function can't simply be replayed because the transaction scope is larger than the function. So, 3 concrete examples of how repeatable read will make Nova worse: * https://review.openstack.org/#/c/140622/ This was committed to Nova recently. Note how it involves a retry in the case of concurrent change. This works fine, because the retry creates a new transaction. However, if the transaction was larger than the scope of this function this would not work, because each iteration would continue to read the old data. The solution to this is to create a new transaction.
However, because the transaction is outside of the scope of this function, the only thing we can do locally is fail. The caller then has to re-execute the whole transaction, or fail itself. This is a local concurrency problem which can be very easily handled locally, but not if we're using repeatable read. * https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L4749 Nova has multiple functions of this type which attempt to update a key/value metadata table. I'd expect to find multiple concurrency issues with this if I stopped to give it enough thought, but concentrating just on what's there, notice how the retry loop starts a new transaction. If we want to get to a place where we don't do that, with repeatable read we're left failing the whole transaction. * https://review.openstack.org/#/c/136409/ This one isn't upstream, yet. It's broken, and I can't currently think of a solution if we're using repeatable read. The issue is atomic creation of a shared resource. We want to handle a creation race safely. This patch: * Attempts to read the default (it will normally exist) * Creates a new one if it doesn't exist * Goes back to the start if creation failed due to a duplicate. Seems fine, but it will fail because the re-read will continue to not return the new value under repeatable read (no phantom reads). The only way to see the new row is a new transaction. As this will no longer be in the scope of this function, the only solution will be to fail. Read committed could continue without failing. Incidentally, this currently works by using multiple transactions, which we are trying to avoid. It has also been suggested that in this specific instance the default security group could be created with the project. However, that would both be more complicated, because it would require putting a hook into another piece of code, and less robust, because it wouldn't recover if somebody deleted the default security group.
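The third example can be modelled in miniature. This toy MVCC model (not a real database) shows why the retry loop cannot make progress inside a single snapshot-isolation transaction, while a read-committed transaction sees the concurrently created row on re-read:

```python
# Toy MVCC store: 'snapshot' transactions read the database as of their
# start; 'read committed' transactions re-read the latest committed state.
db_versions = [{"default_secgroup": None}]  # committed states, newest last

class Txn:
    def __init__(self, isolation):
        self.isolation = isolation
        self.snapshot = db_versions[-1]  # state as of transaction start

    def read(self, key):
        if self.isolation == "snapshot":
            return self.snapshot[key]
        return db_versions[-1][key]     # read committed: latest state

t = Txn("snapshot")
u = Txn("read committed")

# Both transactions initially see no default security group.
# Then another session creates it and commits:
db_versions.append({"default_secgroup": "sg-default"})

snapshot_reread = t.read("default_secgroup")    # still None: no phantom read
committed_reread = u.read("default_secgroup")   # sees the new row
```

Under snapshot isolation the loop "read, create, got duplicate, re-read" spins on `None` forever within one transaction; under read committed the re-read returns the winner's row and the function can continue.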
To summarise, with repeatable read we're forced to abort the current transaction to deal with certain relatively common classes of concurrency issue, whereas with read committed we can safely recover. If we want to reduce the number of transactions we're using, which we do, the impact of this is going to dramatically increase. We should standardise on read committed. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 04/02/15 19:04, Jay Pipes wrote: On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote: On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote: I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes. i.e. I can create a cluster of 3 mysql dbs and read and write from any of them with certain consistency guarantees. I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them. * Commit will fail if there is a replication conflict foo is a table with a single field, which is its primary key. A: start transaction; B: start transaction; A: insert into foo values(1); B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit A: commit; -- success B: commit; -- KABOOM Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with. It is a failure to certify the writeset, which bubbles up as an InnoDB deadlock error. See my article here: http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/ Which explains this. Yes! And if I can add more information, and I hope I do not make a mistake, I think it's a known issue which comes from MySQL, that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177 It's not an issue with MySQL. It's an issue with any database code that is highly contentious. Almost all highly distributed or concurrent applications need to handle deadlock issues, and the most common way to handle deadlock issues on database records is using a retry technique. There's nothing new about that with Galera.
The issue with our use of the @_retry_on_deadlock decorator is *not* that the retry decorator is not needed, but rather it is used too frequently. The compare-and-swap technique I describe in the article above dramatically reduces the number of deadlocks that occur (and need to be handled by the @_retry_on_deadlock decorator) and dramatically reduces the contention over critical database sections. I'm still confused as to how this code got there, though. We shouldn't be hitting Galera lock contention (reported as deadlocks) if we're using a single master, which I thought we were. Does this mean either: A. There are deployments using multi-master? B. These are really deadlocks? If A, is this something we need to continue to support? Thanks, Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
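The compare-and-swap technique referenced above replaces SELECT FOR UPDATE with an UPDATE whose WHERE clause includes the value the caller expects to be current; the rowcount then reveals whether the caller won the race, with no row lock held across statements for Galera to contend over. A minimal sketch with SQLite (illustrative schema and function, not Nova's code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compute_nodes "
             "(id INTEGER PRIMARY KEY, vcpus_used INTEGER)")
conn.execute("INSERT INTO compute_nodes VALUES (1, 4)")

def claim_vcpus(node_id, expected, extra):
    # Compare-and-swap: the expected current value goes into the WHERE
    # clause; rowcount tells us whether we won the race. On a miss the
    # caller re-reads and retries, rather than holding a lock.
    cur = conn.execute(
        "UPDATE compute_nodes SET vcpus_used = ? "
        "WHERE id = ? AND vcpus_used = ?",
        (expected + extra, node_id, expected))
    return cur.rowcount == 1

won = claim_vcpus(1, expected=4, extra=2)   # value was 4: swap succeeds
lost = claim_vcpus(1, expected=4, extra=2)  # value is now 6: swap misses
```

On a miss the caller retries with a freshly read value, which is where the earlier isolation-level discussion bites: the re-read must be able to see the concurrent change.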
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 05/02/15 11:11, Sahid Orentino Ferdjaoui wrote: I'm still confused as to how this code got there, though. We shouldn't be hitting Galera lock contention (reported as deadlocks) if we're using a single master, which I thought we were. Does this mean either: I guess we can hit a lock contention even in single master. I don't think so, but you can certainly still have real deadlocks. They're bugs, though. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [nova] stuck patches at the nova IRC meeting
On 05/02/15 15:02, Alexis Lee wrote: Sean Dague said on Wed, Feb 04, 2015 at 09:51:30AM -0500: As there has been a bunch of concern around patches getting lost or stuck, I wanted to re-announce the fact that we've got a dedicated slot at the weekly Nova meeting for just those sorts of things. The slot turned into everyone talking at once fairly quickly this week. That led to several patches not getting a real discussion. May I suggest stricter moderation? EG a short phase to propose items, then work through them 1 by 1. Or, we take items one by one according to who shouts fastest but ask people not to interrupt. +1 Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 05/02/15 04:30, Mike Bayer wrote: Galera doesn't change anything here. I'm really not sure what the fuss is about, frankly. because we’re trying to get Galera to actually work as a load balanced cluster to some degree, at least for reads. Yeah, the use case of concern here is consecutive RPC transactions from a single remote client, which can't reasonably be in the same transaction. This affects semantics visible to the end-user. In Nova, they might do: $ nova aggregate-create ... $ nova aggregate-details ... Should they expect that the second command might fail if they don't pause long enough between the 2? Should they retry until it succeeds? This example is a toy, but I would expect to find many other more subtle examples. Otherwise I’m not really sure why we have to bother with Galera at all. If we just want a single MySQL server that has a warm standby for failover, why aren’t we just using that capability straight from MySQL. Then we get “SELECT FOR UPDATE” and everything else back. Actually I think this is a misconception. If I have understood correctly[1], Galera *does* work with select for update. Use of select for update on a single node will work exactly as normal with blocking behaviour. Use of select for update across 2 nodes will not block, but fail on commit if there was lock contention. Galera’s “multi master” capability is already in the trash for us, and it seems like “multi-slave” is only marginally useful either, the vast majority of openstack has to be 100% pointed at just one node to work correctly. It's not necessarily in the trash, but given that the semantics are different (fail on commit rather than block) we'd need to do more work to support them. It sounds to me that we want to defer that rather than try to fix it now. i.e. multi-master is currently unsupport(ed|able). We could add an additional decorator to enginefacade which would re-execute a @writer block if it detected Galera lock contention. 
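The decorator floated above would look something like this sketch. The exception class here is a stand-in for oslo_db.exception.DBDeadlock (which is how Galera certification failures surface), and the decorator name is hypothetical; as noted, it is only safe if the wrapped block has no side effects outside the database:

```python
import functools

class DBDeadlock(Exception):
    """Stand-in for oslo_db.exception.DBDeadlock."""

def retry_on_writeset_conflict(max_attempts=3):
    # Sketch of the proposed enginefacade-style decorator: re-run a writer
    # block when the backend reports a certification failure as a deadlock.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_attempts:
                        raise  # give up after the configured retries
        return wrapper
    return decorator

calls = {"n": 0}

@retry_on_writeset_conflict()
def flaky_writer():
    # Simulates a writer block that loses certification twice, then wins.
    calls["n"] += 1
    if calls["n"] < 3:
        raise DBDeadlock()
    return "committed"

result = flaky_writer()
```

The auditing concern in the reply below is exactly why this is a sketch: re-executing a block that also sent an RPC or wrote a file would duplicate those side effects.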
However, given that we'd have to audit that code for other side-effects, for the moment it sounds like it's safer to fail. Matt [1] Standard caveats apply. -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 05/02/15 11:01, Attila Fazekas wrote: I have a question related to deadlock handling as well. Why is the DBDeadlock exception not caught generally for all api/rpc requests? The mysql recommendation regarding Deadlocks [1]: Normally, you must write your applications so that they are always prepared to re-issue a transaction if it gets rolled back because of a deadlock. This is evil imho, although it may well be pragmatic. A deadlock (a real deadlock, that is) occurs because of a preventable bug in code. It occurs because 2 transactions have attempted to take multiple locks in a different order. Getting this right is hard, but it is achievable. The solution to real deadlocks is to fix the bugs. Galera 'deadlocks' on the other hand are not deadlocks, despite being reported as such (sounds as though this is due to an implementation quirk?). They don't involve 2 transactions holding mutual locks, and there is never any doubt about how to proceed. They involve 2 transactions holding the same lock, and 1 of them committed first. In a real deadlock they wouldn't get as far as commit. This isn't any kind of bug: it's normal behaviour in this environment and you just have to handle it. Now the services are just handling the DBDeadlock in several places. We have some logstash hits for other places even without galera. I haven't had much success with logstash. Could you post a query which would return these? This would be extremely interesting. Instead of throwing 503 to the end user, the request could be repeated `silently`. The users would be able to repeat the request themselves, so the automated repeat should not cause unexpected new problems. Good point: we could argue 'no worse than now', even if it's buggy. The retry limit might be configurable, the exception needs to be watched before anything sent to the db on behalf of the transaction or request. Considering all request handlers as potential deadlock throwers seems much easier than deciding case by case.
Well this happens at the transaction level, and we don't quite have a 1:1 request:transaction relationship. We're moving towards it, but potentially long-running requests will always have to use multiple transactions. However, I take your point. I think retry on transaction failure is something which would benefit from standard handling in a library.

Matt

-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
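[Editor's note] The "standard handling in a library" discussed above can be sketched as a retry decorator. This is a toy illustration only, not oslo.db's actual code; `DBDeadlock` here is a stand-in for the real oslo.db exception class:

```python
import functools
import time


class DBDeadlock(Exception):
    """Stand-in for oslo.db's DBDeadlock exception."""


def retry_on_deadlock(max_retries=3, delay=0.01):
    """Re-issue the whole transaction if it is rolled back with a deadlock.

    The decorated function must contain the *entire* transaction:
    retrying half a transaction is not safe.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)  # brief pause before re-issuing
        return wrapper
    return decorator
```

Note that, per the thread, this blanket retry deliberately does not distinguish a real deadlock (a lock-ordering bug) from a Galera replication conflict — which is exactly the ambiguity Matt objects to elsewhere in the thread.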
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 04/02/15 17:05, Sahid Orentino Ferdjaoui wrote:
>> * Commit will fail if there is a replication conflict
>>
>> foo is a table with a single field, which is its primary key.
>>
>> A: start transaction;
>> B: start transaction;
>> A: insert into foo values(1);
>> B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit
>> A: commit; -- success
>> B: commit; -- KABOOM
>>
>> Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with.
>
> Yes! And if I can add more information (I hope I do not make a mistake): I think it's a known issue which comes from MySQL, which is why we have a decorator to do a retry and so handle this case here:
>
> http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

Right, and that remains a significant source of confusion and obfuscation in the db api. Our db code is littered with races and potential actual deadlocks, but only some functions are decorated. Are they decorated because of real deadlocks, or because of Galera lock contention? The solutions to those 2 problems are very different! Also, hunting deadlocks is hard enough work. Adding the possibility that they might not even be there is just evil.

Incidentally, we're currently looking to replace this stuff with some new code in oslo.db, which is why I'm looking at it.

Matt
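[Editor's note] The quoted A/B example can be modelled in a few lines of Python. This is purely illustrative — real Galera performs certification-based replication across nodes — but it captures the first-committer-wins semantics under discussion: neither "transaction" blocks, and the second to commit a conflicting write fails at commit time.

```python
class ReplicationConflict(Exception):
    """Reported by Galera as a 'deadlock', confusingly."""


class Cluster:
    """Toy first-committer-wins certification: a commit fails if another
    transaction committed a write to the same key after we started."""

    def __init__(self):
        self.committed = {}  # key -> commit version

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, cluster):
        self.cluster = cluster
        self.writes = {}
        # Snapshot of commit versions at transaction start.
        self.seen = dict(cluster.committed)

    def insert(self, key, value):
        self.writes[key] = value  # no blocking, unlike a 'regular' DB

    def commit(self):
        for key in self.writes:
            if self.cluster.committed.get(key, 0) != self.seen.get(key, 0):
                raise ReplicationConflict(key)  # KABOOM
        for key in self.writes:
            self.cluster.committed[key] = self.seen.get(key, 0) + 1
```

Running the quoted example against this model: A's commit succeeds, and B's commit raises ReplicationConflict even though B was never blocked.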
[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes. i.e. I can create a cluster of 3 mysql dbs and read and write from any of them with certain consistency guarantees.

I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them.

* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with. Essentially, anywhere that a regular DB would block, Galera will not block transactions on different nodes. Instead, it will cause one of the transactions to fail on commit. This is still ACID, but the semantics are quite different.

The impact of this is that code which makes correct use of locking may still fail with a 'deadlock'. The solution to this is to either fail the entire operation, or to re-execute the transaction and all its associated code in the expectation that it won't fail next time. As I understand it, these can be eliminated by sending all writes to a single node, although that obviously makes less efficient use of your cluster.

* Write followed by read on a different node can return stale data

During a commit, Galera replicates a transaction out to all other db nodes. Due to its design, Galera knows these transactions will be successfully committed to the remote node eventually[2], but it doesn't commit them straight away.
The remote node will check these outstanding replication transactions for write conflicts on commit, but not for read. This means that you can do:

A: start transaction;
A: insert into foo values(1);
A: commit;
B: select * from foo; -- May not contain the value we inserted above[3]

This means that even for 'synchronous' slaves, if a client makes an RPC call which writes a row to write master A, then another RPC call which expects to read that row from synchronous slave node B, there's no default guarantee that it'll be there. Galera exposes a session variable which will fix this: wsrep_sync_wait (or wsrep_causal_reads on older mysql). However, this isn't the default. It presumably has a performance cost, but I don't know what it is, or how it scales with various workloads.

Because these are semantic issues, they aren't things which can be easily guarded with an if statement. We can't say:

if galera:
    try:
        commit
    except:
        rewind time

If we are to support this DB at all, we have to structure code in the first place to allow for its semantics.

Matt

[1] No, really: I just read a bunch of docs and blogs today. If anybody who is an expert would like to validate/correct that would be great.
[2] http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
[3] http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
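[Editor's note] A hedged sketch of how an application might opt in to a causal read per statement. wsrep_sync_wait is the real Galera session variable mentioned above; the connection object here is a stand-in for whatever DB-API driver is in use, and the exact enable/restore protocol is an assumption for illustration:

```python
def causal_read(conn, query):
    """Run a read that waits for all transactions already replicated to
    this node to be applied first (trading latency for freshness).

    wsrep_sync_wait = 1 enables the causality check for READ statements;
    0 restores the (non-causal) default. `conn` is any object with an
    execute(sql) method.
    """
    conn.execute("SET SESSION wsrep_sync_wait = 1")
    try:
        return conn.execute(query)
    finally:
        conn.execute("SET SESSION wsrep_sync_wait = 0")
```

With this in place, the B-side select in the example above would block until A's replicated transaction has been applied, instead of possibly returning stale data.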
Re: [openstack-dev] [oslo.db][nova] Use of asynchronous slaves in Nova (was: Deprecating use_slave in Nova)
On 30/01/15 19:06, Mike Bayer wrote:
> Matthew Booth mbo...@redhat.com wrote:
>> At some point in the near future, hopefully early in L, we're intending to update Nova to use the new database transaction management in oslo.db's enginefacade.
>>
>> Spec: http://git.openstack.org/cgit/openstack/oslo-specs/plain/specs/kilo/make-enginefacade-a-facade.rst
>> Implementation: https://review.openstack.org/#/c/138215/
>>
>> One of the effects of this is that we will always know when we are in a read-only transaction, or a transaction which includes writes. We intend to use this new contextual information to make greater use of read-only slave databases. We are currently proposing that if an admin has configured a slave database, we will use the slave for *all* read-only transactions. This would make the use_slave parameter passed to some Nova apis redundant, as we would always use the slave where the context allows.
>>
>> However, using a slave database has a potential pitfall when mixed with separate write transactions. A caller might currently:
>>
>> 1. start a write transaction
>> 2. update the database
>> 3. commit the transaction
>> 4. start a read transaction
>> 5. read from the database
>>
>> The client might expect data written in step 2 to be reflected in data read in step 5. I can think of 3 cases here:
>>
>> 1. A short-lived RPC call is using multiple transactions
>>
>> This is a bug which the new enginefacade will help us eliminate. We should not be using multiple transactions in this case. If the reads are in the same transaction as the write: they will be on the master, they will be consistent, and there is no problem. As a bonus, lots of these will be race conditions, and we'll fix at least some.
>>
>> 2. A long-lived task is using multiple transactions between long-running sub-tasks
>>
>> In this case, for example creating a new instance, we genuinely want multiple transactions: we don't want to hold a database transaction open while we copy images around. However, I can't immediately think of a situation where we'd write data, then subsequently want to read it back from the db in a read-only transaction. I think we will typically be updating state, meaning it's going to be a succession of write transactions.
>>
>> 3. Separate RPC calls from a remote client
>>
>> This seems potentially problematic to me. A client makes an RPC call to create a new object. The client subsequently tries to retrieve the created object, and gets a 404.
>>
>> Summary: 1 is a class of bugs which we should be able to find fairly mechanically through unit testing. 2 probably isn't a problem in practice? 3 seems like a problem, unless consumers of cloud services are supposed to expect that sort of thing.
>>
>> I understand that slave databases can occasionally get very behind. How behind is this in practice? How do we use use_slave currently? Why do we need a use_slave parameter passed in via rpc, when it should be apparent to the developer whether a particular task is safe for out-of-date data? Any chance they have some kind of barrier mechanism? e.g. block until the current state contains transaction X. General comments on the usefulness of slave databases, and the desirability of making maximum use of them?
>
> keep in mind that the big win we get from writer()/reader() is that writer() can remain pointing to one node in a Galera cluster, and reader() can point to the cluster as a whole. reader() by default should definitely refer to the cluster as a whole, that is, "use slave".
>
> As for issue #3, galera cluster is synchronous replication. Slaves don't get "behind" at all. So to the degree that we need to transparently support some other kind of master/slave where slaves do get behind, perhaps there would be a reader(synchronous_required=True) kind of thing; based on configuration, it would be known that "synchronous" either means we don't care (using galera) or that we should use the writer (an asynchronous replication scheme).

This sounds like the crux of the matter to me. After some (admittedly cursory) reading, it seems that galera can use both synchronous and asynchronous replication. Up until Friday I had only ever considered synchronous replication, which would not be a problem. I think opportunistically using synchronous slaves whenever possible could only be a win. Are there any unpleasant practicalities which might mean this isn't the case?

However, it sounds to me like there is at least some OpenStack deployment in production using asynchronous slaves, otherwise the issue of 'getting behind' wouldn't have come up. We need to understand:

* Are people actually using asynchronous slaves?
* If so, why did they choose to do that, and
* what are they using them for?

> All of this points to the fact that I really don't think the directives / flags should say anything about which specific database to use; using a "slave" or not due to various concerns is dependent on backend implementation and configuration. The purpose of reader
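[Editor's note] The writer()/reader() split described above can be sketched as a toy routing facade. This is not oslo.db's actual enginefacade implementation — just the routing idea: writes pin to a single node, reads may spread across the cluster, and a synchronous_required reader falls back to the writer when the slaves may be behind:

```python
import itertools


class EngineFacade:
    """Toy sketch: route write transactions to a single master node and
    spread read-only transactions across the whole cluster."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over every node, master included ("the cluster as
        # a whole").
        self._readers = itertools.cycle([master] + list(slaves))

    def writer(self):
        return self.master

    def reader(self, synchronous_required=False):
        # With asynchronous replication, a caller that cannot tolerate
        # stale data falls back to the writer; with Galera's synchronous
        # replication the flag would simply not matter.
        if synchronous_required:
            return self.master
        return next(self._readers)
```

The point of the flag-based design is visible here: the calling code states its consistency requirement, and the backend configuration decides which node that maps to.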
[openstack-dev] [oslo.db][nova] Deprecating use_slave in Nova
At some point in the near future, hopefully early in L, we're intending to update Nova to use the new database transaction management in oslo.db's enginefacade.

Spec: http://git.openstack.org/cgit/openstack/oslo-specs/plain/specs/kilo/make-enginefacade-a-facade.rst
Implementation: https://review.openstack.org/#/c/138215/

One of the effects of this is that we will always know when we are in a read-only transaction, or a transaction which includes writes. We intend to use this new contextual information to make greater use of read-only slave databases. We are currently proposing that if an admin has configured a slave database, we will use the slave for *all* read-only transactions. This would make the use_slave parameter passed to some Nova apis redundant, as we would always use the slave where the context allows.

However, using a slave database has a potential pitfall when mixed with separate write transactions. A caller might currently:

1. start a write transaction
2. update the database
3. commit the transaction
4. start a read transaction
5. read from the database

The client might expect data written in step 2 to be reflected in data read in step 5. I can think of 3 cases here:

1. A short-lived RPC call is using multiple transactions

This is a bug which the new enginefacade will help us eliminate. We should not be using multiple transactions in this case. If the reads are in the same transaction as the write: they will be on the master, they will be consistent, and there is no problem. As a bonus, lots of these will be race conditions, and we'll fix at least some.

2. A long-lived task is using multiple transactions between long-running sub-tasks

In this case, for example creating a new instance, we genuinely want multiple transactions: we don't want to hold a database transaction open while we copy images around. However, I can't immediately think of a situation where we'd write data, then subsequently want to read it back from the db in a read-only transaction.
I think we will typically be updating state, meaning it's going to be a succession of write transactions.

3. Separate RPC calls from a remote client

This seems potentially problematic to me. A client makes an RPC call to create a new object. The client subsequently tries to retrieve the created object, and gets a 404.

Summary:

1 is a class of bugs which we should be able to find fairly mechanically through unit testing.
2 probably isn't a problem in practice?
3 seems like a problem, unless consumers of cloud services are supposed to expect that sort of thing.

I understand that slave databases can occasionally get very behind. How behind is this in practice? How do we use use_slave currently? Why do we need a use_slave parameter passed in via rpc, when it should be apparent to the developer whether a particular task is safe for out-of-date data? Any chance they have some kind of barrier mechanism? e.g. block until the current state contains transaction X.

General comments on the usefulness of slave databases, and the desirability of making maximum use of them?

Thanks,

Matt
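[Editor's note] Case 3's failure mode can be made concrete with a toy model of an asynchronous slave. This is illustrative only — real replication lag is time-based, whereas here it is applied explicitly — but it shows the client's surprise 404 after a successful write:

```python
class AsyncReplicaDB:
    """Toy model: writes go to the master immediately; the slave only
    sees them after apply_replication() runs (standing in for lag)."""

    def __init__(self):
        self.master = {}
        self.slave = {}

    def write(self, key, value):
        self.master[key] = value

    def read(self, key, use_slave=True):
        db = self.slave if use_slave else self.master
        if key not in db:
            raise KeyError(key)  # the client's surprise 404
        return db[key]

    def apply_replication(self):
        # The "barrier" Matt asks about would amount to blocking here
        # until the slave has caught up to a given transaction.
        self.slave.update(self.master)
```

A client that writes via one RPC call and reads via another sees the row on the master but not, until replication catches up, on the slave.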
Re: [openstack-dev] [oslo] [nova] [all] potential enginefacade adjustment - can everyone live with this?
On 23/01/15 19:47, Mike Bayer wrote:
> Doug Hellmann d...@doughellmann.com wrote:
>> On Fri, Jan 23, 2015, at 12:49 PM, Mike Bayer wrote:
>>> Mike Bayer mba...@redhat.com wrote:
>>>> Ihar Hrachyshka ihrac...@redhat.com wrote:
>>>>> On 01/23/2015 05:38 PM, Mike Bayer wrote:
>>>>>> Doug Hellmann d...@doughellmann.com wrote:
>>>>>>> We put the new base class for RequestContext in its own library because both the logging and messaging code wanted to influence its API. Would it make sense to do this database setup there, too?
>>>>>>
>>>>>> whoa, where's that? is this an oslo-wide RequestContext class? that would solve everything b.c. right now every project seems to implement RequestContext themselves.
>>>
>>> so Doug - how does this "influence of API" occur? would oslo.db import oslo_context.context and patch onto RequestContext at that point? Or the other way around? Or… ?
>>
>> No, it's a social thing. I didn't want dependencies between oslo.messaging and oslo.log, but the API of the context needs to support use cases in both places. Your case might be different, in that we might need to actually have oslo.context depend on oslo.db in order to call some setup code. We'll have to think about whether that makes sense and what other dependencies it might introduce between the existing users of oslo.context.
>
> hey Doug - for the moment, I have oslo_db.sqlalchemy.enginefacade applying its descriptors at import time onto oslo_context:
>
> https://review.openstack.org/#/c/138215/30/oslo_db/sqlalchemy/enginefacade.py
> https://review.openstack.org/gitweb?p=openstack/oslo.db.git;a=blob;f=oslo_db/sqlalchemy/enginefacade.py;h=3f76678a6c9788f62288c8fa5ef520db8dff2c0a;hb=bc33d20dc6db2f8e5f8cb02b4eb5f97d24dafb7a#l692
> https://review.openstack.org/gitweb?p=openstack/oslo.db.git;a=blob;f=oslo_db/sqlalchemy/enginefacade.py;h=3f76678a6c9788f62288c8fa5ef520db8dff2c0a;hb=bc33d20dc6db2f8e5f8cb02b4eb5f97d24dafb7a#l498

May I suggest that we decouple these changes by doing both?
Oslo's RequestContext object can have the enginefacade decorator applied to it, so any project which uses it doesn't have to apply it themselves. Meanwhile, the decorator remains part of the public api for projects not using the oslo RequestContext. I suggest that we'll probably stay with a decorator within oslo, anyway. It doesn't lend itself well to a Mixin, as we need a reference to a specific _TransactionContextManager, and moving code directly into RequestContext would be a very invasive coupling.

Matt
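[Editor's note] A toy sketch of why the decorator form is preferred here: the decorator closes over a specific manager instance, which a mixin on RequestContext could not easily do without coupling the context class to that instance. None of this is oslo.db's real code; the names mirror the thread's terminology only:

```python
class _TransactionContextManager:
    """Toy stand-in for oslo.db's transaction manager."""

    def __init__(self, mode):
        self.mode = mode

    def __call__(self, func):
        # The decorator needs a reference to *this* manager instance
        # (reader vs writer), which is why baking the behaviour into
        # RequestContext itself would be an invasive coupling.
        def wrapper(context, *args, **kwargs):
            context.session = "%s-session" % self.mode
            try:
                return func(context, *args, **kwargs)
            finally:
                context.session = None
        return wrapper


reader = _TransactionContextManager("reader")
writer = _TransactionContextManager("writer")


class RequestContext:
    session = None


@reader
def get_instance(context, uuid):
    # Inside the decorated function, the context carries a live
    # (here, fake) session for the duration of the transaction.
    return (context.session, uuid)
```

The same decorator objects can be applied to any project's context class, which is the decoupling Matt suggests: oslo's RequestContext can ship pre-decorated, while other projects keep using the decorators directly.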
Re: [openstack-dev] [oslo.db][nova] NovaObject.save() needs its own DB transaction
back to the justification for removing that facility would help us understand the challenges we face going forward?

I don't know Boris. I can reach out to him if you think we need his input. However, I think the challenges we have today are fairly clear.

[1] At a slight tangent, this looks like an artifact of some premature generalisation a few years ago. It seems unlikely that anybody is going to rewrite the db api using an ORM other than sqlalchemy, so we should probably ditch it and promote it to db/api.py.

We've had a few people ask about it, in terms of rewriting some or all of our DB API to talk to a totally non-SQL backend. Further, AFAIK, RAX rewrites a few of the DB API calls to use raw SQL queries for performance (or did, at one point).

I'm quite happy to have the implementation of Instance.save() make use of primitives to ensure atomicity where appropriate. I don't think that's something that needs or deserves generalization at this point, and I'm not convinced it needs to be in the save method itself. Right now we update several things atomically by passing something to db/api that gets turned into properly-related SQLA objects. I think we could do the same for any that we're currently cascading separately, even if the db/api update method uses a transaction to ensure safety.

As I mention above, the problem with this approach is the separation of concerns between NovaObjects and the db api. E.g. InfoCache jsonifies its data in save() before sending it to the db. If we move that logic to db.instance_update_and_get_original(), how do we ensure the same logic applies when we save the info cache directly? It's certainly achievable, but it's just adding to the mess. My proposal is safe, efficient, and simple.
Matt