[vdsm] Fedora, udev and nic renaming
Hi,

We are currently working on stabilizing the networking part of vdsm in Fedora 18 and, to achieve that purpose, we decided to test it in both physical hosts and, for extra convenience and better support, also in VMs. Due to the move of Fedora 17 and 18 to systemd and newer udev versions, we encountered some issues that should be noted and worked on to provide our users with a hassle-free experience.

First of all, let me state what happens (in renaming) in RHEL-6.x when a new ethernet device is handled by udev:

a) One or more udev rules match the characteristics of the interface: the last matching rule is applied.

b) No rule matches: /lib/udev/write-net-rules writes a permanent rule using the MAC address of the interface in a udev rules file, so the interface name will be permanent and in the ethX namespace.

In Fedora 17 (but even more so in F18), with the move to a newer version of udev and, especially, with the change from SysV init to systemd, the mechanism changed. Since systemd makes the boot happen in a parallelized way, some changes had to be enforced in udev to keep the renaming working:

- To avoid naming collisions, it was decided to use Dell's biosdevname software to retrieve a device name for the network interfaces: typically emX for onboard nics and pXpY for PCI-connected nics.

- For devices about which biosdevname could not provide any information, it was agreed to assign a name in the ethX space in a first-come, first-served fashion.

- Optionally, one could define the interface MAC address in an ifcfg file, and /lib/udev/rename-device would look into the ifcfg file and assign the device name set there (I have not yet succeeded in that part; I have to investigate more, I guess).

As you can see, biosdevname never reports names in the eth namespace; this avoids a collision with a potential parallel discovery of an interface it cannot recognize, to which the kernel could already have assigned a BIOS-reported name.

For physical machines this approach works fine. However, for virtual machines with more than one nic, the automatic process described above presents some issues. Biosdevname, due to the different ways the virtualization hypervisors report the vnics, dropped support for VMs in 0.3.7 (F18 uses 0.4.1-2) and decided that on VMs it would just return 4, telling udev to use the kernel's first-come, first-served names for those interfaces (ethX namespace).

The issue with using first-come, first-served is that, with the highly parallelized boot we now have, it is very common to find that the names of your devices (as identified by MAC address) suffer a permutation upon each reboot. Here you can see an example:

NOTE: The libvirt dump of the VM reports the same PCI address for each interface across reboots.
Boot 0 (Nov 13th 14:59)

    eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:54:85:57  txqueuelen 1000  (Ethernet)
    eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:77:45:6b  txqueuelen 1000  (Ethernet)
    eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:ca:41:c7  txqueuelen 1000  (Ethernet)
    eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:f5:3d:c8  txqueuelen 1000  (Ethernet)
    eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:5e:10:76  txqueuelen 1000  (Ethernet)
    eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:95:00:93  txqueuelen 1000  (Ethernet)

Boot 1 (Nov 13th 15:01)

    eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:ca:41:c7  txqueuelen 1000  (Ethernet)
    eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:54:85:57  txqueuelen 1000  (Ethernet)
    eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:77:45:6b  txqueuelen 1000  (Ethernet)
    eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:f5:3d:c8  txqueuelen 1000  (Ethernet)
    eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:5e:10:76  txqueuelen 1000  (Ethernet)
    eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  ether 52:54:00:95:00:93  txqueuelen 1000  (Ethernet)

As you can see, after rebooting:

    eth0 -> eth1
    eth1 -> eth2
    eth2 -> eth0

This is an issue if different vnics are connected to different networks or for whatever reason require distinct configuration. To solve this issue, on the guest there are three options:

- Assign somebody with BIOS knowledge to add KVM guest support to biosdevname so we can use the emX/pXpY namespace and maintain a native-like experience in the VMs. My intuition is that it could just report pXpY, where X is bus and Y is slot. This is the preferred option.

- Use libguestfs for setting udev rules using the MAC addresses we know from the VM definition in the netX namespace (I have been told that it is not
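For illustration, the persistent naming mechanisms described above are plain text files. A MAC-pinning rule of the kind /lib/udev/write-net-rules generates on RHEL-6.x would look roughly like this (the MAC and name are taken from the example above; the exact attribute set varies between udev versions):

    # /etc/udev/rules.d/70-persistent-net.rules (sketch)
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="52:54:00:54:85:57", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

And the ifcfg-based renaming consulted by /lib/udev/rename-device keys on the hardware address; a minimal sketch, assuming the standard initscripts layout:

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
    DEVICE=eth0
    HWADDR=52:54:00:54:85:57
    ONBOOT=yes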
Re: [vdsm] Fedora, udev and nic renaming
Thanks for this verbose description. I don't think using libguestfs is the solution for this. Fixing qemu to accept a BIOS interface name in the -net parameter is preferable. I don't think we should expose the interface as a PCI device, as it will have some drawbacks, but attempt to use the onboard convention.

Alon

- Original Message -
From: Antoni Segura Puimedon asegu...@redhat.com
To: vdsm-devel@lists.fedorahosted.org
Sent: Tuesday, December 4, 2012 11:08:31 AM
Subject: [vdsm] Fedora, udev and nic renaming

[...]
[vdsm] API.py validation
Hi all,

I am currently working on adding a new feature to vdsm which requires a new entry point, thus requiring:

- Parameter definitions in vdsm_api/vdsmapi-schema.json
- Implementation and checks in vdsm/API.py and other modules.

Typically, we check for the presence/absence of required/optional parameters in API.py using utils.validateMinimalKeySet or just if/else clauses. I think this process could benefit from a more automatic and less duplicated effort, i.e., parsing vdsmapi-schema.json in a similar way as process-schema.py does, to build a memoized method that is able to check whether an API call is correct according to the API definitions. A very good side effect would be that this would keep us from forgetting to update the schema.

Best regards,
Toni
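To make the idea concrete, here is a minimal sketch of such a checker. It assumes the schema has already been parsed (e.g. by the same machinery process-schema.py uses) into a dict mapping command name to its parameter spec, with optional parameters carrying a leading '*' as in the schema file; none of this is existing vdsm code:

    _spec_cache = {}

    def command_params(parsed_schema, command):
        # Memoize the required/allowed parameter split for each command.
        if command not in _spec_cache:
            spec = parsed_schema[command].get('data', {})
            required = frozenset(k for k in spec if not k.startswith('*'))
            allowed = frozenset(k.lstrip('*') for k in spec)
            _spec_cache[command] = (required, allowed)
        return _spec_cache[command]

    def validate_call(parsed_schema, command, args):
        # Raise if a mandatory parameter is missing or an unknown one
        # was passed; meant to replace the ad-hoc checks in API.py.
        required, allowed = command_params(parsed_schema, command)
        missing = required - set(args)
        unknown = set(args) - allowed
        if missing or unknown:
            raise ValueError('%s: missing %s, unknown %s'
                             % (command, sorted(missing), sorted(unknown)))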
[vdsm] VDSM tasks, the future
Because I started hinting about how VDSM tasks are going to look going forward, I thought it's better I just write everything in an email so we can talk about it in context. This is not set in stone and I'm still debating things myself, but it's very close to being done.

- Everything is asynchronous. The nature of message-based communication is that you can't have synchronous operations. This is not really debatable because it's just how TCP\AMQP\messaging works.

- Task IDs will be decided by the caller. This is how json-rpc works and also makes sense because now the engine can track the task without needing a stage where we give it the task ID back. IDs are reusable as long as no one else is using them at the time, so they can be used for synchronizing operations between clients (making sure a command is only executed once on a specific host, without locking).

- Tasks are transient. If VDSM restarts, it forgets all the task information. There are 2 ways to have persistent tasks:
  1. The task creates an object that you can continue work on in VDSM. The new storage does that by the fact that copyImage() returns once the target volume has been created but before the data has been fully copied. From that moment on, the state of the copy can be queried from any host using getImageStatus(), and the specific copy operation can be queried with getTaskStatus() on the host performing it. After VDSM crashes, depending on policy, either VDSM will create a new task to continue the copy, or someone else will send a command to continue the operation, and that will be a new task.
  2. VDSM tasks just start other operations, trackable not through the task interface. For example Gluster: gluster.startVolumeRebalance() will return once it has been registered with Gluster. gluster.getOperationStatuses() will return the state of the operation from any host. Each call is a task in itself.

- No task tags. They are silly, and the caller can mangle whatever into the task ID if he really wants to tag tasks.

- No explicit recovery stage. VDSM will be crash-only; there should be efforts to make everything crash-safe. If that is problematic, as in the case of networking, VDSM will recover on start without having a task for it.

- No clean Task: Tasks can be started by any number of hosts, which means that there is no way to own all tasks. There could be cases where VDSM starts tasks on its own, and thus they have no owner at all. The caller needs to continually track the state of VDSM. We will have broadcast events to mitigate polling.

- No revert. Impossible to implement safely.

- No SPM\HSM tasks. SPM\SDM is no longer necessary for all domain types (only for some). What used to be SPM tasks, or tasks that persist and can be restarted on other hosts, is talked about in previous bullet points.
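As a sketch of what the caller-decided IDs from the second bullet look like on the wire, here is an illustrative json-rpc request built in Python (the method name and params are made up for the example; only the framing is json-rpc 2.0):

    import json
    import uuid

    # The caller mints the ID itself; any string unused by other current
    # callers works, so a UUID is a safe default. The engine can then
    # track (or deduplicate) the task without a round-trip to get an ID.
    request = {
        'jsonrpc': '2.0',
        'id': str(uuid.uuid4()),   # caller-decided task ID
        'method': 'Image.copy',    # illustrative method name
        'params': {'imageID': 'some-image-uuid'},
    }
    wire_data = json.dumps(request)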
[vdsm] link state semantics
Hi list!

We are working on the new 3.2 feature for adding support for updating VM devices, more specifically at the moment network devices. There is one point of the design on which there is not yet consensus, and we need to agree on a proper and clean design that satisfies us all.

My current proposal, as reflected by patch http://gerrit.ovirt.org/#/c/9560/5/vdsm_api/vdsmapi-schema.json and its parent, is to have a linkActive boolean that is true for link status 'up' and false for link status 'down'. We want to support a none (dummy) network that is used to dissociate vnics from any real network. The semantics, as you can see in the patch, are that unless you specify a network, updateDevice will place the interface on that dummy network.

However, Adam Litke argues that not specifying a network should keep the vnic on the network it is currently on, since network is an optional parameter, and 'linkActive' is also optional and has this preserve-current-state semantics. I can certainly see the merit of what Adam proposes, and the implementation would be that linkActive becomes an enum, like so:

    {'enum': 'linkState' /* or linkActive */, 'data': ['up', 'down', 'disconnected']}

With this change, the network would only be changed if one different from the current one is specified, and the vnic would be taken to the dummy bridge when linkState is set to 'disconnected'.

There is also an objection, raised by Adam, about the semantics of portMirroring. The current behavior from my patch is:

    portMirroring is None or is not set -> No action taken.
    portMirroring = []                  -> No action taken.
    portMirroring = [a,b,z]             -> Set port mirroring for nets a, b and z to the specified vnic.

His proposal is:

    portMirroring is None or is not set -> No action taken.
    portMirroring = []                  -> Unset port mirroring for the vnic on which it is currently set.
    portMirroring = [a,b,z]             -> Set port mirroring for nets a, b and z to the specified vnic.

I would really welcome comments on this so we finally have an agreement on the API for this feature.

Best,
Toni
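To make the difference between the two link-state semantics concrete, here is a hypothetical sketch; DUMMY_NET stands for whatever the dummy bridge ends up being called, and the vnic object and its fields are invented for illustration:

    DUMMY_NET = 'dummy'  # placeholder name, assumption

    def update_nic_boolean(vnic, network=None, linkActive=None):
        # Current proposal: omitting 'network' moves the vnic to the
        # dummy network; linkActive is a plain boolean.
        vnic.network = network if network is not None else DUMMY_NET
        if linkActive is not None:
            vnic.link_up = linkActive

    def update_nic_enum(vnic, network=None, linkState=None):
        # Adam's proposal: omitted parameters preserve current state;
        # the dummy network is reached via linkState='disconnected'.
        if network is not None:
            vnic.network = network
        if linkState == 'disconnected':
            vnic.network = DUMMY_NET
        elif linkState in ('up', 'down'):
            vnic.link_up = (linkState == 'up')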
Re: [vdsm] Back to future of vdsm network configuration
- Original Message -
From: Itamar Heim ih...@redhat.com
To: Dan Kenigsberg dan...@redhat.com
Cc: Alon Bar-Lev alo...@redhat.com, VDSM Project Development vdsm-devel@lists.fedorahosted.org, Simon Grinberg si...@redhat.com, Andrew Cathrow acath...@redhat.com
Sent: Monday, December 3, 2012 10:56:53 PM
Subject: Re: [vdsm] Back to future of vdsm network configuration

On 12/03/2012 06:54 PM, Dan Kenigsberg wrote:
On Mon, Dec 03, 2012 at 04:28:16PM +0200, Itamar Heim wrote:
On 12/03/2012 04:25 PM, Dan Kenigsberg wrote:
On Mon, Dec 03, 2012 at 04:35:34AM -0500, Alon Bar-Lev wrote:

- Original Message -
From: Mark Wu wu...@linux.vnet.ibm.com
To: VDSM Project Development vdsm-devel@lists.fedorahosted.org
Cc: Alon Bar-Lev alo...@redhat.com, Dan Kenigsberg dan...@redhat.com, Simon Grinberg si...@redhat.com, Antoni Segura Puimedon asegu...@redhat.com, Igor Lvovsky ilvov...@redhat.com, Daniel P. Berrange berra...@redhat.com
Sent: Monday, December 3, 2012 7:39:49 AM
Subject: Re: [vdsm] Back to future of vdsm network configuration

On 11/29/2012 04:24 AM, Alon Bar-Lev wrote:

- Original Message -
From: Dan Kenigsberg dan...@redhat.com
To: Alon Bar-Lev alo...@redhat.com
Cc: Simon Grinberg si...@redhat.com, VDSM Project Development vdsm-devel@lists.fedorahosted.org
Sent: Wednesday, November 28, 2012 10:20:11 PM
Subject: Re: [vdsm] MTU setting according to ifcfg files.

On Wed, Nov 28, 2012 at 12:49:10PM -0500, Alon Bar-Lev wrote:

Itamar threw a bomb that we should co-exist on a generic host; this is something I do not know how to compute. I am still waiting for a response on where this requirement came from and whether it is mandatory.

This bomb has been ticking since ever. We have ovirt-node images for pure hypervisor nodes, but we support plain Linux nodes, where local admins are free to `yum upgrade` at the least convenient moment. The latter mode can be the stuff that nightmares are made of, but it also allows the flexibility and bleeding-edgeness we all cherish.

There is a difference between having a generic OS and having a generic setup: running your email server, file server and LDAP on a node that is running VMs. I have no problem with having a generic OS (as opposed to ovirt-node) but I want full control over that.

Alon.

Can I say that we have got agreement that oVirt should cover two kinds of hypervisors? A stateless slave is good for pure, normal virtualization workloads, while a generic host can keep the flexibility of customization. In my opinion, it's good for the oVirt community to provide choices for users. They could customize it in production, in builds and even in source code according to their requirements and skills.

I also think it will be good to support both modes!

It would also be good if we could rule the world! :) Now seriously... :) If we want to ever have a working solution we need to focus, dropping wishful requirements in favour of the minimum required that will allow us to reach a stable milestone.

Having a good clean interface for vdsm network within the stateless mode will allow a persistent implementation to exist even if the whole implementation of master and vdsm assumes stateless. This kind of implementation will get a new state from master, compare it to whatever exists on the host, and sync. I, of course, will be against investing resources in such a network management plugin approach... but it is doable, and my vote is not something that you cannot safely ignore.

I cannot say that I do not fail to parse English sentences with double or triple negations...
I'd like to see an API that lets us define a persistent initial management interface, and create volatile network devices during runtime. I'd love to see a define/create distinction, as libvirt has.

How about keeping our current setupNetwork API, with a minor change to its semantics - it would not persist anything. A new persistNetwork API would be added, intended to persist the management network after it has been tested. On boot, only the management definitions would show up, and Engine (or a small local service on top of vdsm) would push the complete configuration.

How does this benefit over loading the last config, and then having engine refresh (always/if needed)?

It's clearer for the local admin: if it's on the file system, it would be there after boot; he can do his worst to them, and we'd try to manage. Also, it is easier to recover from utterly-horrible remote commands which had rendered our host incommunicado: the management interface used to send these commands -- and only it -- would show up after boot. This increases the probability that after fencing, we'd see the host again.

I think we mentioned this before, but this will kill any way to have hosts come back to
Re: [vdsm] object instancing in the new VDSM API
Thanks for your detailed response...

On Mon, Dec 03, 2012 at 09:26:34PM -0500, Saggi Mizrahi wrote:
> So from what I gather, the only thing that is bothering you is that
> storage operations require a lot of IDs. I get that, I hate that too.
> It doesn't change the point that it was designed that way. Even if you
> deem some use cases irrelevant, it wouldn't change the fact that this
> is how people use it now. And because we are going to throw it away as
> soon as we can, there is no reason to shape our API around that.

In that case, I want to throw away the bad architecture along with the bad API. In the future I would like to see:

1) Objects can be uniquely identified by a single UUID. This means you would not be able to reuse the same UUID on a different host/domain unless you are talking about the same object (i.e. move image). If we think this is going to be a problem, let's discuss the specific use cases.

2) Verbs should not have non-obvious preconditions or overloaded semantics (basically, we need to get rid of the issues with storage pools and images that you explain below).

> So from what I gather we agree on instancing.

Sure. I am willing to adopt namespaces instead of instances as long as the above is adhered to in the new design. I do have to ask again: when do you think the new storage stuff will be ready for serious review, testing and consideration for merging? I would be happy to spend a significant amount of time helping out with this if the end result has us closing on this 2+ year endeavor :)

> ---
> From this moment on I'm going to try my best to explain how VDSM
> storage currently works. It is filled with misdirection and bad
> design. I hope that after that you will understand why you can't pack
> all the IDs together.
>
> Let's start with the storage pool. Because it was simpler to have all
> metadata-changing operations run on the same host, someone needed to
> find a way to make cross-domain operations work on the same host. The
> solution was to bind them all to a single entity called the storage
> pool and have a single lock. The point was to have a host be able to
> connect to multiple pools at a time. Due to bad code (that could
> easily not have been so bad) the multiple-pools feature was never
> implemented. Because the single lock to rule them all doesn't really
> work when you want to secure domains, we had to add more locks, making
> the pool concept obsolete. This means that you can trust VDSM to only
> be connected to a single pool at a time, and that if you want to
> change anything you can just remove the pool arg.

Is there a reason that vdsm doesn't automatically connect to the pool noted in the master storage domain? It's fine if it doesn't become SPM, but it would be nice to reduce the number of steps required to bring storage back up after a reboot. Also, do you see significant changes in the storage-domain-related verbs? I guess we will remove the attach/detach/activate/deactivate verbs since storage pools are going away.

> Let's go to volumes and images. Contrary to its name, imgUUID does not
> represent an image. It's actually a tag given to part of a chain. This
> is commonly used to differentiate between parts of the chain
> responsible for VM images and templates. Due to bad code, a lot of the
> possible combinations are not supported, but that is the intention.
> imgUUID being a tag means that it serves 3 purposes depending on the
> verb that uses it.
>
> 1) In some verbs it is used as a useless sanity check, to make sure
> the volume is tagged with this sdUUID.
> This, I imagine, was done because someone didn't fully comprehend how
> and why you do sanity checks. It means that in some verbs you can just
> remove it (if you are actually changing anything).
>
> 2) In some verbs it's meant to distinguish the volume from its
> original chain (creating a template). At that point it's actually
> being invented anew by the caller.
>
> 3) Operations that act on the whole chain; if volUUID is there, it is
> for the same useless sanity check and can be removed.
>
> What you need to get out of this is that most of the time you can use
> fewer IDs just by removing useless imgUUID or volUUID args.
> Furthermore, you need to understand that they are not hierarchical.
> imgUUID is a tag on the volume, similar to user for a file. As for
> domain IDs: the caller can choose to reuse imgUUIDs and volUUIDs on
> different domains, and some flows actually depend on that.
>
> To make things simpler, some verbs should be split up so that how you
> specify the target volID doesn't affect the actual command. This means
> that copyImage() and createTemplate() should be split to:
>
>     copyImage(dstDomain, srcDomain, imgUUID)
>     createTemplate(dstDomain, dstImgUUID, srcDomain, srcImgUUID)
>
> That being said, I'm personally still against an indeterminate storage
> API because of engine adoption problems. But if you want to fix the
> current interface: packing up the IDs to a single ID wouldn't work and
> is
Re: [vdsm] Back to future of vdsm network configuration
On 12/04/2012 07:49 PM, Simon Grinberg wrote:

[...]

I think we mentioned this before, but this will kill any way to have hosts come back to life, also have a policy on connecting to storage, even if engine is still down. (One of these use cases is for the engine itself to be hosted on the hosts as well.)

For this use case you'll need much
Re: [vdsm] VDSM tasks, the future
On Tue, Dec 04, 2012 at 10:35:01AM -0500, Saggi Mizrahi wrote:
> Because I started hinting about how VDSM tasks are going to look going
> forward, I thought it's better I just write everything in an email so
> we can talk about it in context. This is not set in stone and I'm
> still debating things myself, but it's very close to being done.

Don't debate them yourself, debate them here! Even better, propose your idea in schema form to show how a command might work exactly.

> - Everything is asynchronous. The nature of message-based
> communication is that you can't have synchronous operations. This is
> not really debatable because it's just how TCP\AMQP\messaging works.

Can you show how a traditionally synchronous command might work? Let's take Host.getVmList as an example.

> - Task IDs will be decided by the caller. This is how json-rpc works
> and also makes sense because now the engine can track the task without
> needing a stage where we give it the task ID back. IDs are reusable as
> long as no one else is using them at the time, so they can be used for
> synchronizing operations between clients (making sure a command is
> only executed once on a specific host without locking).
>
> - Tasks are transient. If VDSM restarts, it forgets all the task
> information. There are 2 ways to have persistent tasks:
>   1. The task creates an object that you can continue work on in VDSM.
>   The new storage does that by the fact that copyImage() returns once
>   the target volume has been created but before the data has been
>   fully copied. From that moment on, the state of the copy can be
>   queried from any host using getImageStatus(), and the specific copy
>   operation can be queried with getTaskStatus() on the host performing
>   it. After VDSM crashes, depending on policy, either VDSM will create
>   a new task to continue the copy or someone else will send a command
>   to continue the operation, and that will be a new task.
>   2. VDSM tasks just start other operations, trackable not through the
>   task interface. For example Gluster: gluster.startVolumeRebalance()
>   will return once it has been registered with Gluster.
>   gluster.getOperationStatuses() will return the state of the
>   operation from any host. Each call is a task in itself.

I worry about this approach because every command has a different semantic for checking progress. For migration, we have to check VM status on the src and dest hosts. For image copy we need to use a special status call on the dest image. It would be nice if there was a unified method for checking on an operation. Maybe that can be completion events:

    Client:                        vdsm:
    -------                        -----
    Image.copy(...)       ---->
                          <----    Operation Started
    Wait for event ...
                          <----    Event: Operation <id> done <code>

For an early error:

    Client:                        vdsm:
    -------                        -----
    Image.copy(...)       ---->
                          <----    Error: <code>

> - No task tags. They are silly, and the caller can mangle whatever
> into the task ID if he really wants to tag tasks.

Yes. Agreed.

> - No explicit recovery stage. VDSM will be crash-only; there should be
> efforts to make everything crash-safe. If that is problematic, as in
> the case of networking, VDSM will recover on start without having a
> task for it.

How does this work in practice for something like creating a new image from a template?

> - No clean Task: Tasks can be started by any number of hosts, which
> means that there is no way to own all tasks. There could be cases
> where VDSM starts tasks on its own, and thus they have no owner at
> all. The caller needs to continually track the state of VDSM. We will
> have broadcast events to mitigate polling.
If a disconnected client might have missed a completion event, it will need to check state. This means each async operation that changes state must document a procedure for checking the progress of a potentially ongoing operation. For Image.copy, that procedure would be to look up the new image and check its state.

> - No revert. Impossible to implement safely.

How do the engine folks feel about this? I am ok with it :)

> - No SPM\HSM tasks. SPM\SDM is no longer necessary for all domain
> types (only for some). What used to be SPM tasks, or tasks that
> persist and can be restarted on other hosts, is talked about in
> previous bullet points.

A nice simplification.

--
Adam Litke a...@us.ibm.com
IBM Linux Technology Center
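To illustrate the kind of per-verb procedure this implies, here is a client-side sketch for Image.copy under the proposed model. The verb and the optimized/degraded/broken statuses come from the new-storage discussion in this thread; the connection object and the shape of the return value are assumptions:

    def ensure_copy_finished(vdsm, repo_id, new_image_id):
        # After reconnecting, don't wait for a completion event that may
        # already have fired; query the image state directly instead.
        status = vdsm.getImageStatus(repo_id, new_image_id)
        if status['state'] == 'broken':
            raise RuntimeError('copy failed; image is unusable')
        # 'optimized' or 'degraded' both mean the image is usable.
        return status['state']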
Re: [vdsm] API.py validation
On Tue, Dec 04, 2012 at 08:43:11AM -0500, Antoni Segura Puimedon wrote:
> [...]
> A very good side effect would be that this would keep us from
> forgetting to update the schema.

Yes, this is a good idea. I do want to add some checking. For now, the best place to add it would probably be in the DynamicBridge class, which dispatches json-rpc calls to the correct internal methods. Unfortunately this would exclude the xmlrpc api from the automatic checking. I guess that's ok since xmlrpc will be going away.

--
Adam Litke a...@us.ibm.com
IBM Linux Technology Center
Re: [vdsm] link state semantics
On Tue, Dec 04, 2012 at 12:32:34PM -0500, Antoni Segura Puimedon wrote:
> [...]
> I would really welcome comments on this to have finally an agreement
> on the API for this feature.

+1 to the updated proposal. Is there any better way to do it?

--
Adam Litke a...@us.ibm.com
IBM Linux Technology Center
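For the record, the two portMirroring semantics on the table differ only in the empty-list case; a sketch with invented names (nothing here is vdsm code):

    def apply_port_mirroring(vnic, portMirroring=None, unset_on_empty=False):
        # None / not set: no action in both proposals.
        if portMirroring is None:
            return
        # []: no action in the current patch; with Adam's proposal
        # (unset_on_empty=True) it removes mirroring from this vnic.
        if not portMirroring:
            if unset_on_empty:
                vnic.mirrored_networks = []
            return
        # [a, b, z]: mirror those networks to the vnic in both proposals.
        vnic.mirrored_networks = list(portMirroring)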
[vdsm] RFC: New Storage API
I've been throwing a lot of bits out about the new storage API and I think it's time to talk a bit. I will purposefully try to keep implementation details away and concentrate on how the API looks and how you use it.

The first major change is in terminology: there is no longer a storage domain but a storage repository. This change is made because so many things are already called domain in the system, and this will make things less confusing for newcomers with a libvirt background. One other change is that repositories no longer have a UUID. The UUID was only used in the pool members manifest and is no longer needed.

connectStorageRepository(repoId, repoFormat, connectionParameters={}):

    repoId - a transient name that will be used to refer to the connected domain; it is not persisted and doesn't have to be the same across the cluster.
    repoFormat - similar to what used to be type (e.g. localfs-1.0, nfs-3.4, clvm-1.2).
    connectionParameters - format specific; will be used to tell VDSM how to connect to the repo.

disconnectStorageRepository(self, repoId):

In the new API there are only images; some images are mutable and some are not. Mutable images are also called VirtualDisks; immutable images are also called Snapshots. There are no explicit templates: you can create as many images as you want from any snapshot. There are 4 major image operations:

createVirtualDisk(targetRepoId, size, baseSnapshotId=None, userData={}, options={}):

    targetRepoId - ID of a connected repo where the disk will be created
    size - the size of the image you wish to create
    baseSnapshotId - the ID of the snapshot you want to base the new virtual disk on
    userData - optional data that will be attached to the new VD; could be anything that the user desires.
    options - options to modify VDSM's default behavior

    Returns the id of the new VD.

createSnapshot(targetRepoId, baseVirtualDiskId, userData={}, options={}):

    targetRepoId - the ID of a connected repo where the new snapshot will be created and where the original image exists as well.
    size - the size of the image you wish to create
    baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want to snapshot
    userData - optional data that will be attached to the new Snapshot; could be anything that the user desires.
    options - options to modify VDSM's default behavior

    Returns the id of the new Snapshot.

copyImage(targetRepoId, imageId, baseImageId=None, userData={}, options={}):

    targetRepoId - the ID of a connected repo where the new image will be created
    imageId - the image you wish to copy
    baseImageId - if specified, the new image will contain only the diff between imageId and baseImageId. If None, the new image will contain all the bits of imageId. This can be used to copy partial parts of images for export.
    userData - optional data that will be attached to the new image; could be anything that the user desires.
    options - options to modify VDSM's default behavior

    Returns the Id of the new image. In case of copying an immutable image, the ID will be identical to the original image as they contain the same data. However, the user should not assume that, and should always use the value returned from the method.

removeImage(repositoryId, imageId, options={}):

    repositoryId - the ID of a connected repo where the image to delete resides
    imageId - the id of the image you wish to delete.

getImageStatus(repositoryId, imageId):

    repositoryId - the ID of a connected repo where the image to check resides
    imageId - the id of the image you wish to check.
All operations return once the operation has been committed to disk, NOT when the operation actually completes. This is done so that:

- Operations come to a stable state as quickly as possible.
- In cases where there is an SDM, only a small portion of the operation actually needs to be performed on the SDM host.
- No matter how many times the operation fails, and on how many hosts, you can always resume the operation and choose when to do it.
- You can stop an operation at any time and remove the resulting object, making a distinction between "stop because the host is overloaded" and "I don't want that image".

This means that after calling any operation that creates a new image, the user must then call getImageStatus() to check what the status of the image is. The status of the image can be either optimized, degraded, or broken. Optimized means that the image is available and you can run VMs off it. Degraded means that the image is available and will run VMs, but there might be a better way VDSM can represent the underlying data. Broken means that the image can't be used at the moment, probably because not all the data has been set up on the volume.

Apart from that, VDSM will also return the last persisted status information, which will contain:

    hostID - the last host to try and optimize or fix the image
    stage - X/Y (e.g. 1/10), the last persisted stage of the fix.
    percent_complete - -1 or 0-100, the
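Putting the verbs together, a client-side flow might look like the sketch below. The verbs and the optimized/degraded/broken statuses are the ones described above; the connection parameters, the shape of the getImageStatus() return value, and the polling cadence are assumptions for illustration:

    import time

    # Transient, caller-chosen repo name (not persisted, per the proposal).
    REPO = 'repo1'

    connectStorageRepository(REPO, 'nfs-3.4',
                             {'server': 'filer', 'export': '/vol'})

    # Returns once the create has been committed to disk, not when the
    # operation actually completes...
    disk_id = createVirtualDisk(REPO, size=10 * 2**30,
                                userData={'name': 'mydisk'})

    # ...so poll getImageStatus() until the image is usable.
    while True:
        status = getImageStatus(REPO, disk_id)
        if status['state'] in ('optimized', 'degraded'):
            break  # usable; 'degraded' may be improved later
        if status['state'] == 'broken':
            raise RuntimeError('disk creation failed')
        time.sleep(2)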
Re: [vdsm] RFC: New Storage API
Thanks for sharing this. It's nice to have something a little more concrete to think about. Just a few comments and questions inline to get some discussion flowing.

On Tue, Dec 04, 2012 at 04:52:40PM -0500, Saggi Mizrahi wrote:
> I've been throwing a lot of bits out about the new storage API and I
> think it's time to talk a bit. I will purposefully try to keep
> implementation details away and concentrate on how the API looks and
> how you use it.
>
> The first major change is in terminology: there is no longer a storage
> domain but a storage repository. This change is made because so many
> things are already called domain in the system, and this will make
> things less confusing for newcomers with a libvirt background. One
> other change is that repositories no longer have a UUID. The UUID was
> only used in the pool members manifest and is no longer needed.
>
> connectStorageRepository(repoId, repoFormat, connectionParameters={}):

We should probably add an options/flags parameter for extension of all new APIs.

> repoId - a transient name that will be used to refer to the connected
> domain; it is not persisted and doesn't have to be the same across the
> cluster.
> repoFormat - similar to what used to be type (e.g. localfs-1.0,
> nfs-3.4, clvm-1.2).
> connectionParameters - format specific; will be used to tell VDSM how
> to connect to the repo.
>
> disconnectStorageRepository(self, repoId):

I assume 'self' is a mistake here. Just want to clarify given all of the recent talk about instances vs. namespaces.

> In the new API there are only images; some images are mutable and some
> are not. Mutable images are also called VirtualDisks; immutable images
> are also called Snapshots.

By mutable you mean writable, right? Or does the word mutable imply more than that?

> There are no explicit templates: you can create as many images as you
> want from any snapshot. There are 4 major image operations:
>
> createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
>                   userData={}, options={}):

Is userData a 'StringMap'? I will reopen the argument about an options dict vs a flags parameter. I oppose the dict for expansion because I think it causes APIs to devolve into a mess where lots of arbitrary and not well thought out overrides are packed into the dict over time. A flags argument (in json and python it can be an enum array) limits us to really switching flags on and off instead of passing arbitrary data.

> targetRepoId - ID of a connected repo where the disk will be created
> size - the size of the image you wish to create
> baseSnapshotId - the ID of the snapshot you want to base the new
> virtual disk on
> userData - optional data that will be attached to the new VD; could be
> anything that the user desires.
> options - options to modify VDSM's default behavior
>
> Returns the id of the new VD.
>
> createSnapshot(targetRepoId, baseVirtualDiskId, userData={}, options={}):
> targetRepoId - the ID of a connected repo where the new snapshot will
> be created and where the original image exists as well.
> size - the size of the image you wish to create

Why is this needed? Doesn't the size of a snapshot have to be equal to its base image?

> baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want to
> snapshot

Can you snapshot a snapshot? In that case, this parameter should be called baseImage.

> userData - optional data that will be attached to the new Snapshot;
> could be anything that the user desires.
> options - options to modify VDSM's default behavior
>
> Returns the id of the new Snapshot.
>
> copyImage(targetRepoId, imageId, baseImageId=None, userData={}, options={})
> targetRepoId - the ID of a connected repo where the new image will be
> created
> imageId - the image you wish to copy

Do we locate the sourceRepoId automatically based on the imageId?

> baseImageId - if specified, the new image will contain only the diff
> between imageId and baseImageId. If None, the new image will contain
> all the bits of imageId. This can be used to copy partial parts of
> images for export.
> userData - optional data that will be attached to the new image; could
> be anything that the user desires.
> options - options to modify VDSM's default behavior
>
> Returns the Id of the new image. In case of copying an immutable image
> the ID will be identical to the original image as they contain the
> same data. However, the user should not assume that and should always
> use the value returned from the method.
>
> removeImage(repositoryId, imageId, options={}):
> repositoryId - the ID of a connected repo where the image to delete
> resides
> imageId - the id of the image you wish to delete.
>
> getImageStatus(repositoryId, imageId)
> repositoryId - the ID of a connected repo where the image to check
> resides
> imageId - the id of the image you wish to check.

What is in this return value? Is it a single enum indicating whether the image is locked (being copied, etc.) or a list of detailed
Re: [vdsm] RFC: New Storage API
- Original Message -
From: Adam Litke a...@us.ibm.com
To: Saggi Mizrahi smizr...@redhat.com
Cc: VDSM Project Development vdsm-devel@lists.fedorahosted.org, engine-devel engine-de...@ovirt.org
Sent: Tuesday, December 4, 2012 6:08:25 PM
Subject: Re: [vdsm] RFC: New Storage API

> Thanks for sharing this. It's nice to have something a little more
> concrete to think about. Just a few comments and questions inline to
> get some discussion flowing.
>
>> [...]
>> connectStorageRepository(repoId, repoFormat, connectionParameters={}):
>
> We should probably add an options/flags parameter for extension of all
> new APIs.

Usually I agree, but connectionParameters is already generic enough :)

>> [...]
>> disconnectStorageRepository(self, repoId):
>
> I assume 'self' is a mistake here. Just want to clarify given all of
> the recent talk about instances vs. namespaces.

Yea, it's just pasted from my code.

>> In the new API there are only images; some images are mutable and
>> some are not. Mutable images are also called VirtualDisks; immutable
>> images are also called Snapshots.
>
> By mutable you mean writable, right? Or does the word mutable imply
> more than that?

It's a semantic distinction due to implementation details; in general terms, yes.

>> [...]
>> createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
>>                   userData={}, options={}):
>
> Is userData a 'StringMap'?

Currently it's a json object. We could limit it to a string map and trust the client to parse types. We could have it be a string\blob and trust the user to serialize the data. It's a pass-through object either way.

> I will reopen the argument about an options dict vs a flags parameter.
> I oppose the dict for expansion because I think it causes APIs to
> devolve into a mess where lots of arbitrary and not well thought out
> overrides are packed into the dict over time. A flags argument (in
> json and python it can be an enum array) limits us to really switching
> flags on and off instead of passing arbitrary data.

We already have 'strategy', and we know we want to have several options. Other stuff that has been suggested is to be able to override the img format (qcow2\qed). The way I envision it is having a class:

    opts = CommandOptions()
    opts.addStringOption(key, value)
    opts.addIntOption(key, 3)
    opts.addBoolOption(key, True)

I know you could just as well have strategy_space_flag and strategy_performance_flag and fail the operation if they both exist.
Since it is a matter of personal taste, I think it should be decided by a vote.

>> [...]
>> createSnapshot(targetRepoId, baseVirtualDiskId, userData={}, options={}):
>> [...]
>> size - the size of the image you wish to create
>
> Why is this needed? Doesn't the size of a snapshot have to be equal to
> its base image?

Oops, another copy\paste error; you can see this arg doesn't exist in the method signature. My proofreading does need more work.

>> baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
>> to snapshot
>
> Can you snapshot a snapshot? In that case, this parameter should be
> called baseImage.

You can't snapshot a snapshot; it makes no sense, as it can't change and you would get the same object.

>> userData - optional data that will be