Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
+ * Vendor driver should decide whether to partition data section and how to + * partition the data section. Vendor driver should return data_offset + * accordingly. + * + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase + * and for _SAVING device state or stop-and-copy phase: + * a. read pending_bytes. If pending_bytes > 0, go through below steps. + * b. read data_offset, indicates kernel driver to write data to staging buffer. + *Kernel driver should return this read operation only after writing data to + *staging buffer is done. May I know under what condition this data_offset will be changed per each iteration from a-f ? Its upto vendor driver, if vendor driver maintains multiple partitions in data section. So, do you mean each time before doing b (reading data_offset), step a (reading pending_bytes) is mandatory, otherwise the vendor driver does not know which data_offset is? pending_bytes will only tell bytes remaining to transfer from vendor driver. On read operation on data_offset, vendor driver should decide what to send depending on where he is making data available to userspace. Then, any lock to wrap step a and b to ensure atomic? With current QEMU implementation, where migration is single thread, there is not need of lock yet. But when we add multi-threaded support may be in future then locks will be required in userspace side. Thanks, Kirti
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On 11/14/2019 2:10 AM, Alex Williamson wrote: On Thu, 14 Nov 2019 01:47:04 +0530 Kirti Wankhede wrote: On 11/14/2019 1:18 AM, Alex Williamson wrote: On Thu, 14 Nov 2019 00:59:52 +0530 Kirti Wankhede wrote: On 11/13/2019 11:57 PM, Alex Williamson wrote: On Wed, 13 Nov 2019 11:24:17 +0100 Cornelia Huck wrote: On Tue, 12 Nov 2019 15:30:05 -0700 Alex Williamson wrote: On Tue, 12 Nov 2019 22:33:36 +0530 Kirti Wankhede wrote: - Defined MIGRATION region type and sub-type. - Used 3 bits to define VFIO device states. Bit 0 => _RUNNING Bit 1 => _SAVING Bit 2 => _RESUMING Combination of these bits defines VFIO device's state during migration _RUNNING => Normal VFIO device running state. When its reset, it indicates _STOPPED state. when device is changed to _STOPPED, driver should stop device before write() returns. _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start saving state of device i.e. pre-copy state _SAVING => vCPUs are stopped, VFIO device should be stopped, and s/should/must/ save device state,i.e. stop-n-copy state _RESUMING => VFIO device resuming state. _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states A table might be useful here and in the uapi header to indicate valid states: I like that. | _RESUMING | _SAVING | _RUNNING | Description +---+-+--+-- | 0 |0| 0| Stopped, not saving or resuming (a) +---+-+--+-- | 0 |0| 1| Running, default state +---+-+--+-- | 0 |1| 0| Stopped, migration interface in save mode +---+-+--+-- | 0 |1| 1| Running, save mode interface, iterative +---+-+--+-- | 1 |0| 0| Stopped, migration resume interface active +---+-+--+-- | 1 |0| 1| Invalid (b) +---+-+--+-- | 1 |1| 0| Invalid (c) +---+-+--+-- | 1 |1| 1| Invalid (d) I think we need to consider whether we define (a) as generally available, for instance we might want to use it for diagnostics or a fatal error condition outside of migration. Are there hidden assumptions between state transitions here or are there specific next possible state diagrams that we need to include as well? Some kind of state-change diagram might be useful in addition to the textual description anyway. Let me try, just to make sure I understand this correctly: During User application initialization, there is one more state change: 0) 0/0/0 stop to running -> 0/0/1 0/0/0 cannot be the initial state of the device, that would imply that a device supporting this migration interface breaks backwards compatibility with all existing vfio userspace code and that code needs to learn to set the device running as part of its initialization. That's absolutely unacceptable. The initial device state must be 0/0/1. There isn't any device state for all existing vfio userspace code right now. So default its assumed to be always running. Exactly, there is no representation of device state, therefore it's assumed to be running, therefore when adding a representation of device state it must default to running. With migration support, device states are explicitly getting added. For example, in case of QEMU, while device is getting initialized, i.e. from vfio_realize(), device_state is set to 0/0/0, but not required to convey it to vendor driver. But we have a 0/0/0 state, why would we intentionally keep an internal state that's inconsistent with the device? Then with vfio_vmstate_change() notifier, device state is changed to 0/0/1 when VM/vCPU are transitioned to running, at this moment device state is conveyed to vendor driver. So vendor driver doesn't see 0/0/0 state. But the running state is the state of the device, not the VM or the vCPU. Sure we might want to stop the device if the VM/vCPU state is stopped, but we must accept that the device is running when it's opened and we shouldn't intentionally maintain inconsistent state. While resuming, for userspace, for example QEMU, device state change is from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic initialization is done. I don't see why this matters, all device_state transitions are written directly to the
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On Thu, Nov 14, 2019 at 03:02:55AM +0800, Kirti Wankhede wrote: > > > On 11/13/2019 8:53 AM, Yan Zhao wrote: > > On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote: > >> On Tue, 12 Nov 2019 22:33:36 +0530 > >> Kirti Wankhede wrote: > >> > >>> - Defined MIGRATION region type and sub-type. > >>> - Used 3 bits to define VFIO device states. > >>> Bit 0 => _RUNNING > >>> Bit 1 => _SAVING > >>> Bit 2 => _RESUMING > >>> Combination of these bits defines VFIO device's state during > >>> migration > >>> _RUNNING => Normal VFIO device running state. When its reset, it > >>> indicates _STOPPED state. when device is changed to > >>> _STOPPED, driver should stop device before write() > >>> returns. > >>> _SAVING | _RUNNING => vCPUs are running, VFIO device is running but > >>>start saving state of device i.e. pre-copy > >>> state > >>> _SAVING => vCPUs are stopped, VFIO device should be stopped, and > >> > >> s/should/must/ > >> > >>> save device state,i.e. stop-n-copy state > >>> _RESUMING => VFIO device resuming state. > >>> _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states > >> > >> A table might be useful here and in the uapi header to indicate valid > >> states: > >> > >> | _RESUMING | _SAVING | _RUNNING | Description > >> +---+-+--+-- > >> | 0 |0| 0| Stopped, not saving or resuming (a) > >> +---+-+--+-- > >> | 0 |0| 1| Running, default state > >> +---+-+--+-- > >> | 0 |1| 0| Stopped, migration interface in save > >> mode > >> +---+-+--+-- > >> | 0 |1| 1| Running, save mode interface, iterative > >> +---+-+--+-- > >> | 1 |0| 0| Stopped, migration resume interface > >> active > >> +---+-+--+-- > >> | 1 |0| 1| Invalid (b) > >> +---+-+--+-- > >> | 1 |1| 0| Invalid (c) > >> +---+-+--+-- > >> | 1 |1| 1| Invalid (d) > >> > >> I think we need to consider whether we define (a) as generally > >> available, for instance we might want to use it for diagnostics or a > >> fatal error condition outside of migration. > >> > > We have to set it as init state. I'll add this it. > > >> Are there hidden assumptions between state transitions here or are > >> there specific next possible state diagrams that we need to include as > >> well? > >> > >> I'm curious if Intel agrees with the states marked invalid with their > >> push for post-copy support. > >> > > hi Alex and Kirti, > > Actually, for postcopy, I think we anyway need an extra POSTCOPY state > > introduced. Reasons as below: > > - in the target side, _RSESUMING state is set in the beginning of > >migration. It cannot be used as a state to inform device of that > >currently it's in postcopy state and device DMAs are to be trapped and > >pre-faulted. > >we also cannot use state (_RESUMING + _RUNNING) as an indicator of > >postcopy, because before device & vm running in target side, some device > >state are already loaded (e.g. page tables, pending workloads), > >target side can do pre-pagefault at that period before target vm up. > > - in the source side, after device is stopped, postcopy needs saving > >device state only (as compared to device state + remaining dirty > >pages in precopy). state (!_RUNNING + _SAVING) here again cannot > >differentiate precopy and postcopy here. > > > >>> Bits 3 - 31 are reserved for future use. User should perform > >>> read-modify-write operation on this field. > >>> - Defined vfio_device_migration_info structure which will be placed at 0th > >>>offset of migration region to get/set VFIO device related information. > >>>Defined members of structure and usage on read/write access: > >>> * device_state: (read/write) > >>> To convey VFIO device state to be transitioned to. Only 3 bits > >>> are > >>> used as of now, Bits 3 - 31 are reserved for future use. > >>> * pending bytes: (read only) > >>> To get pending bytes yet to be migrated for VFIO device. > >>> * data_offset: (read only) > >>> To get data offset in migration region from where data exist > >>> during _SAVING and from where data should be written by user space > >>> application during _RESUMING state. > >>> * data_size:
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On Thu, 14 Nov 2019 01:47:04 +0530 Kirti Wankhede wrote: > On 11/14/2019 1:18 AM, Alex Williamson wrote: > > On Thu, 14 Nov 2019 00:59:52 +0530 > > Kirti Wankhede wrote: > > > >> On 11/13/2019 11:57 PM, Alex Williamson wrote: > >>> On Wed, 13 Nov 2019 11:24:17 +0100 > >>> Cornelia Huck wrote: > >>> > On Tue, 12 Nov 2019 15:30:05 -0700 > Alex Williamson wrote: > > > On Tue, 12 Nov 2019 22:33:36 +0530 > > Kirti Wankhede wrote: > > > >> - Defined MIGRATION region type and sub-type. > >> - Used 3 bits to define VFIO device states. > >> Bit 0 => _RUNNING > >> Bit 1 => _SAVING > >> Bit 2 => _RESUMING > >> Combination of these bits defines VFIO device's state during > >> migration > >> _RUNNING => Normal VFIO device running state. When its reset, it > >>indicates _STOPPED state. when device is changed to > >>_STOPPED, driver should stop device before write() > >>returns. > >> _SAVING | _RUNNING => vCPUs are running, VFIO device is running > >> but > >> start saving state of device i.e. pre-copy > >> state > >> _SAVING => vCPUs are stopped, VFIO device should be stopped, > >> and > > > > s/should/must/ > > > >> save device state,i.e. stop-n-copy state > >> _RESUMING => VFIO device resuming state. > >> _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states > > > > A table might be useful here and in the uapi header to indicate valid > > states: > > I like that. > > > > > | _RESUMING | _SAVING | _RUNNING | Description > > +---+-+--+-- > > | 0 |0| 0| Stopped, not saving or resuming (a) > > +---+-+--+-- > > | 0 |0| 1| Running, default state > > +---+-+--+-- > > | 0 |1| 0| Stopped, migration interface in save > > mode > > +---+-+--+-- > > | 0 |1| 1| Running, save mode interface, > > iterative > > +---+-+--+-- > > | 1 |0| 0| Stopped, migration resume interface > > active > > +---+-+--+-- > > | 1 |0| 1| Invalid (b) > > +---+-+--+-- > > | 1 |1| 0| Invalid (c) > > +---+-+--+-- > > | 1 |1| 1| Invalid (d) > > > > I think we need to consider whether we define (a) as generally > > available, for instance we might want to use it for diagnostics or a > > fatal error condition outside of migration. > > > > Are there hidden assumptions between state transitions here or are > > there specific next possible state diagrams that we need to include as > > well? > > Some kind of state-change diagram might be useful in addition to the > textual description anyway. Let me try, just to make sure I understand > this correctly: > > >> > >> During User application initialization, there is one more state change: > >> > >> 0) 0/0/0 stop to running -> 0/0/1 > > > > 0/0/0 cannot be the initial state of the device, that would imply that > > a device supporting this migration interface breaks backwards > > compatibility with all existing vfio userspace code and that code needs > > to learn to set the device running as part of its initialization. > > That's absolutely unacceptable. The initial device state must be 0/0/1. > > > > There isn't any device state for all existing vfio userspace code right > now. So default its assumed to be always running. Exactly, there is no representation of device state, therefore it's assumed to be running, therefore when adding a representation of device state it must default to running. > With migration support, device states are explicitly getting added. For > example, in case of QEMU, while device is getting initialized, i.e. from > vfio_realize(), device_state is set to 0/0/0, but not required to convey > it to vendor driver. But we have a 0/0/0 state, why would we intentionally keep an internal state that's inconsistent with the device? > Then with vfio_vmstate_change() notifier, device > state is changed to 0/0/1 when VM/vCPU are transitioned to running, at > this moment device state is
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On 11/14/2019 1:18 AM, Alex Williamson wrote: On Thu, 14 Nov 2019 00:59:52 +0530 Kirti Wankhede wrote: On 11/13/2019 11:57 PM, Alex Williamson wrote: On Wed, 13 Nov 2019 11:24:17 +0100 Cornelia Huck wrote: On Tue, 12 Nov 2019 15:30:05 -0700 Alex Williamson wrote: On Tue, 12 Nov 2019 22:33:36 +0530 Kirti Wankhede wrote: - Defined MIGRATION region type and sub-type. - Used 3 bits to define VFIO device states. Bit 0 => _RUNNING Bit 1 => _SAVING Bit 2 => _RESUMING Combination of these bits defines VFIO device's state during migration _RUNNING => Normal VFIO device running state. When its reset, it indicates _STOPPED state. when device is changed to _STOPPED, driver should stop device before write() returns. _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start saving state of device i.e. pre-copy state _SAVING => vCPUs are stopped, VFIO device should be stopped, and s/should/must/ save device state,i.e. stop-n-copy state _RESUMING => VFIO device resuming state. _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states A table might be useful here and in the uapi header to indicate valid states: I like that. | _RESUMING | _SAVING | _RUNNING | Description +---+-+--+-- | 0 |0| 0| Stopped, not saving or resuming (a) +---+-+--+-- | 0 |0| 1| Running, default state +---+-+--+-- | 0 |1| 0| Stopped, migration interface in save mode +---+-+--+-- | 0 |1| 1| Running, save mode interface, iterative +---+-+--+-- | 1 |0| 0| Stopped, migration resume interface active +---+-+--+-- | 1 |0| 1| Invalid (b) +---+-+--+-- | 1 |1| 0| Invalid (c) +---+-+--+-- | 1 |1| 1| Invalid (d) I think we need to consider whether we define (a) as generally available, for instance we might want to use it for diagnostics or a fatal error condition outside of migration. Are there hidden assumptions between state transitions here or are there specific next possible state diagrams that we need to include as well? Some kind of state-change diagram might be useful in addition to the textual description anyway. Let me try, just to make sure I understand this correctly: During User application initialization, there is one more state change: 0) 0/0/0 stop to running -> 0/0/1 0/0/0 cannot be the initial state of the device, that would imply that a device supporting this migration interface breaks backwards compatibility with all existing vfio userspace code and that code needs to learn to set the device running as part of its initialization. That's absolutely unacceptable. The initial device state must be 0/0/1. There isn't any device state for all existing vfio userspace code right now. So default its assumed to be always running. With migration support, device states are explicitly getting added. For example, in case of QEMU, while device is getting initialized, i.e. from vfio_realize(), device_state is set to 0/0/0, but not required to convey it to vendor driver. Then with vfio_vmstate_change() notifier, device state is changed to 0/0/1 when VM/vCPU are transitioned to running, at this moment device state is conveyed to vendor driver. So vendor driver doesn't see 0/0/0 state. While resuming, for userspace, for example QEMU, device state change is from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic initialization is done. 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1 not just gathering state info, but also copy device state to be transferred during pre-copy phase. Below 2 state are not just to tell driver to stop, those 2 differ. 2) is device state changed from running to stop, this is when VM shutdowns cleanly, no need to save device state Userspace is under no obligation to perform this state change though, backwards compatibility dictates this. 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 3) 0/1/1 ---(tell driver to stop)---> 0/1/0 above is transition from pre-copy phase to stop-and-copy phase, where device data should be made available to user to transfer to destination or to save it to file in case of save VM or suspend. 4) 0/0/1 ---(tell driver to
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On Thu, 14 Nov 2019 00:59:52 +0530 Kirti Wankhede wrote: > On 11/13/2019 11:57 PM, Alex Williamson wrote: > > On Wed, 13 Nov 2019 11:24:17 +0100 > > Cornelia Huck wrote: > > > >> On Tue, 12 Nov 2019 15:30:05 -0700 > >> Alex Williamson wrote: > >> > >>> On Tue, 12 Nov 2019 22:33:36 +0530 > >>> Kirti Wankhede wrote: > >>> > - Defined MIGRATION region type and sub-type. > - Used 3 bits to define VFIO device states. > Bit 0 => _RUNNING > Bit 1 => _SAVING > Bit 2 => _RESUMING > Combination of these bits defines VFIO device's state during > migration > _RUNNING => Normal VFIO device running state. When its reset, it > indicates _STOPPED state. when device is changed to > _STOPPED, driver should stop device before write() > returns. > _SAVING | _RUNNING => vCPUs are running, VFIO device is running but > start saving state of device i.e. pre-copy > state > _SAVING => vCPUs are stopped, VFIO device should be stopped, and > >>> > >>> s/should/must/ > >>> > save device state,i.e. stop-n-copy state > _RESUMING => VFIO device resuming state. > _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states > >>> > >>> A table might be useful here and in the uapi header to indicate valid > >>> states: > >> > >> I like that. > >> > >>> > >>> | _RESUMING | _SAVING | _RUNNING | Description > >>> +---+-+--+-- > >>> | 0 |0| 0| Stopped, not saving or resuming (a) > >>> +---+-+--+-- > >>> | 0 |0| 1| Running, default state > >>> +---+-+--+-- > >>> | 0 |1| 0| Stopped, migration interface in save > >>> mode > >>> +---+-+--+-- > >>> | 0 |1| 1| Running, save mode interface, iterative > >>> +---+-+--+-- > >>> | 1 |0| 0| Stopped, migration resume interface > >>> active > >>> +---+-+--+-- > >>> | 1 |0| 1| Invalid (b) > >>> +---+-+--+-- > >>> | 1 |1| 0| Invalid (c) > >>> +---+-+--+-- > >>> | 1 |1| 1| Invalid (d) > >>> > >>> I think we need to consider whether we define (a) as generally > >>> available, for instance we might want to use it for diagnostics or a > >>> fatal error condition outside of migration. > >>> > >>> Are there hidden assumptions between state transitions here or are > >>> there specific next possible state diagrams that we need to include as > >>> well? > >> > >> Some kind of state-change diagram might be useful in addition to the > >> textual description anyway. Let me try, just to make sure I understand > >> this correctly: > >> > > During User application initialization, there is one more state change: > > 0) 0/0/0 stop to running -> 0/0/1 0/0/0 cannot be the initial state of the device, that would imply that a device supporting this migration interface breaks backwards compatibility with all existing vfio userspace code and that code needs to learn to set the device running as part of its initialization. That's absolutely unacceptable. The initial device state must be 0/0/1. > >> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1 > > not just gathering state info, but also copy device state to be > transferred during pre-copy phase. > > Below 2 state are not just to tell driver to stop, those 2 differ. > 2) is device state changed from running to stop, this is when VM > shutdowns cleanly, no need to save device state Userspace is under no obligation to perform this state change though, backwards compatibility dictates this. > >> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 > > >> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0 > > above is transition from pre-copy phase to stop-and-copy phase, where > device data should be made available to user to transfer to destination > or to save it to file in case of save VM or suspend. > > > >> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0 > > > > I think this is to switch into resuming mode, the data will follow > > >> 5) 1/0/0 ---(driver is ready)---> 0/0/1 > >> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1 > > > > above can occur on migration cancelled or failed. > > > > I think also: > > > > 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On 11/13/2019 11:57 PM, Alex Williamson wrote: On Wed, 13 Nov 2019 11:24:17 +0100 Cornelia Huck wrote: On Tue, 12 Nov 2019 15:30:05 -0700 Alex Williamson wrote: On Tue, 12 Nov 2019 22:33:36 +0530 Kirti Wankhede wrote: - Defined MIGRATION region type and sub-type. - Used 3 bits to define VFIO device states. Bit 0 => _RUNNING Bit 1 => _SAVING Bit 2 => _RESUMING Combination of these bits defines VFIO device's state during migration _RUNNING => Normal VFIO device running state. When its reset, it indicates _STOPPED state. when device is changed to _STOPPED, driver should stop device before write() returns. _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start saving state of device i.e. pre-copy state _SAVING => vCPUs are stopped, VFIO device should be stopped, and s/should/must/ save device state,i.e. stop-n-copy state _RESUMING => VFIO device resuming state. _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states A table might be useful here and in the uapi header to indicate valid states: I like that. | _RESUMING | _SAVING | _RUNNING | Description +---+-+--+-- | 0 |0| 0| Stopped, not saving or resuming (a) +---+-+--+-- | 0 |0| 1| Running, default state +---+-+--+-- | 0 |1| 0| Stopped, migration interface in save mode +---+-+--+-- | 0 |1| 1| Running, save mode interface, iterative +---+-+--+-- | 1 |0| 0| Stopped, migration resume interface active +---+-+--+-- | 1 |0| 1| Invalid (b) +---+-+--+-- | 1 |1| 0| Invalid (c) +---+-+--+-- | 1 |1| 1| Invalid (d) I think we need to consider whether we define (a) as generally available, for instance we might want to use it for diagnostics or a fatal error condition outside of migration. Are there hidden assumptions between state transitions here or are there specific next possible state diagrams that we need to include as well? Some kind of state-change diagram might be useful in addition to the textual description anyway. Let me try, just to make sure I understand this correctly: During User application initialization, there is one more state change: 0) 0/0/0 stop to running -> 0/0/1 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1 not just gathering state info, but also copy device state to be transferred during pre-copy phase. Below 2 state are not just to tell driver to stop, those 2 differ. 2) is device state changed from running to stop, this is when VM shutdowns cleanly, no need to save device state 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 3) 0/1/1 ---(tell driver to stop)---> 0/1/0 above is transition from pre-copy phase to stop-and-copy phase, where device data should be made available to user to transfer to destination or to save it to file in case of save VM or suspend. 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0 I think this is to switch into resuming mode, the data will follow > 5) 1/0/0 ---(driver is ready)---> 0/0/1 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1 above can occur on migration cancelled or failed. I think also: 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy that's right, this happens in case of save VM or suspend VM. 0/0/0 and 0/0/1 should be reachable from any state, though I could see that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the received state is incomplete. Somehow though a user always needs to return the device to the initial state, so how does device_state interact with the reset ioctl? Would this automatically manipulate device_state back to 0/0/1? why would reset occur on 1/0/0 -> 0/0/1 failure? 1/0/0 -> 0/0/1 fails, then user should convey that to source that migration has failed, then resume at source. Not sure about the usefulness of 2). I explained this above. Also, is 4) the only way to trigger resuming? Yes. And is the change in 5) performed by the driver, or by userspace? By userspace. Are any other state transitions valid? (...) + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase + * and for _SAVING device state or stop-and-copy phase: + * a. read
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On 11/13/2019 8:53 AM, Yan Zhao wrote: On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote: On Tue, 12 Nov 2019 22:33:36 +0530 Kirti Wankhede wrote: - Defined MIGRATION region type and sub-type. - Used 3 bits to define VFIO device states. Bit 0 => _RUNNING Bit 1 => _SAVING Bit 2 => _RESUMING Combination of these bits defines VFIO device's state during migration _RUNNING => Normal VFIO device running state. When its reset, it indicates _STOPPED state. when device is changed to _STOPPED, driver should stop device before write() returns. _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start saving state of device i.e. pre-copy state _SAVING => vCPUs are stopped, VFIO device should be stopped, and s/should/must/ save device state,i.e. stop-n-copy state _RESUMING => VFIO device resuming state. _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states A table might be useful here and in the uapi header to indicate valid states: | _RESUMING | _SAVING | _RUNNING | Description +---+-+--+-- | 0 |0| 0| Stopped, not saving or resuming (a) +---+-+--+-- | 0 |0| 1| Running, default state +---+-+--+-- | 0 |1| 0| Stopped, migration interface in save mode +---+-+--+-- | 0 |1| 1| Running, save mode interface, iterative +---+-+--+-- | 1 |0| 0| Stopped, migration resume interface active +---+-+--+-- | 1 |0| 1| Invalid (b) +---+-+--+-- | 1 |1| 0| Invalid (c) +---+-+--+-- | 1 |1| 1| Invalid (d) I think we need to consider whether we define (a) as generally available, for instance we might want to use it for diagnostics or a fatal error condition outside of migration. We have to set it as init state. I'll add this it. Are there hidden assumptions between state transitions here or are there specific next possible state diagrams that we need to include as well? I'm curious if Intel agrees with the states marked invalid with their push for post-copy support. hi Alex and Kirti, Actually, for postcopy, I think we anyway need an extra POSTCOPY state introduced. Reasons as below: - in the target side, _RSESUMING state is set in the beginning of migration. It cannot be used as a state to inform device of that currently it's in postcopy state and device DMAs are to be trapped and pre-faulted. we also cannot use state (_RESUMING + _RUNNING) as an indicator of postcopy, because before device & vm running in target side, some device state are already loaded (e.g. page tables, pending workloads), target side can do pre-pagefault at that period before target vm up. - in the source side, after device is stopped, postcopy needs saving device state only (as compared to device state + remaining dirty pages in precopy). state (!_RUNNING + _SAVING) here again cannot differentiate precopy and postcopy here. Bits 3 - 31 are reserved for future use. User should perform read-modify-write operation on this field. - Defined vfio_device_migration_info structure which will be placed at 0th offset of migration region to get/set VFIO device related information. Defined members of structure and usage on read/write access: * device_state: (read/write) To convey VFIO device state to be transitioned to. Only 3 bits are used as of now, Bits 3 - 31 are reserved for future use. * pending bytes: (read only) To get pending bytes yet to be migrated for VFIO device. * data_offset: (read only) To get data offset in migration region from where data exist during _SAVING and from where data should be written by user space application during _RESUMING state. * data_size: (read/write) To get and set size in bytes of data copied in migration region during _SAVING and _RESUMING state. Migration region looks like: -- |vfio_device_migration_info|data section | | | /// | -- ^ ^ offset 0-trapped part
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On Wed, 13 Nov 2019 11:24:17 +0100 Cornelia Huck wrote: > On Tue, 12 Nov 2019 15:30:05 -0700 > Alex Williamson wrote: > > > On Tue, 12 Nov 2019 22:33:36 +0530 > > Kirti Wankhede wrote: > > > > > - Defined MIGRATION region type and sub-type. > > > - Used 3 bits to define VFIO device states. > > > Bit 0 => _RUNNING > > > Bit 1 => _SAVING > > > Bit 2 => _RESUMING > > > Combination of these bits defines VFIO device's state during migration > > > _RUNNING => Normal VFIO device running state. When its reset, it > > > indicates _STOPPED state. when device is changed to > > > _STOPPED, driver should stop device before write() > > > returns. > > > _SAVING | _RUNNING => vCPUs are running, VFIO device is running but > > > start saving state of device i.e. pre-copy state > > > _SAVING => vCPUs are stopped, VFIO device should be stopped, and > > > > s/should/must/ > > > > > save device state,i.e. stop-n-copy state > > > _RESUMING => VFIO device resuming state. > > > _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states > > > > A table might be useful here and in the uapi header to indicate valid > > states: > > I like that. > > > > > | _RESUMING | _SAVING | _RUNNING | Description > > +---+-+--+-- > > | 0 |0| 0| Stopped, not saving or resuming (a) > > +---+-+--+-- > > | 0 |0| 1| Running, default state > > +---+-+--+-- > > | 0 |1| 0| Stopped, migration interface in save mode > > +---+-+--+-- > > | 0 |1| 1| Running, save mode interface, iterative > > +---+-+--+-- > > | 1 |0| 0| Stopped, migration resume interface > > active > > +---+-+--+-- > > | 1 |0| 1| Invalid (b) > > +---+-+--+-- > > | 1 |1| 0| Invalid (c) > > +---+-+--+-- > > | 1 |1| 1| Invalid (d) > > > > I think we need to consider whether we define (a) as generally > > available, for instance we might want to use it for diagnostics or a > > fatal error condition outside of migration. > > > > Are there hidden assumptions between state transitions here or are > > there specific next possible state diagrams that we need to include as > > well? > > Some kind of state-change diagram might be useful in addition to the > textual description anyway. Let me try, just to make sure I understand > this correctly: > > 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1 > 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 > 3) 0/1/1 ---(tell driver to stop)---> 0/1/0 > 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0 I think this is to switch into resuming mode, the data will follow > 5) 1/0/0 ---(driver is ready)---> 0/0/1 > 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1 I think also: 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy 0/0/0 and 0/0/1 should be reachable from any state, though I could see that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the received state is incomplete. Somehow though a user always needs to return the device to the initial state, so how does device_state interact with the reset ioctl? Would this automatically manipulate device_state back to 0/0/1? > Not sure about the usefulness of 2). Also, is 4) the only way to > trigger resuming? And is the change in 5) performed by the driver, or > by userspace? > > Are any other state transitions valid? > > (...) > > > > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy > > > phase > > > + * and for _SAVING device state or stop-and-copy phase: > > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps. > > > + * b. read data_offset, indicates kernel driver to write data to staging > > > buffer. > > > + *Kernel driver should return this read operation only after writing > > > data to > > > + *staging buffer is done. > > > > "staging buffer" implies a vendor driver implementation, perhaps we > > could just state that data is available from (region + data_offset) to > > (region + data_offset + data_size) upon return of this read operation. > > > > > + * c. read data_size, amount of data in bytes written by vendor driver in > > > + *migration region. > > > + * d. read data_size bytes of data from data_offset in the migration > > >
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 12 Nov 2019 15:30:05 -0700 Alex Williamson wrote: > On Tue, 12 Nov 2019 22:33:36 +0530 > Kirti Wankhede wrote: > > > - Defined MIGRATION region type and sub-type. > > - Used 3 bits to define VFIO device states. > > Bit 0 => _RUNNING > > Bit 1 => _SAVING > > Bit 2 => _RESUMING > > Combination of these bits defines VFIO device's state during migration > > _RUNNING => Normal VFIO device running state. When its reset, it > > indicates _STOPPED state. when device is changed to > > _STOPPED, driver should stop device before write() > > returns. > > _SAVING | _RUNNING => vCPUs are running, VFIO device is running but > > start saving state of device i.e. pre-copy state > > _SAVING => vCPUs are stopped, VFIO device should be stopped, and > > s/should/must/ > > > save device state,i.e. stop-n-copy state > > _RESUMING => VFIO device resuming state. > > _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states > > A table might be useful here and in the uapi header to indicate valid > states: I like that. > > | _RESUMING | _SAVING | _RUNNING | Description > +---+-+--+-- > | 0 |0| 0| Stopped, not saving or resuming (a) > +---+-+--+-- > | 0 |0| 1| Running, default state > +---+-+--+-- > | 0 |1| 0| Stopped, migration interface in save mode > +---+-+--+-- > | 0 |1| 1| Running, save mode interface, iterative > +---+-+--+-- > | 1 |0| 0| Stopped, migration resume interface active > +---+-+--+-- > | 1 |0| 1| Invalid (b) > +---+-+--+-- > | 1 |1| 0| Invalid (c) > +---+-+--+-- > | 1 |1| 1| Invalid (d) > > I think we need to consider whether we define (a) as generally > available, for instance we might want to use it for diagnostics or a > fatal error condition outside of migration. > > Are there hidden assumptions between state transitions here or are > there specific next possible state diagrams that we need to include as > well? Some kind of state-change diagram might be useful in addition to the textual description anyway. Let me try, just to make sure I understand this correctly: 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 3) 0/1/1 ---(tell driver to stop)---> 0/1/0 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0 5) 1/0/0 ---(driver is ready)---> 0/0/1 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1 Not sure about the usefulness of 2). Also, is 4) the only way to trigger resuming? And is the change in 5) performed by the driver, or by userspace? Are any other state transitions valid? (...) > > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy > > phase > > + * and for _SAVING device state or stop-and-copy phase: > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps. > > + * b. read data_offset, indicates kernel driver to write data to staging > > buffer. > > + *Kernel driver should return this read operation only after writing > > data to > > + *staging buffer is done. > > "staging buffer" implies a vendor driver implementation, perhaps we > could just state that data is available from (region + data_offset) to > (region + data_offset + data_size) upon return of this read operation. > > > + * c. read data_size, amount of data in bytes written by vendor driver in > > + *migration region. > > + * d. read data_size bytes of data from data_offset in the migration > > region. > > + * e. process data. > > + * f. Loop through a to e. Next read on pending_bytes indicates that read > > data > > + *operation from migration region for previous iteration is done. > > I think this indicate that step (f) should be to read pending_bytes, the > read sequence is not complete until this step. Optionally the user can > then proceed to step (b). There are no read side-effects of (a) afaict. > > Is the use required to reach pending_bytes == 0 before changing > device_state, particularly transitioning to !_RUNNING? Presumably the > user can exit this sequence at any time by clearing _SAVING. That would be transition 6) above (abort saving and continue). I think it makes sense not to forbid this. > > > + * > > + * Sequence to be followed
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote: > On Tue, 12 Nov 2019 22:33:36 +0530 > Kirti Wankhede wrote: > > > - Defined MIGRATION region type and sub-type. > > - Used 3 bits to define VFIO device states. > > Bit 0 => _RUNNING > > Bit 1 => _SAVING > > Bit 2 => _RESUMING > > Combination of these bits defines VFIO device's state during migration > > _RUNNING => Normal VFIO device running state. When its reset, it > > indicates _STOPPED state. when device is changed to > > _STOPPED, driver should stop device before write() > > returns. > > _SAVING | _RUNNING => vCPUs are running, VFIO device is running but > > start saving state of device i.e. pre-copy state > > _SAVING => vCPUs are stopped, VFIO device should be stopped, and > > s/should/must/ > > > save device state,i.e. stop-n-copy state > > _RESUMING => VFIO device resuming state. > > _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states > > A table might be useful here and in the uapi header to indicate valid > states: > > | _RESUMING | _SAVING | _RUNNING | Description > +---+-+--+-- > | 0 |0| 0| Stopped, not saving or resuming (a) > +---+-+--+-- > | 0 |0| 1| Running, default state > +---+-+--+-- > | 0 |1| 0| Stopped, migration interface in save mode > +---+-+--+-- > | 0 |1| 1| Running, save mode interface, iterative > +---+-+--+-- > | 1 |0| 0| Stopped, migration resume interface active > +---+-+--+-- > | 1 |0| 1| Invalid (b) > +---+-+--+-- > | 1 |1| 0| Invalid (c) > +---+-+--+-- > | 1 |1| 1| Invalid (d) > > I think we need to consider whether we define (a) as generally > available, for instance we might want to use it for diagnostics or a > fatal error condition outside of migration. > > Are there hidden assumptions between state transitions here or are > there specific next possible state diagrams that we need to include as > well? > > I'm curious if Intel agrees with the states marked invalid with their > push for post-copy support. > hi Alex and Kirti, Actually, for postcopy, I think we anyway need an extra POSTCOPY state introduced. Reasons as below: - in the target side, _RSESUMING state is set in the beginning of migration. It cannot be used as a state to inform device of that currently it's in postcopy state and device DMAs are to be trapped and pre-faulted. we also cannot use state (_RESUMING + _RUNNING) as an indicator of postcopy, because before device & vm running in target side, some device state are already loaded (e.g. page tables, pending workloads), target side can do pre-pagefault at that period before target vm up. - in the source side, after device is stopped, postcopy needs saving device state only (as compared to device state + remaining dirty pages in precopy). state (!_RUNNING + _SAVING) here again cannot differentiate precopy and postcopy here. > > Bits 3 - 31 are reserved for future use. User should perform > > read-modify-write operation on this field. > > - Defined vfio_device_migration_info structure which will be placed at 0th > > offset of migration region to get/set VFIO device related information. > > Defined members of structure and usage on read/write access: > > * device_state: (read/write) > > To convey VFIO device state to be transitioned to. Only 3 bits are > > used as of now, Bits 3 - 31 are reserved for future use. > > * pending bytes: (read only) > > To get pending bytes yet to be migrated for VFIO device. > > * data_offset: (read only) > > To get data offset in migration region from where data exist > > during _SAVING and from where data should be written by user space > > application during _RESUMING state. > > * data_size: (read/write) > > To get and set size in bytes of data copied in migration region > > during _SAVING and _RESUMING state. > > > > Migration region looks like: > > -- > > |vfio_device_migration_info|data section | > > | | /// | > > -- > > ^
Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 12 Nov 2019 22:33:36 +0530 Kirti Wankhede wrote: > - Defined MIGRATION region type and sub-type. > - Used 3 bits to define VFIO device states. > Bit 0 => _RUNNING > Bit 1 => _SAVING > Bit 2 => _RESUMING > Combination of these bits defines VFIO device's state during migration > _RUNNING => Normal VFIO device running state. When its reset, it > indicates _STOPPED state. when device is changed to > _STOPPED, driver should stop device before write() > returns. > _SAVING | _RUNNING => vCPUs are running, VFIO device is running but > start saving state of device i.e. pre-copy state > _SAVING => vCPUs are stopped, VFIO device should be stopped, and s/should/must/ > save device state,i.e. stop-n-copy state > _RESUMING => VFIO device resuming state. > _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states A table might be useful here and in the uapi header to indicate valid states: | _RESUMING | _SAVING | _RUNNING | Description +---+-+--+-- | 0 |0| 0| Stopped, not saving or resuming (a) +---+-+--+-- | 0 |0| 1| Running, default state +---+-+--+-- | 0 |1| 0| Stopped, migration interface in save mode +---+-+--+-- | 0 |1| 1| Running, save mode interface, iterative +---+-+--+-- | 1 |0| 0| Stopped, migration resume interface active +---+-+--+-- | 1 |0| 1| Invalid (b) +---+-+--+-- | 1 |1| 0| Invalid (c) +---+-+--+-- | 1 |1| 1| Invalid (d) I think we need to consider whether we define (a) as generally available, for instance we might want to use it for diagnostics or a fatal error condition outside of migration. Are there hidden assumptions between state transitions here or are there specific next possible state diagrams that we need to include as well? I'm curious if Intel agrees with the states marked invalid with their push for post-copy support. > Bits 3 - 31 are reserved for future use. User should perform > read-modify-write operation on this field. > - Defined vfio_device_migration_info structure which will be placed at 0th > offset of migration region to get/set VFIO device related information. > Defined members of structure and usage on read/write access: > * device_state: (read/write) > To convey VFIO device state to be transitioned to. Only 3 bits are > used as of now, Bits 3 - 31 are reserved for future use. > * pending bytes: (read only) > To get pending bytes yet to be migrated for VFIO device. > * data_offset: (read only) > To get data offset in migration region from where data exist > during _SAVING and from where data should be written by user space > application during _RESUMING state. > * data_size: (read/write) > To get and set size in bytes of data copied in migration region > during _SAVING and _RESUMING state. > > Migration region looks like: > -- > |vfio_device_migration_info|data section | > | | /// | > -- > ^ ^ > offset 0-trapped partdata_offset > > Structure vfio_device_migration_info is always followed by data section > in the region, so data_offset will always be non-0. Offset from where data > to be copied is decided by kernel driver, data section can be trapped or > mapped depending on how kernel driver defines data section. > Data section partition can be defined as mapped by sparse mmap capability. > If mmapped, then data_offset should be page aligned, where as initial > section which contain vfio_device_migration_info structure might not end > at offset which is page aligned. > Vendor driver should decide whether to partition data section and how to > partition the data section. Vendor driver should return data_offset > accordingly. > > For user application, data is opaque. User should write data in the same > order as received. > > Signed-off-by: Kirti Wankhede > Reviewed-by: Neo Jia > --- > include/uapi/linux/vfio.h | 108 > ++ > 1 file changed, 108 insertions(+) > > diff