Re: [Xen-devel] [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2019-09-26 Thread Spassov, Stanislav
Hello Pasi,

Unfortunately, I am not able to continue the work on the Xen patches in
the foreseeable future.

For what it's worth: the xen-pciback workaround from this thread solves
my current issue as confirmed by internal testing.

-- Stanislav

(apologies for ugly footer injected below by company SMTP server
due to local laws)

On Thu, 2019-09-26 at 13:13 +0300, Pasi Kärkkäinen wrote:
> Hello Stanislav,
> 
> On Fri, Sep 13, 2019 at 11:28:20PM +0800, Chao Gao wrote:
> > On Fri, Sep 13, 2019 at 10:02:24AM +, Spassov, Stanislav wrote:
> > > On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
> > > > On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
> > > > > > > > On 13.12.18 at 04:46,  wrote:
> > > > > > 
> > > > > > On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich
> > > > > > wrote:
> > > > > > > > > > On 12.12.18 at 16:18,  wrote:
> > > > > > > > 
> > > > > > > > On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich
> > > > > > > > wrote:
> > > > > > > > > > > > On 12.12.18 at 08:06, 
> > > > > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris
> > > > > > > > > > Ostrovsky wrote:
> > > > > > > > > > > On 12/5/18 4:32 AM, Roger Pau Monné wrote:
> > > > > > > > > > > > On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao
> > > > > > > > > > > > Gao wrote:
> > > > > > > > > > > > > I find some pass-thru devices don't work any
> > > > > > > > > > > > > more across guest reboot.
> > > > > > > > > > > > > Assigning it to another guest also meets the
> > > > > > > > > > > > > same issue. And the only
> > > > > > > > > > > > > way to make it work again is un-binding and
> > > > > > > > > > > > > binding it to pciback.
> > > > > > > > > > > > > Someone reported this issue one year ago [1].
> > > > > > > > > > > > > More detail also can be
> > > > > > > > > > > > > found in [2].
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The root-cause is Xen's internal MSI-X state
> > > > > > > > > > > > > isn't reset properly
> > > > > > > > > > > > > during reboot or re-assignment. In the above
> > > > > > > > > > > > > case, Xen set maskall bit
> > > > > > > > > > > > > to mask all MSI interrupts after it detected
> > > > > > > > > > > > > a potential security
> > > > > > > > > > > > > issue. Even after device reset, Xen didn't
> > > > > > > > > > > > > reset its internal maskall
> > > > > > > > > > > > > bit. As a result, maskall bit would be set
> > > > > > > > > > > > > again in next write to
> > > > > > > > > > > > > MSI-X message control register.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Given that PHYSDEVOPS_prepare_msix() also
> > > > > > > > > > > > > triggers Xen resetting MSI-X
> > > > > > > > > > > > > internal state of a device, we employ it to
> > > > > > > > > > > > > fix this issue rather than
> > > > > > > > > > > > > introducing another dedicated sub-hypercall.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Note that PHYSDEVOPS_release_msix() will fail
> > > > > > > > > > > > > if the mapping between
> > > > > > > > > > > > > the device's msix and pirq has been created.
> > > > > > > > > > > > > This limitation prevents
> > > > > > > > > > > > > us calling this function when detaching a
> > > > > > > > > > > > > device from a guest during
> > > > > > > > > > > > > guest shutdown. Thus it is called right
> > > > > > > > > > > > > before calling
> > > > > > > > > > > > > PHYSDEVOPS_prepare_msix().
> > > > > > > > > > > > 
> > > > > > > > > > > > s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then
> > > > > > > > > > > > I would also drop the
> > > > > > > > > > > > () at the end of the hypercall name since it's
> > > > > > > > > > > > not a function.
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm also wondering why the release can't be
> > > > > > > > > > > > done when the device is
> > > > > > > > > > > > detached from the guest (or the guest has been
> > > > > > > > > > > > shut down). This makes
> > > > > > > > > > > > me worry about the raciness of the
> > > > > > > > > > > > attach/detach procedure: if there's
> > > > > > > > > > > > a state where pciback assumes the device has
> > > > > > > > > > > > been detached from the
> > > > > > > > > > > > guest, but there are still pirqs bound, an
> > > > > > > > > > > > attempt to attach to
> > > > > > > > > > > > another guest in such state will fail.
> > > > > > > > > > > 
> > > > > > > > > > > I wonder whether this additional reset
> > > > > > > > > > > functionality could be done out
> > > > > > > > > > > of xen_pcibk_xenbus_remove(). We first do a (best
> > > > > > > > > > > effort) device reset
> > > > > > > > > > > and then do the extra things that are not
> > > > > > > > > > > properly done there.
> > > > > > > > > > 
> > > > > > > > > > No. It cannot be done in xen_pcibk_xenbus_remove()
> > > > > > > > > > without modifying
> > > > > > > > > > the handler of PHYSDEVOP_release_msix. To do a
> > > > > > > > > > successful Xen internal
> > > > > > > > > > 

Re: [Xen-devel] [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2019-09-26 Thread Pasi Kärkkäinen
Hello Stanislav,

On Fri, Sep 13, 2019 at 11:28:20PM +0800, Chao Gao wrote:
> On Fri, Sep 13, 2019 at 10:02:24AM +, Spassov, Stanislav wrote:
> >On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
> >>On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
> >> On 13.12.18 at 04:46,  wrote:
>  On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>  On 12.12.18 at 16:18,  wrote:
> >> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
> >> On 12.12.18 at 08:06,  wrote:
>  On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
> >On 12/5/18 4:32 AM, Roger Pau Monné wrote:
> >> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
> >>> I find some pass-thru devices don't work any more across guest 
> >>> reboot.
> >>> Assigning it to another guest also meets the same issue. And the 
> >>> only
> >>> way to make it work again is un-binding and binding it to pciback.
> >>> Someone reported this issue one year ago [1]. More detail also 
> >>> can be
> >>> found in [2].
> >>>
> >>> The root-cause is Xen's internal MSI-X state isn't reset properly
> >>> during reboot or re-assignment. In the above case, Xen set 
> >>> maskall bit
> >>> to mask all MSI interrupts after it detected a potential security
> >>> issue. Even after device reset, Xen didn't reset its internal 
> >>> maskall
> >>> bit. As a result, maskall bit would be set again in next write to
> >>> MSI-X message control register.
> >>>
> >>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting 
> >>> MSI-X
> >>> internal state of a device, we employ it to fix this issue rather 
> >>> than
> >>> introducing another dedicated sub-hypercall.
> >>>
> >>> Note that PHYSDEVOPS_release_msix() will fail if the mapping 
> >>> between
> >>> the device's msix and pirq has been created. This limitation 
> >>> prevents
> >>> us calling this function when detaching a device from a guest 
> >>> during
> >>> guest shutdown. Thus it is called right before calling
> >>> PHYSDEVOPS_prepare_msix().
> >> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop 
> >> the
> >> () at the end of the hypercall name since it's not a function.
> >>
> >> I'm also wondering why the release can't be done when the device is
> >> detached from the guest (or the guest has been shut down). This 
> >> makes
> >> me worry about the raciness of the attach/detach procedure: if 
> >> there's
> >> a state where pciback assumes the device has been detached from the
> >> guest, but there are still pirqs bound, an attempt to attach to
> >> another guest in such state will fail.
> >
> >I wonder whether this additional reset functionality could be done 
> >out
> >of xen_pcibk_xenbus_remove(). We first do a (best effort) device 
> >reset
> >and then do the extra things that are not properly done there.
>  
>  No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>  the handler of PHYSDEVOP_release_msix. To do a successful Xen 
>  internal
>  MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be 
>  finished
>  without error. But ATM, xen expects that no msi is bound to pirq when
>  doing PHYSDEVOP_release_msix. Otherwise it fails with error code 
>  -EBUSY.
>  However, the expectation isn't guaranteed in 
>  xen_pcibk_xenbus_remove().
>  In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>  at last minute, which happens after device reset in 
>  xen_pcibk_xenbus_remove().
> >>>
> >>>But that may need taking care of: I don't think it is a good idea to 
> >>>have
> >>>anything left from the prior owning domain when the device gets reset.
> >>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
> >>>invoking the reset;
> >> 
> >> Agree. How about pciback to track the established IRQ bindings? Then
> >> pciback can clear irq binding before invoking the reset.
> >
> >How would pciback even know of those mappings, when it's qemu
> >who establishes (and manages) them?
>  
>  I meant to expose some interfaces from pciback. And pciback serves
>  as the proxy of IRQ (un)binding APIs.
> >>>
> >>>If at all possible we should avoid having to change more parties (qemu,
> >>>libxc, kernel, hypervisor) than really necessary. Remember that such
> >>>a bug fix may want backporting, and making sure affected people have
> >>>all relevant components updated is increasingly difficult with their

Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2019-09-13 Thread Chao Gao
On Fri, Sep 13, 2019 at 10:02:24AM +, Spassov, Stanislav wrote:
>On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
>>On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
>> On 13.12.18 at 04:46,  wrote:
 On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
 On 12.12.18 at 16:18,  wrote:
>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>> On 12.12.18 at 08:06,  wrote:
 On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>> I find some pass-thru devices don't work any more across guest 
>>> reboot.
>>> Assigning it to another guest also meets the same issue. And the 
>>> only
>>> way to make it work again is un-binding and binding it to pciback.
>>> Someone reported this issue one year ago [1]. More detail also can 
>>> be
>>> found in [2].
>>>
>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>> during reboot or re-assignment. In the above case, Xen set maskall 
>>> bit
>>> to mask all MSI interrupts after it detected a potential security
>>> issue. Even after device reset, Xen didn't reset its internal 
>>> maskall
>>> bit. As a result, maskall bit would be set again in next write to
>>> MSI-X message control register.
>>>
>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting 
>>> MSI-X
>>> internal state of a device, we employ it to fix this issue rather 
>>> than
>>> introducing another dedicated sub-hypercall.
>>>
>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>> the device's msix and pirq has been created. This limitation 
>>> prevents
>>> us calling this function when detaching a device from a guest during
>>> guest shutdown. Thus it is called right before calling
>>> PHYSDEVOPS_prepare_msix().
>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>> () at the end of the hypercall name since it's not a function.
>>
>> I'm also wondering why the release can't be done when the device is
>> detached from the guest (or the guest has been shut down). This makes
>> me worry about the raciness of the attach/detach procedure: if 
>> there's
>> a state where pciback assumes the device has been detached from the
>> guest, but there are still pirqs bound, an attempt to attach to
>> another guest in such state will fail.
>
>I wonder whether this additional reset functionality could be done out
>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>and then do the extra things that are not properly done there.
 
 No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
 the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
 MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
 without error. But ATM, xen expects that no msi is bound to pirq when
 doing PHYSDEVOP_release_msix. Otherwise it fails with error code 
 -EBUSY.
 However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
 In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
 at last minute, which happens after device reset in 
 xen_pcibk_xenbus_remove().
>>>
>>>But that may need taking care of: I don't think it is a good idea to have
>>>anything left from the prior owning domain when the device gets reset.
>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>invoking the reset;
>> 
>> Agree. How about pciback to track the established IRQ bindings? Then
>> pciback can clear irq binding before invoking the reset.
>
>How would pciback even know of those mappings, when it's qemu
>who establishes (and manages) them?
 
 I meant to expose some interfaces from pciback. And pciback serves
 as the proxy of IRQ (un)binding APIs.
>>>
>>>If at all possible we should avoid having to change more parties (qemu,
>>>libxc, kernel, hypervisor) than really necessary. Remember that such
>>>a bug fix may want backporting, and making sure affected people have
>>>all relevant components updated is increasingly difficult with their
>>>number growing.
>>>
>>>in fact I'd expect this to happen in the course of
>>>domain destruction, and I'd expect the device reset to come after the
>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>stack?
>> 
>> I don't think reversing the sequences of device reset and domain
>> destruction would be simple. 

Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2019-09-13 Thread Spassov, Stanislav
On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
>On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
> On 13.12.18 at 04:46,  wrote:
>>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>> On 12.12.18 at 16:18,  wrote:
> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
> On 12.12.18 at 08:06,  wrote:
>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
On 12/5/18 4:32 AM, Roger Pau Monné wrote:
> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>> I find some pass-thru devices don't work any more across guest 
>> reboot.
>> Assigning it to another guest also meets the same issue. And the only
>> way to make it work again is un-binding and binding it to pciback.
>> Someone reported this issue one year ago [1]. More detail also can be
>> found in [2].
>>
>> The root-cause is Xen's internal MSI-X state isn't reset properly
>> during reboot or re-assignment. In the above case, Xen set maskall 
>> bit
>> to mask all MSI interrupts after it detected a potential security
>> issue. Even after device reset, Xen didn't reset its internal maskall
>> bit. As a result, maskall bit would be set again in next write to
>> MSI-X message control register.
>>
>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting 
>> MSI-X
>> internal state of a device, we employ it to fix this issue rather 
>> than
>> introducing another dedicated sub-hypercall.
>>
>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>> the device's msix and pirq has been created. This limitation prevents
>> us calling this function when detaching a device from a guest during
>> guest shutdown. Thus it is called right before calling
>> PHYSDEVOPS_prepare_msix().
> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
> () at the end of the hypercall name since it's not a function.
>
> I'm also wondering why the release can't be done when the device is
> detached from the guest (or the guest has been shut down). This makes
> me worry about the raciness of the attach/detach procedure: if there's
> a state where pciback assumes the device has been detached from the
> guest, but there are still pirqs bound, an attempt to attach to
> another guest in such state will fail.

I wonder whether this additional reset functionality could be done out
of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
and then do the extra things that are not properly done there.
>>> 
>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>> at last minute, which happens after device reset in 
>>> xen_pcibk_xenbus_remove().
>>
>>But that may need taking care of: I don't think it is a good idea to have
>>anything left from the prior owning domain when the device gets reset.
>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>invoking the reset;
> 
> Agree. How about pciback to track the established IRQ bindings? Then
> pciback can clear irq binding before invoking the reset.

How would pciback even know of those mappings, when it's qemu
who establishes (and manages) them?
>>> 
>>> I meant to expose some interfaces from pciback. And pciback serves
>>> as the proxy of IRQ (un)binding APIs.
>>
>>If at all possible we should avoid having to change more parties (qemu,
>>libxc, kernel, hypervisor) than really necessary. Remember that such
>>a bug fix may want backporting, and making sure affected people have
>>all relevant components updated is increasingly difficult with their
>>number growing.
>>
>>in fact I'd expect this to happen in the course of
>>domain destruction, and I'd expect the device reset to come after the
>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>stack?
> 
> I don't think reversing the sequences of device reset and domain
> destruction would be simple. Furthermore, during device hot-unplug,
> device reset is done when the owner is alive. So if we use domain
> destruction to enforce all irq binding cleared, in theory, it won't be
> applicable to hot-unplug case (if qemu's 

Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2018-12-13 Thread Chao Gao
On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
 On 13.12.18 at 04:46,  wrote:
>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>> On 12.12.18 at 16:18,  wrote:
 On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
 On 12.12.18 at 08:06,  wrote:
>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
 On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
> I find some pass-thru devices don't work any more across guest reboot.
> Assigning it to another guest also meets the same issue. And the only
> way to make it work again is un-binding and binding it to pciback.
> Someone reported this issue one year ago [1]. More detail also can be
> found in [2].
>
> The root-cause is Xen's internal MSI-X state isn't reset properly
> during reboot or re-assignment. In the above case, Xen set maskall bit
> to mask all MSI interrupts after it detected a potential security
> issue. Even after device reset, Xen didn't reset its internal maskall
> bit. As a result, maskall bit would be set again in next write to
> MSI-X message control register.
>
> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
> internal state of a device, we employ it to fix this issue rather than
> introducing another dedicated sub-hypercall.
>
> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
> the device's msix and pirq has been created. This limitation prevents
> us calling this function when detaching a device from a guest during
> guest shutdown. Thus it is called right before calling
> PHYSDEVOPS_prepare_msix().
 s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
 () at the end of the hypercall name since it's not a function.

 I'm also wondering why the release can't be done when the device is
 detached from the guest (or the guest has been shut down). This makes
 me worry about the raciness of the attach/detach procedure: if there's
 a state where pciback assumes the device has been detached from the
 guest, but there are still pirqs bound, an attempt to attach to
 another guest in such state will fail.
>>>
>>>I wonder whether this additional reset functionality could be done out
>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>and then do the extra things that are not properly done there.
>> 
>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>> without error. But ATM, xen expects that no msi is bound to pirq when
>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>> at last minute, which happens after device reset in 
>> xen_pcibk_xenbus_remove().
>
>But that may need taking care of: I don't think it is a good idea to have
>anything left from the prior owning domain when the device gets reset.
>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>invoking the reset;
 
 Agree. How about pciback to track the established IRQ bindings? Then
 pciback can clear irq binding before invoking the reset.
>>>
>>>How would pciback even know of those mappings, when it's qemu
>>>who establishes (and manages) them?
>> 
>> I meant to expose some interfaces from pciback. And pciback serves
>> as the proxy of IRQ (un)binding APIs.
>
>If at all possible we should avoid having to change more parties (qemu,
>libxc, kernel, hypervisor) than really necessary. Remember that such
>a bug fix may want backporting, and making sure affected people have
>all relevant components updated is increasingly difficult with their
>number growing.
>
>in fact I'd expect this to happen in the course of
>domain destruction, and I'd expect the device reset to come after the
>domain was cleaned up. Perhaps simply an ordering issue in the tool
>stack?
 
 I don't think reversing the sequences of device reset and domain
 destruction would be simple. Furthermore, during device hot-unplug,
 device reset is done when the owner is alive. So if we use domain
 destruction to enforce all irq binding cleared, in theory, it won't be
 applicable to hot-unplug case (if qemu's hot-unplug logic is
 compromised).
>>>
>>>Even in the hot-unplug case the tool stack could issue unbind
>>>requests, behind the back of the possibly compromised qemu,
>>>once 

Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2018-12-12 Thread Jan Beulich
>>> On 13.12.18 at 04:46,  wrote:
> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
> On 12.12.18 at 16:18,  wrote:
>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>> On 12.12.18 at 08:06,  wrote:
> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
 I find some pass-thru devices don't work any more across guest reboot.
 Assigning it to another guest also meets the same issue. And the only
 way to make it work again is un-binding and binding it to pciback.
 Someone reported this issue one year ago [1]. More detail also can be
 found in [2].

 The root-cause is Xen's internal MSI-X state isn't reset properly
 during reboot or re-assignment. In the above case, Xen set maskall bit
 to mask all MSI interrupts after it detected a potential security
 issue. Even after device reset, Xen didn't reset its internal maskall
 bit. As a result, maskall bit would be set again in next write to
 MSI-X message control register.

 Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
 internal state of a device, we employ it to fix this issue rather than
 introducing another dedicated sub-hypercall.

 Note that PHYSDEVOPS_release_msix() will fail if the mapping between
 the device's msix and pirq has been created. This limitation prevents
 us calling this function when detaching a device from a guest during
 guest shutdown. Thus it is called right before calling
 PHYSDEVOPS_prepare_msix().
>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>> () at the end of the hypercall name since it's not a function.
>>>
>>> I'm also wondering why the release can't be done when the device is
>>> detached from the guest (or the guest has been shut down). This makes
>>> me worry about the raciness of the attach/detach procedure: if there's
>>> a state where pciback assumes the device has been detached from the
>>> guest, but there are still pirqs bound, an attempt to attach to
>>> another guest in such state will fail.
>>
>>I wonder whether this additional reset functionality could be done out
>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>and then do the extra things that are not properly done there.
> 
> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
> without error. But ATM, xen expects that no msi is bound to pirq when
> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
> at last minute, which happens after device reset in 
> xen_pcibk_xenbus_remove().

But that may need taking care of: I don't think it is a good idea to have
anything left from the prior owning domain when the device gets reset.
I.e. left over IRQ bindings should perhaps be forcibly cleared before
invoking the reset;
>>> 
>>> Agree. How about pciback to track the established IRQ bindings? Then
>>> pciback can clear irq binding before invoking the reset.
>>
>>How would pciback even know of those mappings, when it's qemu
>>who establishes (and manages) them?
> 
> I meant to expose some interfaces from pciback. And pciback serves
> as the proxy of IRQ (un)binding APIs.

If at all possible we should avoid having to change more parties (qemu,
libxc, kernel, hypervisor) than really necessary. Remember that such
a bug fix may want backporting, and making sure affected people have
all relevant components updated is increasingly difficult with their
number growing.

in fact I'd expect this to happen in the course of
domain destruction, and I'd expect the device reset to come after the
domain was cleaned up. Perhaps simply an ordering issue in the tool
stack?
>>> 
>>> I don't think reversing the sequences of device reset and domain
>>> destruction would be simple. Furthermore, during device hot-unplug,
>>> device reset is done when the owner is alive. So if we use domain
>>> destruction to enforce all irq binding cleared, in theory, it won't be
>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>>> compromised).
>>
>>Even in the hot-unplug case the tool stack could issue unbind
>>requests, behind the back of the possibly compromised qemu,
>>once neither the guest nor qemu have access to the device
>>anymore.
> 
> But currently, tool stack doesn't know the remaining IRQ bindings.
> If tool stack can 

Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2018-12-12 Thread Chao Gao
On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
 On 12.12.18 at 16:18,  wrote:
>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>> On 12.12.18 at 08:06,  wrote:
 On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>> I find some pass-thru devices don't work any more across guest reboot.
>>> Assigning it to another guest also meets the same issue. And the only
>>> way to make it work again is un-binding and binding it to pciback.
>>> Someone reported this issue one year ago [1]. More detail also can be
>>> found in [2].
>>>
>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>> to mask all MSI interrupts after it detected a potential security
>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>> bit. As a result, maskall bit would be set again in next write to
>>> MSI-X message control register.
>>>
>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>> internal state of a device, we employ it to fix this issue rather than
>>> introducing another dedicated sub-hypercall.
>>>
>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>> the device's msix and pirq has been created. This limitation prevents
>>> us calling this function when detaching a device from a guest during
>>> guest shutdown. Thus it is called right before calling
>>> PHYSDEVOPS_prepare_msix().
>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>> () at the end of the hypercall name since it's not a function.
>>
>> I'm also wondering why the release can't be done when the device is
>> detached from the guest (or the guest has been shut down). This makes
>> me worry about the raciness of the attach/detach procedure: if there's
>> a state where pciback assumes the device has been detached from the
>> guest, but there are still pirqs bound, an attempt to attach to
>> another guest in such state will fail.
>
>I wonder whether this additional reset functionality could be done out
>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>and then do the extra things that are not properly done there.
 
 No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
 the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
 MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
 without error. But ATM, xen expects that no msi is bound to pirq when
 doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
 However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
 In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
 at last minute, which happens after device reset in 
 xen_pcibk_xenbus_remove().
>>>
>>>But that may need taking care of: I don't think it is a good idea to have
>>>anything left from the prior owning domain when the device gets reset.
>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>invoking the reset;
>> 
>> Agree. How about pciback to track the established IRQ bindings? Then
>> pciback can clear irq binding before invoking the reset.
>
>How would pciback even know of those mappings, when it's qemu
>who establishes (and manages) them?

I meant to expose some interfaces from pciback. And pciback serves
as the proxy of IRQ (un)binding APIs.

>
>>>in fact I'd expect this to happen in the course of
>>>domain destruction, and I'd expect the device reset to come after the
>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>stack?
>> 
>> I don't think reversing the sequences of device reset and domain
>> destruction would be simple. Furthermore, during device hot-unplug,
>> device reset is done when the owner is alive. So if we use domain
>> destruction to enforce all irq binding cleared, in theory, it won't be
>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>> compromised).
>
>Even in the hot-unplug case the tool stack could issue unbind
>requests, behind the back of the possibly compromised qemu,
>once neither the guest nor qemu have access to the device
>anymore.

But currently, tool stack doesn't know the remaining IRQ bindings.
If tool stack can maintaine IRQ binding information of a pass-thru
device (stored in Xenstore?), we can come up with a clean solution
without modifying linux kernel and Xen.

Thanks
Chao


Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2018-12-12 Thread Jan Beulich
>>> On 12.12.18 at 16:18,  wrote:
> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
> On 12.12.18 at 08:06,  wrote:
>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
On 12/5/18 4:32 AM, Roger Pau Monné wrote:
> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>> I find some pass-thru devices don't work any more across guest reboot.
>> Assigning it to another guest also meets the same issue. And the only
>> way to make it work again is un-binding and binding it to pciback.
>> Someone reported this issue one year ago [1]. More detail also can be
>> found in [2].
>>
>> The root-cause is Xen's internal MSI-X state isn't reset properly
>> during reboot or re-assignment. In the above case, Xen set maskall bit
>> to mask all MSI interrupts after it detected a potential security
>> issue. Even after device reset, Xen didn't reset its internal maskall
>> bit. As a result, maskall bit would be set again in next write to
>> MSI-X message control register.
>>
>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>> internal state of a device, we employ it to fix this issue rather than
>> introducing another dedicated sub-hypercall.
>>
>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>> the device's msix and pirq has been created. This limitation prevents
>> us calling this function when detaching a device from a guest during
>> guest shutdown. Thus it is called right before calling
>> PHYSDEVOPS_prepare_msix().
> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
> () at the end of the hypercall name since it's not a function.
>
> I'm also wondering why the release can't be done when the device is
> detached from the guest (or the guest has been shut down). This makes
> me worry about the raciness of the attach/detach procedure: if there's
> a state where pciback assumes the device has been detached from the
> guest, but there are still pirqs bound, an attempt to attach to
> another guest in such state will fail.

I wonder whether this additional reset functionality could be done out
of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
and then do the extra things that are not properly done there.
>>> 
>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>> at last minute, which happens after device reset in 
>>> xen_pcibk_xenbus_remove().
>>
>>But that may need taking care of: I don't think it is a good idea to have
>>anything left from the prior owning domain when the device gets reset.
>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>invoking the reset;
> 
> Agree. How about pciback to track the established IRQ bindings? Then
> pciback can clear irq binding before invoking the reset.

How would pciback even know of those mappings, when it's qemu
who establishes (and manages) them?

>>in fact I'd expect this to happen in the course of
>>domain destruction, and I'd expect the device reset to come after the
>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>stack?
> 
> I don't think reversing the sequences of device reset and domain
> destruction would be simple. Furthermore, during device hot-unplug,
> device reset is done when the owner is alive. So if we use domain
> destruction to enforce all irq binding cleared, in theory, it won't be
> applicable to hot-unplug case (if qemu's hot-unplug logic is
> compromised).

Even in the hot-unplug case the tool stack could issue unbind
requests, behind the back of the possibly compromised qemu,
once neither the guest nor qemu have access to the device
anymore.

Jan



Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2018-12-12 Thread Chao Gao
On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
 On 12.12.18 at 08:06,  wrote:
>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
 On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
> I find some pass-thru devices don't work any more across guest reboot.
> Assigning it to another guest also meets the same issue. And the only
> way to make it work again is un-binding and binding it to pciback.
> Someone reported this issue one year ago [1]. More detail also can be
> found in [2].
>
> The root-cause is Xen's internal MSI-X state isn't reset properly
> during reboot or re-assignment. In the above case, Xen set maskall bit
> to mask all MSI interrupts after it detected a potential security
> issue. Even after device reset, Xen didn't reset its internal maskall
> bit. As a result, maskall bit would be set again in next write to
> MSI-X message control register.
>
> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
> internal state of a device, we employ it to fix this issue rather than
> introducing another dedicated sub-hypercall.
>
> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
> the device's msix and pirq has been created. This limitation prevents
> us calling this function when detaching a device from a guest during
> guest shutdown. Thus it is called right before calling
> PHYSDEVOPS_prepare_msix().
 s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
 () at the end of the hypercall name since it's not a function.

 I'm also wondering why the release can't be done when the device is
 detached from the guest (or the guest has been shut down). This makes
 me worry about the raciness of the attach/detach procedure: if there's
 a state where pciback assumes the device has been detached from the
 guest, but there are still pirqs bound, an attempt to attach to
 another guest in such state will fail.
>>>
>>>I wonder whether this additional reset functionality could be done out
>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>and then do the extra things that are not properly done there.
>> 
>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>> without error. But ATM, xen expects that no msi is bound to pirq when
>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>> at last minute, which happens after device reset in 
>> xen_pcibk_xenbus_remove().
>
>But that may need taking care of: I don't think it is a good idea to have
>anything left from the prior owning domain when the device gets reset.
>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>invoking the reset;

Agree. How about pciback to track the established IRQ bindings? Then
pciback can clear irq binding before invoking the reset.

>in fact I'd expect this to happen in the course of
>domain destruction, and I'd expect the device reset to come after the
>domain was cleaned up. Perhaps simply an ordering issue in the tool
>stack?

I don't think reversing the sequences of device reset and domain
destruction would be simple. Furthermore, during device hot-unplug,
device reset is done when the owner is alive. So if we use domain
destruction to enforce all irq binding cleared, in theory, it won't be
applicable to hot-unplug case (if qemu's hot-unplug logic is
compromised).

Thanks
Chao


Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2018-12-12 Thread Jan Beulich
>>> On 12.12.18 at 08:06,  wrote:
> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
 I find some pass-thru devices don't work any more across guest reboot.
 Assigning it to another guest also meets the same issue. And the only
 way to make it work again is un-binding and binding it to pciback.
 Someone reported this issue one year ago [1]. More detail also can be
 found in [2].

 The root-cause is Xen's internal MSI-X state isn't reset properly
 during reboot or re-assignment. In the above case, Xen set maskall bit
 to mask all MSI interrupts after it detected a potential security
 issue. Even after device reset, Xen didn't reset its internal maskall
 bit. As a result, maskall bit would be set again in next write to
 MSI-X message control register.

 Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
 internal state of a device, we employ it to fix this issue rather than
 introducing another dedicated sub-hypercall.

 Note that PHYSDEVOPS_release_msix() will fail if the mapping between
 the device's msix and pirq has been created. This limitation prevents
 us calling this function when detaching a device from a guest during
 guest shutdown. Thus it is called right before calling
 PHYSDEVOPS_prepare_msix().
>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>> () at the end of the hypercall name since it's not a function.
>>>
>>> I'm also wondering why the release can't be done when the device is
>>> detached from the guest (or the guest has been shut down). This makes
>>> me worry about the raciness of the attach/detach procedure: if there's
>>> a state where pciback assumes the device has been detached from the
>>> guest, but there are still pirqs bound, an attempt to attach to
>>> another guest in such state will fail.
>>
>>I wonder whether this additional reset functionality could be done out
>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>and then do the extra things that are not properly done there.
> 
> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
> without error. But ATM, xen expects that no msi is bound to pirq when
> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
> at last minute, which happens after device reset in 
> xen_pcibk_xenbus_remove().

But that may need taking care of: I don't think it is a good idea to have
anything left from the prior owning domain when the device gets reset.
I.e. left over IRQ bindings should perhaps be forcibly cleared before
invoking the reset; in fact I'd expect this to happen in the course of
domain destruction, and I'd expect the device reset to come after the
domain was cleaned up. Perhaps simply an ordering issue in the tool
stack?

Jan



Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device

2018-12-11 Thread Chao Gao
On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>> I find some pass-thru devices don't work any more across guest reboot.
>>> Assigning it to another guest also meets the same issue. And the only
>>> way to make it work again is un-binding and binding it to pciback.
>>> Someone reported this issue one year ago [1]. More detail also can be
>>> found in [2].
>>>
>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>> to mask all MSI interrupts after it detected a potential security
>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>> bit. As a result, maskall bit would be set again in next write to
>>> MSI-X message control register.
>>>
>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>> internal state of a device, we employ it to fix this issue rather than
>>> introducing another dedicated sub-hypercall.
>>>
>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>> the device's msix and pirq has been created. This limitation prevents
>>> us calling this function when detaching a device from a guest during
>>> guest shutdown. Thus it is called right before calling
>>> PHYSDEVOPS_prepare_msix().
>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>> () at the end of the hypercall name since it's not a function.
>>
>> I'm also wondering why the release can't be done when the device is
>> detached from the guest (or the guest has been shut down). This makes
>> me worry about the raciness of the attach/detach procedure: if there's
>> a state where pciback assumes the device has been detached from the
>> guest, but there are still pirqs bound, an attempt to attach to
>> another guest in such state will fail.
>
>I wonder whether this additional reset functionality could be done out
>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>and then do the extra things that are not properly done there.

No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
without error. But ATM, xen expects that no msi is bound to pirq when
doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
at last minute, which happens after device reset in xen_pcibk_xenbus_remove().

Thanks
Chao