Hi Jan,

Jan Beulich <[email protected]> writes:

> On 05.07.2023 10:59, Roger Pau Monné wrote:
>> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>>> I am currently implementing your proposal (along with Jan's
>>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>>> reassign_device() call, which has this piece of code:
>>>>
>>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>>
>>>> My immediate change was:
>>>>
>>>>         write_lock(&pdev->domain->pci_lock);
>>>>         list_del(&pdev->domain_list);
>>>>         write_unlock(&pdev->domain->pci_lock);
>>>>
>>>>         write_lock(&target->pci_lock);
>>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>>         write_unlock(&target->pci_lock);
>>>>
>>>> But this will not work because reassign_device is called from
>>>> pci_release_devices() which iterates over d->pdev_list, so we need to
>>>> take a d->pci_lock early.
>>>>
>>>> Any suggestions on how to fix this? My idea is to remove a device from a
>>>> list one at time:
>>>>
>>>> int pci_release_devices(struct domain *d)
>>>> {
>>>>     struct pci_dev *pdev;
>>>>     u8 bus, devfn;
>>>>     int ret;
>>>>
>>>>     pcidevs_lock();
>>>>     write_lock(&d->pci_lock);
>>>>     ret = arch_pci_clean_pirqs(d);
>>>>     if ( ret )
>>>>     {
>>>>         pcidevs_unlock();
>>>>         write_unlock(&d->pci_lock);
>>>>         return ret;
>>>>     }
>>>>
>>>>     while ( !list_empty(&d->pdev_list) )
>>>>     {
>>>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>         bus = pdev->bus;
>>>>         devfn = pdev->devfn;
>>>>         list_del(&pdev->domain_list);
>>>>         write_unlock(&d->pci_lock);
>>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>>         write_lock(&d->pci_lock);
>>>
>>> I think it needs doing almost like this, but with two more tweaks and
>>> no list_del() right here (first and foremost to avoid needing to
>>> figure whether removing early isn't going to subtly break anything;
>>> see below for an error case that would end up with changed behavior):
>>>
>>>     while ( !list_empty(&d->pdev_list) )
>>>     {
>>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>         uint16_t seg = pdev->seg;
>>>         uint8_t bus = pdev->bus;
>>>         uint8_t devfn = pdev->devfn;
>>>
>>>         write_unlock(&d->pci_lock);
>> 
>> I think you need to remove the device from the pdev_list before
>> dropping the lock, or else release could race with other operations.
>> 
>> That's unlikely, but still if the lock is dropped and the routine
>> needs to operate on the device it is better remove such device from
>> the domain so other operations cannot get a reference to it.
>> 
>> Otherwise you could modify reassign_device() implementations so they
>> require the caller to hold the source->pci_lock when calling the
>> routine, but that's ugly because the lock would need to be dropped in
>> order to reassign the device from source to target domains.
>> 
>> Another option would be to move the whole d->pdev_list to a local
>> variable (so that d->pdev_list would be empty) and then iterate over
>> it without the d->pci_lock.  On failure you would take the lock and
>> add the failing device back into d->pdev_list.
>
> Conceptually I like this last variant, but like the individual list_del()
> it requires auditing code for no dependency on the device still being on
> that list. In fact deassign_device()'s use of pci_get_pdev() does. The
> function would then need changing to have struct pci_dev * passed in.
> Yet who knows where else there are uses of pci_get_pdev() lurking.

Okay, so I changed the deassign_device() signature and reworked the loop
in pci_release_devices() as follows:

    INIT_LIST_HEAD(&tmp_list);
    /* Move all entries to tmp_list, so we can drop d->pci_lock */
    list_splice_init(&d->pdev_list, &tmp_list);
    write_unlock(&d->pci_lock);

    list_for_each_entry_safe ( pdev, tmp, &tmp_list, domain_list )
    {
        rc = deassign_device(d, pdev);
        if ( rc )
        {
            /* Return device back to the domain list */
            write_lock(&d->pci_lock);
            list_add(&pdev->domain_list, &d->pdev_list);
            write_unlock(&d->pci_lock);
            func_ret = rc;
        }
    }


Also, I checked all pci_get_pdev() calls and found that a struct
domain (the first parameter) is passed in only a handful of places:

*** xen/drivers/vpci/vpci.c:
vpci_read[504]                 pdev = pci_get_pdev(d, sbdf);
vpci_read[506]                 pdev = pci_get_pdev(dom_xen, sbdf);
vpci_write[621]                pdev = pci_get_pdev(d, sbdf);
vpci_write[623]                pdev = pci_get_pdev(dom_xen, sbdf);

*** xen/arch/x86/irq.c:
map_domain_pirq[2166]          pdev = pci_get_pdev(d, msi->sbdf);

*** xen/drivers/passthrough/pci.c:
XEN_GUEST_HANDLE_PARAM[1712]   pdev = pci_get_pdev(d, machine_sbdf);

The last one is due to my change to the deassign_device() signature.

==============================

d->pdev_list is accessed in the following places:

*** xen/drivers/passthrough/amd/pci_amd_iommu.c:
reassign_device[489]           list_add(&pdev->domain_list, &target->pdev_list);

*** xen/drivers/passthrough/pci.c:
_pci_hide_device[463]          list_add(&pdev->domain_list, &dom_xen->pdev_list);
pci_get_pdev[561]              list_for_each_entry ( pdev, &d->pdev_list, domain_list )
pci_add_device[759]            list_add(&pdev->domain_list, &hardware_domain->pdev_list);
pci_release_devices[917]       list_splice_init(&d->pdev_list, &tmp_list);
pci_release_devices[922]       pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
pci_release_devices[928]       list_add(&pdev->domain_list, &d->pdev_list);
_setup_hwdom_pci_devices[1155] list_add(&pdev->domain_list, &ctxt->d->pdev_list);

*** xen/drivers/passthrough/vtd/iommu.c:
reassign_device_ownership[2819] list_add(&pdev->domain_list, &target->pdev_list);

*** xen/include/xen/pci.h:
for_each_pdev[149]             list_for_each_entry(pdev, &(domain)->pdev_list, domain_list)
has_arch_pdevs[151]            #define has_arch_pdevs(d) (!list_empty(&(d)->pdev_list))

==============================

And has_arch_pdevs() is used in the following places:

*** xen/arch/x86/hvm/hvm.c:
hvm_set_cr0[2388]              has_arch_pdevs(d)) )

*** xen/arch/x86/hvm/vmx/vmcs.c:
vmx_do_resume[1892]            if ( has_arch_pdevs(v->domain) && !iommu_snoop

*** xen/arch/x86/mm.c:
l1_disallow_mask[172]          !has_arch_pdevs(d) && \

*** xen/arch/x86/mm/p2m-pod.c:
p2m_pod_set_mem_target[352]    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
guest_physmap_mark_populate_on_demand[1404] if ( has_arch_pdevs(d) || cache_flush_permitted(d) )

*** xen/arch/x86/mm/paging.c:
paging_log_dirty_enable[208]   if ( has_arch_pdevs(d) )

*** xen/drivers/passthrough/vtd/iommu.c:
reassign_device_ownership[2773] if ( !has_arch_pdevs(target) )
reassign_device_ownership[2807] if ( !has_arch_pdevs(target) )
reassign_device_ownership[2825] if ( !has_arch_pdevs(source) )


has_arch_pdevs() bothers me the most, actually, because it is not always
obvious how to add locking at its call sites. I am planning to rework it
in the following way:

#define has_arch_pdevs_unlocked(d) (!list_empty(&(d)->pdev_list))

static inline bool has_arch_pdevs(struct domain *d)
{
    bool ret;

    read_lock(&d->pci_lock);
    ret = has_arch_pdevs_unlocked(d);
    read_unlock(&d->pci_lock);

    return ret;
}

And then use the appropriate macro/function depending on the circumstances.


-- 
WBR, Volodymyr
