Re: [PATCH v5 6/6] VT-d: support the device IOTLB
David Woodhouse wrote: On Mon, 2009-05-18 at 13:51 +0800, Yu Zhao wrote:

> @@ -965,6 +1037,8 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
> 	else
> 		iommu->flush.flush_iotlb(iommu, did, addr, mask, DMA_TLB_PSI_FLUSH);
> +	if (did)
> +		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
>  }
>
>  static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)

Hm, why 'if (did)'? Domain ID zero is only special in caching mode. Should it be:

	if (!cap_caching_mode(iommu->cap) || did)

?

Yes, you are right. Domain ID 0 is only reserved for caching mode. Will send a fix for this. Thanks!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 resend 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
David Woodhouse wrote: On Thu, 2009-05-14 at 10:32 +0800, Yu Zhao wrote:

> Make iommu_flush_iotlb_psi() and flush_unmaps() more readable.

This doesn't apply any more.

Sorry, I'll rebase those patches and post them again.
[PATCH v5 0/6] ATS capability support for Intel IOMMU
This patch series implements Address Translation Service support for the Intel IOMMU. A PCIe Endpoint that supports the ATS capability can request DMA address translations from the IOMMU and cache them itself. This can alleviate IOMMU TLB pressure and improve hardware performance in the I/O virtualization environment.

ATS is one of the PCI-SIG I/O Virtualization (IOV) Specifications. The spec can be found at: http://www.pcisig.com/specifications/iov/ats/ (it requires membership).

Changelog:
v4 -> v5:
  1, rebase to the latest IOMMU tree

v3 -> v4:
  1, coding style fixes (Grant Grundler)
  2, support the Virtual Function ATS capability

v2 -> v3:
  1, throw error message if VT-d hardware detects invalid descriptor on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS Invalidate Queue Depth (Matthew Wilcox)

v1 -> v2:
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)

Yu Zhao (6):
  PCI: support the ATS capability
  PCI: handle Virtual Function ATS enabling
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c            | 189 ++---
 drivers/pci/intel-iommu.c     | 141 +--
 drivers/pci/iov.c             | 155 --
 drivers/pci/pci.h             |  39 +
 include/linux/dma_remapping.h |   1 +
 include/linux/dmar.h          |   9 ++
 include/linux/intel-iommu.h   |  16 +++-
 include/linux/pci.h           |   2 +
 include/linux/pci_regs.h      |  10 ++
 9 files changed, 514 insertions(+), 48 deletions(-)
[PATCH v5 1/6] PCI: support the ATS capability
The PCIe ATS capability makes the Endpoint able to request DMA address translations from the IOMMU and cache them on the device side, thus alleviating IOMMU pressure and improving hardware performance in the I/O virtualization environment.

Signed-off-by: Yu Zhao yu.z...@intel.com
Acked-by: Jesse Barnes jbar...@virtuousgeek.org
---
 drivers/pci/iov.c        | 105 ++
 drivers/pci/pci.h        |  37 
 include/linux/pci.h      |   2 +
 include/linux/pci_regs.h |  10 
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b497daa..0a7a1b4 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -5,6 +5,7 @@
  *
  * PCI Express I/O Virtualization (IOV) support.
  *   Single Root IOV 1.0
+ *   Address Translation Service 1.0
  */
 
 #include <linux/pci.h>
@@ -679,3 +680,107 @@ irqreturn_t pci_sriov_migration(struct pci_dev *dev)
 	return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_migration);
+
+static int ats_alloc_one(struct pci_dev *dev, int ps)
+{
+	int pos;
+	u16 cap;
+	struct pci_ats *ats;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+	if (!pos)
+		return -ENODEV;
+
+	ats = kzalloc(sizeof(*ats), GFP_KERNEL);
+	if (!ats)
+		return -ENOMEM;
+
+	ats->pos = pos;
+	ats->stu = ps;
+	pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+	ats->qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+					    PCI_ATS_MAX_QDEP;
+	dev->ats = ats;
+
+	return 0;
+}
+
+static void ats_free_one(struct pci_dev *dev)
+{
+	kfree(dev->ats);
+	dev->ats = NULL;
+}
+
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @ps: the IOMMU page shift
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_ats(struct pci_dev *dev, int ps)
+{
+	int rc;
+	u16 ctrl;
+
+	BUG_ON(dev->ats);
+
+	if (ps < PCI_ATS_MIN_STU)
+		return -EINVAL;
+
+	rc = ats_alloc_one(dev, ps);
+	if (rc)
+		return rc;
+
+	ctrl = PCI_ATS_CTRL_ENABLE;
+	ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+	return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+	u16 ctrl;
+
+	BUG_ON(!dev->ats);
+
+	pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
+	ctrl &= ~PCI_ATS_CTRL_ENABLE;
+	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+	ats_free_one(dev);
+}
+
+/**
+ * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or negative on failure.
+ *
+ * The ATS spec uses 0 in the Invalidate Queue Depth field to
+ * indicate that the function can accept 32 Invalidate Requests.
+ * But here we use the `real' values (i.e. 1~32) for the Queue
+ * Depth.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+	int pos;
+	u16 cap;
+
+	if (dev->ats)
+		return dev->ats->qdep;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+	if (!pos)
+		return -ENODEV;
+
+	pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+
+	return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+				       PCI_ATS_MAX_QDEP;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d03f6b9..3c2ec64 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -229,6 +229,13 @@ struct pci_sriov {
 	u8 __iomem *mstate;	/* VF Migration State Array */
 };
 
+/* Address Translation Service */
+struct pci_ats {
+	int pos;	/* capability position */
+	int stu;	/* Smallest Translation Unit */
+	int qdep;	/* Invalidate Queue Depth */
+};
+
 #ifdef CONFIG_PCI_IOV
 extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
@@ -236,6 +243,20 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 				enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
 extern int pci_iov_bus_range(struct pci_bus *bus);
+
+extern int pci_enable_ats(struct pci_dev *dev, int ps);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+	return !!dev->ats;
+}
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -257,6 +278,22 @@ static inline int pci_iov_bus_range(struct pci_bus *bus
[PATCH v5 2/6] PCI: handle Virtual Function ATS enabling
The SR-IOV spec requires that the Smallest Translation Unit and the Invalidate Queue Depth fields in the Virtual Function ATS capability are hardwired to 0. If a function is a Virtual Function, check and set its Physical Function's STU before enabling the ATS.

Signed-off-by: Yu Zhao yu.z...@intel.com
Acked-by: Jesse Barnes jbar...@virtuousgeek.org
---
 drivers/pci/iov.c | 66 +---
 drivers/pci/pci.h |  4 ++-
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0a7a1b4..4151404 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -491,10 +491,10 @@ found:
 	if (pdev)
 		iov->dev = pci_dev_get(pdev);
-	else {
+	else
 		iov->dev = dev;
-		mutex_init(&iov->lock);
-	}
+
+	mutex_init(&iov->lock);
 
 	dev->sriov = iov;
 	dev->is_physfn = 1;
@@ -514,11 +514,11 @@ static void sriov_release(struct pci_dev *dev)
 {
 	BUG_ON(dev->sriov->nr_virtfn);
 
-	if (dev == dev->sriov->dev)
-		mutex_destroy(&dev->sriov->lock);
-	else
+	if (dev != dev->sriov->dev)
 		pci_dev_put(dev->sriov->dev);
 
+	mutex_destroy(&dev->sriov->lock);
+
 	kfree(dev->sriov);
 	dev->sriov = NULL;
 }
@@ -723,19 +723,40 @@ int pci_enable_ats(struct pci_dev *dev, int ps)
 	int rc;
 	u16 ctrl;
 
-	BUG_ON(dev->ats);
+	BUG_ON(dev->ats && dev->ats->is_enabled);
 
 	if (ps < PCI_ATS_MIN_STU)
 		return -EINVAL;
 
-	rc = ats_alloc_one(dev, ps);
-	if (rc)
-		return rc;
+	if (dev->is_physfn || dev->is_virtfn) {
+		struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+		mutex_lock(&pdev->sriov->lock);
+		if (pdev->ats)
+			rc = pdev->ats->stu == ps ? 0 : -EINVAL;
+		else
+			rc = ats_alloc_one(pdev, ps);
+
+		if (!rc)
+			pdev->ats->ref_cnt++;
+		mutex_unlock(&pdev->sriov->lock);
+		if (rc)
+			return rc;
+	}
+
+	if (!dev->is_physfn) {
+		rc = ats_alloc_one(dev, ps);
+		if (rc)
+			return rc;
+	}
 
 	ctrl = PCI_ATS_CTRL_ENABLE;
-	ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+	if (!dev->is_virtfn)
+		ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
 	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
+	dev->ats->is_enabled = 1;
+
 	return 0;
 }
@@ -747,13 +768,26 @@ void pci_disable_ats(struct pci_dev *dev)
 {
 	u16 ctrl;
 
-	BUG_ON(!dev->ats);
+	BUG_ON(!dev->ats || !dev->ats->is_enabled);
 
 	pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
 	ctrl &= ~PCI_ATS_CTRL_ENABLE;
 	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
-	ats_free_one(dev);
+	dev->ats->is_enabled = 0;
+
+	if (dev->is_physfn || dev->is_virtfn) {
+		struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+		mutex_lock(&pdev->sriov->lock);
+		pdev->ats->ref_cnt--;
+		if (!pdev->ats->ref_cnt)
+			ats_free_one(pdev);
+		mutex_unlock(&pdev->sriov->lock);
+	}
+
+	if (!dev->is_physfn)
+		ats_free_one(dev);
 }
 
 /**
@@ -765,13 +799,17 @@ void pci_disable_ats(struct pci_dev *dev)
  * The ATS spec uses 0 in the Invalidate Queue Depth field to
  * indicate that the function can accept 32 Invalidate Requests.
  * But here we use the `real' values (i.e. 1~32) for the Queue
- * Depth.
+ * Depth; and 0 indicates the function shares the Queue with
+ * other functions (doesn't exclusively own a Queue).
  */
 int pci_ats_queue_depth(struct pci_dev *dev)
 {
 	int pos;
 	u16 cap;
 
+	if (dev->is_virtfn)
+		return 0;
+
 	if (dev->ats)
 		return dev->ats->qdep;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3c2ec64..f73bcbe 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -234,6 +234,8 @@ struct pci_ats {
 	int pos;	/* capability position */
 	int stu;	/* Smallest Translation Unit */
 	int qdep;	/* Invalidate Queue Depth */
+	int ref_cnt;	/* Physical Function reference count */
+	int is_enabled:1;	/* Enable bit is set */
 };
 
 #ifdef CONFIG_PCI_IOV
@@ -255,7 +257,7 @@ extern int pci_ats_queue_depth(struct pci_dev *dev);
  */
 static inline int pci_ats_enabled(struct pci_dev *dev)
 {
-	return !!dev->ats;
+	return dev->ats && dev->ats->is_enabled;
 }
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
-- 
1.5.6.4
[PATCH v5 3/6] VT-d: parse ATSR in DMA Remapping Reporting Structure
Parse the Root Port ATS Capability Reporting Structure in the DMA Remapping Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c          | 112 ++--
 include/linux/dmar.h        |   9 
 include/linux/intel-iommu.h |   1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f23460a..6d7f961 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -267,6 +267,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
 	}
 	return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+	struct acpi_dmar_atsr *atsr;
+	struct dmar_atsr_unit *atsru;
+
+	atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+	atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+	if (!atsru)
+		return -ENOMEM;
+
+	atsru->hdr = hdr;
+	atsru->include_all = atsr->flags & 0x1;
+
+	list_add(&atsru->list, &dmar_atsr_units);
+
+	return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+	int rc;
+	struct acpi_dmar_atsr *atsr;
+
+	if (atsru->include_all)
+		return 0;
+
+	atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+	rc = dmar_parse_dev_scope((void *)(atsr + 1),
+				  (void *)atsr + atsr->header.length,
+				  &atsru->devices_cnt, &atsru->devices,
+				  atsr->segment);
+	if (rc || !atsru->devices_cnt) {
+		list_del(&atsru->list);
+		kfree(atsru);
+	}
+
+	return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+	int i;
+	struct pci_bus *bus;
+	struct acpi_dmar_atsr *atsr;
+	struct dmar_atsr_unit *atsru;
+
+	list_for_each_entry(atsru, &dmar_atsr_units, list) {
+		atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+		if (atsr->segment == pci_domain_nr(dev->bus))
+			goto found;
+	}
+
+	return 0;
+
+found:
+	for (bus = dev->bus; bus; bus = bus->parent) {
+		struct pci_dev *bridge = bus->self;
+
+		if (!bridge || !bridge->is_pcie ||
+		    bridge->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+			return 0;
+
+		if (bridge->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+			for (i = 0; i < atsru->devices_cnt; i++)
+				if (atsru->devices[i] == bridge)
+					return 1;
+			break;
+		}
+	}
+
+	if (atsru->include_all)
+		return 1;
+
+	return 0;
+}
 #endif
 
 static void __init
@@ -274,22 +352,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header)
 {
 	struct acpi_dmar_hardware_unit *drhd;
 	struct acpi_dmar_reserved_memory *rmrr;
+	struct acpi_dmar_atsr *atsr;
 
 	switch (header->type) {
 	case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-		drhd = (struct acpi_dmar_hardware_unit *)header;
+		drhd = container_of(header, struct acpi_dmar_hardware_unit,
+				    header);
 		printk(KERN_INFO PREFIX
-		       "DRHD (flags: 0x%08x)base: 0x%016Lx\n",
-		       drhd->flags, (unsigned long long)drhd->address);
+		       "DRHD base: %#016Lx flags: %#x\n",
+		       (unsigned long long)drhd->address, drhd->flags);
 		break;
 	case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-		rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+		rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+				    header);
 		printk(KERN_INFO PREFIX
-		       "RMRR base: 0x%016Lx end: 0x%016Lx\n",
+		       "RMRR base: %#016Lx end: %#016Lx\n",
 		       (unsigned long long)rmrr->base_address,
 		       (unsigned long long)rmrr->end_address);
 		break;
+	case ACPI_DMAR_TYPE_ATSR:
+		atsr = container_of(header, struct acpi_dmar_atsr, header);
+		printk(KERN_INFO PREFIX "ATSR flags: %#x\n", atsr->flags);
+		break;
 	}
 }
@@ -363,6 +447,11 @@ parse_dmar_table(void)
 			ret = dmar_parse_one_rmrr(entry_header);
 #endif
 			break;
+		case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+			ret = dmar_parse_one_atsr(entry_header);
+#endif
+			break;
 		default:
 			printk(KERN_WARNING PREFIX
 			       "Unknown DMAR structure type\n");
@@ -431,11 +520,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
 	{
 		struct
[PATCH v5 4/6] VT-d: add device IOTLB invalidation support
Support device IOTLB invalidation to flush the translations cached in the Endpoint.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c          | 77 ++
 include/linux/intel-iommu.h | 14 +++-
 2 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index 6d7f961..7b287cb 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -699,7 +699,8 @@ void free_iommu(struct intel_iommu *iommu)
  */
 static inline void reclaim_free_desc(struct q_inval *qi)
 {
-	while (qi->desc_status[qi->free_tail] == QI_DONE) {
+	while (qi->desc_status[qi->free_tail] == QI_DONE ||
+	       qi->desc_status[qi->free_tail] == QI_ABORT) {
 		qi->desc_status[qi->free_tail] = QI_FREE;
 		qi->free_tail = (qi->free_tail + 1) % QI_LENGTH;
 		qi->free_cnt++;
@@ -709,10 +710,13 @@ static inline void reclaim_free_desc(struct q_inval *qi)
 static int qi_check_fault(struct intel_iommu *iommu, int index)
 {
 	u32 fault;
-	int head;
+	int head, tail;
 	struct q_inval *qi = iommu->qi;
 	int wait_index = (index + 1) % QI_LENGTH;
 
+	if (qi->desc_status[wait_index] == QI_ABORT)
+		return -EAGAIN;
+
 	fault = readl(iommu->reg + DMAR_FSTS_REG);
 
 	/*
@@ -722,7 +726,11 @@ static int qi_check_fault(struct intel_iommu *iommu, int index)
 	 */
 	if (fault & DMA_FSTS_IQE) {
 		head = readl(iommu->reg + DMAR_IQH_REG);
-		if ((head >> 4) == index) {
+		if ((head >> DMAR_IQ_SHIFT) == index) {
+			printk(KERN_ERR "VT-d detected invalid descriptor: "
+			       "low=%llx, high=%llx\n",
+			       (unsigned long long)qi->desc[index].low,
+			       (unsigned long long)qi->desc[index].high);
 			memcpy(&qi->desc[index], &qi->desc[wait_index],
 			       sizeof(struct qi_desc));
 			__iommu_flush_cache(iommu, &qi->desc[index],
@@ -732,6 +740,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int index)
 		}
 	}
 
+	/*
+	 * If ITE happens, all pending wait_desc commands are aborted.
+	 * No new descriptors are fetched until the ITE is cleared.
+	 */
+	if (fault & DMA_FSTS_ITE) {
+		head = readl(iommu->reg + DMAR_IQH_REG);
+		head = ((head >> DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+		head |= 1;
+		tail = readl(iommu->reg + DMAR_IQT_REG);
+		tail = ((tail >> DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+
+		writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
+
+		do {
+			if (qi->desc_status[head] == QI_IN_USE)
+				qi->desc_status[head] = QI_ABORT;
+			head = (head - 2 + QI_LENGTH) % QI_LENGTH;
+		} while (head != tail);
+
+		if (qi->desc_status[wait_index] == QI_ABORT)
+			return -EAGAIN;
+	}
+
+	if (fault & DMA_FSTS_ICE)
+		writel(DMA_FSTS_ICE, iommu->reg + DMAR_FSTS_REG);
+
 	return 0;
 }
@@ -741,7 +775,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index)
  */
 int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
-	int rc = 0;
+	int rc;
 	struct q_inval *qi = iommu->qi;
 	struct qi_desc *hw, wait_desc;
 	int wait_index, index;
@@ -752,6 +786,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 
 	hw = qi->desc;
 
+restart:
+	rc = 0;
+
 	spin_lock_irqsave(&qi->q_lock, flags);
 	while (qi->free_cnt < 3) {
 		spin_unlock_irqrestore(&qi->q_lock, flags);
@@ -782,7 +819,7 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 	 * update the HW tail register indicating the presence of
 	 * new descriptors.
 	 */
-	writel(qi->free_head << 4, iommu->reg + DMAR_IQT_REG);
+	writel(qi->free_head << DMAR_IQ_SHIFT, iommu->reg + DMAR_IQT_REG);
 
 	while (qi->desc_status[wait_index] != QI_DONE) {
 		/*
@@ -794,18 +831,21 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 		 */
 		rc = qi_check_fault(iommu, index);
 		if (rc)
-			goto out;
+			break;
 
 		spin_unlock(&qi->q_lock);
 		cpu_relax();
 		spin_lock(&qi->q_lock);
 	}
-out:
-	qi->desc_status[index] = qi->desc_status[wait_index] = QI_DONE;
+
+	qi->desc_status[index] = QI_DONE;
 
 	reclaim_free_desc(qi);
 	spin_unlock_irqrestore(&qi->q_lock, flags);
 
+	if (rc == -EAGAIN)
+		goto restart;
+
 	return rc;
 }
@@ -857,6 +897,27 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr
[PATCH v5 6/6] VT-d: support the device IOTLB
Enable the device IOTLB (i.e. ATS) for both the bare metal and KVM environments.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c     | 109 +---
 include/linux/dma_remapping.h |   1 +
 include/linux/intel-iommu.h   |   1 +
 3 files changed, 102 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 6d7cb84..c3cdfc9 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -252,6 +252,7 @@ struct device_domain_info {
 	u8 bus;			/* PCI bus number */
 	u8 devfn;		/* PCI devfn number */
 	struct pci_dev *dev;	/* it's NULL for PCIE-to-PCI bridge */
+	struct intel_iommu *iommu; /* IOMMU used by this device */
 	struct dmar_domain *domain; /* pointer to domain */
 };
@@ -945,6 +946,77 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 			(unsigned long long)DMA_TLB_IAIG(val));
 }
 
+static struct device_domain_info *iommu_support_dev_iotlb(
+	struct dmar_domain *domain, int segment, u8 bus, u8 devfn)
+{
+	int found = 0;
+	unsigned long flags;
+	struct device_domain_info *info;
+	struct intel_iommu *iommu = device_to_iommu(segment, bus, devfn);
+
+	if (!ecap_dev_iotlb_support(iommu->ecap))
+		return NULL;
+
+	if (!iommu->qi)
+		return NULL;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link)
+		if (info->bus == bus && info->devfn == devfn) {
+			found = 1;
+			break;
+		}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	if (!found || !info->dev)
+		return NULL;
+
+	if (!pci_find_ext_capability(info->dev, PCI_EXT_CAP_ID_ATS))
+		return NULL;
+
+	if (!dmar_find_matched_atsr_unit(info->dev))
+		return NULL;
+
+	info->iommu = iommu;
+
+	return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+	if (!info)
+		return;
+
+	pci_enable_ats(info->dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+	if (!info->dev || !pci_ats_enabled(info->dev))
+		return;
+
+	pci_disable_ats(info->dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+				  u64 addr, unsigned mask)
+{
+	u16 sid, qdep;
+	unsigned long flags;
+	struct device_domain_info *info;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link) {
+		if (!info->dev || !pci_ats_enabled(info->dev))
+			continue;
+
+		sid = info->bus << 8 | info->devfn;
+		qdep = pci_ats_queue_depth(info->dev);
+		qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 				  u64 addr, unsigned int pages)
 {
@@ -965,6 +1037,8 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 	else
 		iommu->flush.flush_iotlb(iommu, did, addr, mask,
 					 DMA_TLB_PSI_FLUSH);
+	if (did)
+		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -1305,6 +1379,7 @@ static int domain_context_mapping_one(struct dmar_domain *domain, int segment,
 	unsigned long ndomains;
 	int id;
 	int agaw;
+	struct device_domain_info *info = NULL;
 
 	pr_debug("Set context mapping for %02x:%02x.%d\n",
 		 bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1372,15 +1447,21 @@ static int domain_context_mapping_one(struct dmar_domain *domain, int segment,
 
 	context_set_domain_id(context, id);
 
+	if (translation != CONTEXT_TT_PASS_THROUGH) {
+		info = iommu_support_dev_iotlb(domain, segment, bus, devfn);
+		translation = info ? CONTEXT_TT_DEV_IOTLB :
+				     CONTEXT_TT_MULTI_LEVEL;
+	}
 	/*
	 * In pass through mode, AW must be programmed to indicate the largest
	 * AGAW value supported by hardware. And ASR is ignored by hardware.
	 */
-	if (likely(translation == CONTEXT_TT_MULTI_LEVEL)) {
-		context_set_address_width(context, iommu->agaw);
-		context_set_address_root(context, virt_to_phys(pgd));
-	} else
+	if (unlikely(translation == CONTEXT_TT_PASS_THROUGH))
 		context_set_address_width(context, iommu->msagaw);
+	else {
+		context_set_address_root(context, virt_to_phys(pgd
[PATCH v4 resend 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
Make iommu_flush_iotlb_psi() and flush_unmaps() more readable.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c | 46 +---
 1 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 001b328..a2cbc01 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -968,30 +968,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 	u64 addr, unsigned int pages, int non_present_entry_flush)
 {
-	unsigned int mask;
+	int rc;
+	unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 
 	BUG_ON(addr & (~VTD_PAGE_MASK));
 	BUG_ON(pages == 0);
 
-	/* Fallback to domain selective flush if no PSI support */
-	if (!cap_pgsel_inv(iommu->cap))
-		return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-						DMA_TLB_DSI_FLUSH,
-						non_present_entry_flush);
-
 	/*
+	 * Fallback to domain selective flush if no PSI support or the size is
+	 * too big.
 	 * PSI requires page size to be 2 ^ x, and the base address is naturally
 	 * aligned to the size
 	 */
-	mask = ilog2(__roundup_pow_of_two(pages));
-	/* Fallback to domain selective flush if size is too big */
-	if (mask > cap_max_amask_val(iommu->cap))
-		return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-			DMA_TLB_DSI_FLUSH, non_present_entry_flush);
-
-	return iommu->flush.flush_iotlb(iommu, did, addr, mask,
-					DMA_TLB_PSI_FLUSH,
-					non_present_entry_flush);
+	if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
+		rc = iommu->flush.flush_iotlb(iommu, did, 0, 0,
+					      DMA_TLB_DSI_FLUSH,
+					      non_present_entry_flush);
+	else
+		rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
+					      DMA_TLB_PSI_FLUSH,
+					      non_present_entry_flush);
+	return rc;
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2214,15 +2211,16 @@ static void flush_unmaps(void)
 		if (!iommu)
 			continue;
 
-		if (deferred_flush[i].next) {
-			iommu->flush.flush_iotlb(iommu, 0, 0, 0,
-						 DMA_TLB_GLOBAL_FLUSH, 0);
-			for (j = 0; j < deferred_flush[i].next; j++) {
-				__free_iova(&deferred_flush[i].domain[j]->iovad,
-					    deferred_flush[i].iova[j]);
-			}
-			deferred_flush[i].next = 0;
+		if (!deferred_flush[i].next)
+			continue;
+
+		iommu->flush.flush_iotlb(iommu, 0, 0, 0,
+					 DMA_TLB_GLOBAL_FLUSH, 0);
+		for (j = 0; j < deferred_flush[i].next; j++) {
+			__free_iova(&deferred_flush[i].domain[j]->iovad,
+				    deferred_flush[i].iova[j]);
 		}
+		deferred_flush[i].next = 0;
 	}
 
 	list_size = 0;
-- 
1.5.6.4
[PATCH v4 resend 0/6] ATS capability support for Intel IOMMU
This patch series implements Address Translation Service support for the Intel IOMMU. A PCIe Endpoint that supports the ATS capability can request DMA address translations from the IOMMU and cache them itself. This can alleviate IOMMU TLB pressure and improve hardware performance in the I/O virtualization environment.

ATS is one of the PCI-SIG I/O Virtualization (IOV) Specifications. The spec can be found at: http://www.pcisig.com/specifications/iov/ats/ (it requires membership).

Changelog:
v3 -> v4:
  1, coding style fixes (Grant Grundler)
  2, support the Virtual Function ATS capability

v2 -> v3:
  1, throw error message if VT-d hardware detects invalid descriptor on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS Invalidate Queue Depth (Matthew Wilcox)

v1 -> v2:
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)

Yu Zhao (6):
  PCI: support the ATS capability
  PCI: handle Virtual Function ATS enabling
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c          | 189 +++---
 drivers/pci/intel-iommu.c   | 140 ++--
 drivers/pci/iov.c           | 155 ++--
 drivers/pci/pci.h           |  39 +
 include/linux/dmar.h        |   9 ++
 include/linux/intel-iommu.h |  16 -
 include/linux/pci.h         |   2 +
 include/linux/pci_regs.h    |  10 +++
 8 files changed, 515 insertions(+), 45 deletions(-)
[PATCH v4 resend 1/6] PCI: support the ATS capability
The PCIe ATS capability makes the Endpoint able to request DMA address translations from the IOMMU and cache them on the device side, thus alleviating IOMMU pressure and improving hardware performance in the I/O virtualization environment.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c        | 105 ++
 drivers/pci/pci.h        |  37 
 include/linux/pci.h      |   2 +
 include/linux/pci_regs.h |  10 
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b497daa..0a7a1b4 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -5,6 +5,7 @@
  *
  * PCI Express I/O Virtualization (IOV) support.
  *   Single Root IOV 1.0
+ *   Address Translation Service 1.0
  */
 
 #include <linux/pci.h>
@@ -679,3 +680,107 @@ irqreturn_t pci_sriov_migration(struct pci_dev *dev)
 	return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_migration);
+
+static int ats_alloc_one(struct pci_dev *dev, int ps)
+{
+	int pos;
+	u16 cap;
+	struct pci_ats *ats;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+	if (!pos)
+		return -ENODEV;
+
+	ats = kzalloc(sizeof(*ats), GFP_KERNEL);
+	if (!ats)
+		return -ENOMEM;
+
+	ats->pos = pos;
+	ats->stu = ps;
+	pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+	ats->qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+					    PCI_ATS_MAX_QDEP;
+	dev->ats = ats;
+
+	return 0;
+}
+
+static void ats_free_one(struct pci_dev *dev)
+{
+	kfree(dev->ats);
+	dev->ats = NULL;
+}
+
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @ps: the IOMMU page shift
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_ats(struct pci_dev *dev, int ps)
+{
+	int rc;
+	u16 ctrl;
+
+	BUG_ON(dev->ats);
+
+	if (ps < PCI_ATS_MIN_STU)
+		return -EINVAL;
+
+	rc = ats_alloc_one(dev, ps);
+	if (rc)
+		return rc;
+
+	ctrl = PCI_ATS_CTRL_ENABLE;
+	ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+	return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+	u16 ctrl;
+
+	BUG_ON(!dev->ats);
+
+	pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
+	ctrl &= ~PCI_ATS_CTRL_ENABLE;
+	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+	ats_free_one(dev);
+}
+
+/**
+ * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or negative on failure.
+ *
+ * The ATS spec uses 0 in the Invalidate Queue Depth field to
+ * indicate that the function can accept 32 Invalidate Requests.
+ * But here we use the `real' values (i.e. 1~32) for the Queue
+ * Depth.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+	int pos;
+	u16 cap;
+
+	if (dev->ats)
+		return dev->ats->qdep;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+	if (!pos)
+		return -ENODEV;
+
+	pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+
+	return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+				       PCI_ATS_MAX_QDEP;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d03f6b9..3c2ec64 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -229,6 +229,13 @@ struct pci_sriov {
 	u8 __iomem *mstate;	/* VF Migration State Array */
 };
 
+/* Address Translation Service */
+struct pci_ats {
+	int pos;	/* capability position */
+	int stu;	/* Smallest Translation Unit */
+	int qdep;	/* Invalidate Queue Depth */
+};
+
 #ifdef CONFIG_PCI_IOV
 extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
@@ -236,6 +243,20 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 				enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
 extern int pci_iov_bus_range(struct pci_bus *bus);
+
+extern int pci_enable_ats(struct pci_dev *dev, int ps);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+	return !!dev->ats;
+}
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -257,6 +278,22 @@ static inline int pci_iov_bus_range(struct pci_bus *bus)
 {
 	return 0;
 }
+
+static inline int
[PATCH v4 resend 3/6] VT-d: parse ATSR in DMA Remapping Reporting Structure
Parse the Root Port ATS Capability Reporting Structure in the DMA Remapping Reporting Structure ACPI table.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/dmar.c          |  112 +++++++++++++++++++++++++++++++++++++++--
 include/linux/dmar.h        |    9 ++++
 include/linux/intel-iommu.h |    1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index fa3a113..eaa405f 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -267,6 +267,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
         }
         return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+        struct acpi_dmar_atsr *atsr;
+        struct dmar_atsr_unit *atsru;
+
+        atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+        atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+        if (!atsru)
+                return -ENOMEM;
+
+        atsru->hdr = hdr;
+        atsru->include_all = atsr->flags & 0x1;
+
+        list_add(&atsru->list, &dmar_atsr_units);
+
+        return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+        int rc;
+        struct acpi_dmar_atsr *atsr;
+
+        if (atsru->include_all)
+                return 0;
+
+        atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+        rc = dmar_parse_dev_scope((void *)(atsr + 1),
+                                  (void *)atsr + atsr->header.length,
+                                  &atsru->devices_cnt, &atsru->devices,
+                                  atsr->segment);
+        if (rc || !atsru->devices_cnt) {
+                list_del(&atsru->list);
+                kfree(atsru);
+        }
+
+        return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+        int i;
+        struct pci_bus *bus;
+        struct acpi_dmar_atsr *atsr;
+        struct dmar_atsr_unit *atsru;
+
+        list_for_each_entry(atsru, &dmar_atsr_units, list) {
+                atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+                if (atsr->segment == pci_domain_nr(dev->bus))
+                        goto found;
+        }
+
+        return 0;
+
+found:
+        for (bus = dev->bus; bus; bus = bus->parent) {
+                struct pci_dev *bridge = bus->self;
+
+                if (!bridge || !bridge->is_pcie ||
+                    bridge->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+                        return 0;
+
+                if (bridge->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+                        for (i = 0; i < atsru->devices_cnt; i++)
+                                if (atsru->devices[i] == bridge)
+                                        return 1;
+                        break;
+                }
+        }
+
+        if (atsru->include_all)
+                return 1;
+
+        return 0;
+}
 #endif
 
 static void __init
@@ -274,22 +352,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header)
 {
         struct acpi_dmar_hardware_unit *drhd;
         struct acpi_dmar_reserved_memory *rmrr;
+        struct acpi_dmar_atsr *atsr;
 
         switch (header->type) {
         case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-                drhd = (struct acpi_dmar_hardware_unit *)header;
+                drhd = container_of(header, struct acpi_dmar_hardware_unit,
+                                    header);
                 printk (KERN_INFO PREFIX
-                        "DRHD (flags: 0x%08x)base: 0x%016Lx\n",
-                        drhd->flags, (unsigned long long)drhd->address);
+                        "DRHD base: %#016Lx flags: %#x\n",
+                        (unsigned long long)drhd->address, drhd->flags);
                 break;
         case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-                rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+                rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+                                    header);
                 printk (KERN_INFO PREFIX
-                        "RMRR base: 0x%016Lx end: 0x%016Lx\n",
+                        "RMRR base: %#016Lx end: %#016Lx\n",
                         (unsigned long long)rmrr->base_address,
                         (unsigned long long)rmrr->end_address);
                 break;
+        case ACPI_DMAR_TYPE_ATSR:
+                atsr = container_of(header, struct acpi_dmar_atsr, header);
+                printk(KERN_INFO PREFIX "ATSR flags: %#x\n", atsr->flags);
+                break;
         }
 }
 
@@ -363,6 +447,11 @@ parse_dmar_table(void)
                         ret = dmar_parse_one_rmrr(entry_header);
 #endif
                         break;
+                case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+                        ret = dmar_parse_one_atsr(entry_header);
+#endif
+                        break;
                 default:
                         printk(KERN_WARNING PREFIX
                                 "Unknown DMAR structure type\n");
@@ -431,11 +520,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
         {
                 struct
[PATCH v4 resend 2/6] PCI: handle Virtual Function ATS enabling
The SR-IOV spec requires that the Smallest Translation Unit and the Invalidate Queue Depth fields in the Virtual Function ATS capability are hardwired to 0. If a function is a Virtual Function, then set its Physical Function's STU before enabling the ATS.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/iov.c |   66 +++++++++++++++++++++++++++++++++++++++++----------
 drivers/pci/pci.h |    4 ++-
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0a7a1b4..4151404 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -491,10 +491,10 @@ found:
         if (pdev)
                 iov->dev = pci_dev_get(pdev);
-        else {
+        else
                 iov->dev = dev;
-                mutex_init(&iov->lock);
-        }
+
+        mutex_init(&iov->lock);
 
         dev->sriov = iov;
         dev->is_physfn = 1;
@@ -514,11 +514,11 @@ static void sriov_release(struct pci_dev *dev)
 {
         BUG_ON(dev->sriov->nr_virtfn);
 
-        if (dev == dev->sriov->dev)
-                mutex_destroy(&dev->sriov->lock);
-        else
+        if (dev != dev->sriov->dev)
                 pci_dev_put(dev->sriov->dev);
 
+        mutex_destroy(&dev->sriov->lock);
+
         kfree(dev->sriov);
         dev->sriov = NULL;
 }
@@ -723,19 +723,40 @@ int pci_enable_ats(struct pci_dev *dev, int ps)
         int rc;
         u16 ctrl;
 
-        BUG_ON(dev->ats);
+        BUG_ON(dev->ats && dev->ats->is_enabled);
 
         if (ps < PCI_ATS_MIN_STU)
                 return -EINVAL;
 
-        rc = ats_alloc_one(dev, ps);
-        if (rc)
-                return rc;
+        if (dev->is_physfn || dev->is_virtfn) {
+                struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+                mutex_lock(&pdev->sriov->lock);
+                if (pdev->ats)
+                        rc = pdev->ats->stu == ps ? 0 : -EINVAL;
+                else
+                        rc = ats_alloc_one(pdev, ps);
+
+                if (!rc)
+                        pdev->ats->ref_cnt++;
+                mutex_unlock(&pdev->sriov->lock);
+                if (rc)
+                        return rc;
+        }
+
+        if (!dev->is_physfn) {
+                rc = ats_alloc_one(dev, ps);
+                if (rc)
+                        return rc;
+        }
 
         ctrl = PCI_ATS_CTRL_ENABLE;
-        ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+        if (!dev->is_virtfn)
+                ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
         pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
+        dev->ats->is_enabled = 1;
+
         return 0;
 }
 
@@ -747,13 +768,26 @@ void pci_disable_ats(struct pci_dev *dev)
 {
         u16 ctrl;
 
-        BUG_ON(!dev->ats);
+        BUG_ON(!dev->ats || !dev->ats->is_enabled);
 
         pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
         ctrl &= ~PCI_ATS_CTRL_ENABLE;
         pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
-        ats_free_one(dev);
+        dev->ats->is_enabled = 0;
+
+        if (dev->is_physfn || dev->is_virtfn) {
+                struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+                mutex_lock(&pdev->sriov->lock);
+                pdev->ats->ref_cnt--;
+                if (!pdev->ats->ref_cnt)
+                        ats_free_one(pdev);
+                mutex_unlock(&pdev->sriov->lock);
+        }
+
+        if (!dev->is_physfn)
+                ats_free_one(dev);
 }
 
 /**
@@ -765,13 +799,17 @@ void pci_disable_ats(struct pci_dev *dev)
  * The ATS spec uses 0 in the Invalidate Queue Depth field to
  * indicate that the function can accept 32 Invalidate Requests.
  * But here we use the `real' values (i.e. 1~32) for the Queue
- * Depth.
+ * Depth; and 0 indicates the function shares the Queue with
+ * other functions (doesn't exclusively own a Queue).
  */
 int pci_ats_queue_depth(struct pci_dev *dev)
 {
         int pos;
         u16 cap;
 
+        if (dev->is_virtfn)
+                return 0;
+
         if (dev->ats)
                 return dev->ats->qdep;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3c2ec64..f73bcbe 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -234,6 +234,8 @@ struct pci_ats {
         int pos;        /* capability position */
         int stu;        /* Smallest Translation Unit */
         int qdep;       /* Invalidate Queue Depth */
+        int ref_cnt;    /* Physical Function reference count */
+        int is_enabled:1;       /* Enable bit is set */
 };
 
 #ifdef CONFIG_PCI_IOV
@@ -255,7 +257,7 @@ extern int pci_ats_queue_depth(struct pci_dev *dev);
  */
 static inline int pci_ats_enabled(struct pci_dev *dev)
 {
-        return !!dev->ats;
+        return dev->ats && dev->ats->is_enabled;
 }
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
-- 
1.5.6.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 resend 4/6] VT-d: add device IOTLB invalidation support
Support device IOTLB invalidation to flush the translation cached in the Endpoint.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/dmar.c          |   77 ++++++++++++++++++++++++++++++++++++++----
 include/linux/intel-iommu.h |   14 +++-
 2 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index eaa405f..6afd804 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -690,7 +690,8 @@ void free_iommu(struct intel_iommu *iommu)
  */
 static inline void reclaim_free_desc(struct q_inval *qi)
 {
-        while (qi->desc_status[qi->free_tail] == QI_DONE) {
+        while (qi->desc_status[qi->free_tail] == QI_DONE ||
+               qi->desc_status[qi->free_tail] == QI_ABORT) {
                 qi->desc_status[qi->free_tail] = QI_FREE;
                 qi->free_tail = (qi->free_tail + 1) % QI_LENGTH;
                 qi->free_cnt++;
@@ -700,10 +701,13 @@ static inline void reclaim_free_desc(struct q_inval *qi)
 static int qi_check_fault(struct intel_iommu *iommu, int index)
 {
         u32 fault;
-        int head;
+        int head, tail;
         struct q_inval *qi = iommu->qi;
         int wait_index = (index + 1) % QI_LENGTH;
 
+        if (qi->desc_status[wait_index] == QI_ABORT)
+                return -EAGAIN;
+
         fault = readl(iommu->reg + DMAR_FSTS_REG);
 
         /*
@@ -713,7 +717,11 @@ static int qi_check_fault(struct intel_iommu *iommu, int index)
          */
         if (fault & DMA_FSTS_IQE) {
                 head = readl(iommu->reg + DMAR_IQH_REG);
-                if ((head >> 4) == index) {
+                if ((head >> DMAR_IQ_SHIFT) == index) {
+                        printk(KERN_ERR "VT-d detected invalid descriptor: "
+                                "low=%llx, high=%llx\n",
+                                (unsigned long long)qi->desc[index].low,
+                                (unsigned long long)qi->desc[index].high);
                         memcpy(&qi->desc[index], &qi->desc[wait_index],
                                         sizeof(struct qi_desc));
                         __iommu_flush_cache(iommu, &qi->desc[index],
@@ -723,6 +731,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int index)
                 }
         }
 
+        /*
+         * If ITE happens, all pending wait_desc commands are aborted.
+         * No new descriptors are fetched until the ITE is cleared.
+         */
+        if (fault & DMA_FSTS_ITE) {
+                head = readl(iommu->reg + DMAR_IQH_REG);
+                head = ((head >> DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+                head |= 1;
+                tail = readl(iommu->reg + DMAR_IQT_REG);
+                tail = ((tail >> DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+
+                writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
+
+                do {
+                        if (qi->desc_status[head] == QI_IN_USE)
+                                qi->desc_status[head] = QI_ABORT;
+                        head = (head - 2 + QI_LENGTH) % QI_LENGTH;
+                } while (head != tail);
+
+                if (qi->desc_status[wait_index] == QI_ABORT)
+                        return -EAGAIN;
+        }
+
+        if (fault & DMA_FSTS_ICE)
+                writel(DMA_FSTS_ICE, iommu->reg + DMAR_FSTS_REG);
+
         return 0;
 }
 
@@ -732,7 +766,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index)
  */
 int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
-        int rc = 0;
+        int rc;
         struct q_inval *qi = iommu->qi;
         struct qi_desc *hw, wait_desc;
         int wait_index, index;
@@ -743,6 +777,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 
         hw = qi->desc;
 
+restart:
+        rc = 0;
+
         spin_lock_irqsave(&qi->q_lock, flags);
         while (qi->free_cnt < 3) {
                 spin_unlock_irqrestore(&qi->q_lock, flags);
@@ -773,7 +810,7 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
          * update the HW tail register indicating the presence of
          * new descriptors.
          */
-        writel(qi->free_head << 4, iommu->reg + DMAR_IQT_REG);
+        writel(qi->free_head << DMAR_IQ_SHIFT, iommu->reg + DMAR_IQT_REG);
 
         while (qi->desc_status[wait_index] != QI_DONE) {
                 /*
@@ -785,18 +822,21 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
                  */
                 rc = qi_check_fault(iommu, index);
                 if (rc)
-                        goto out;
+                        break;
 
                 spin_unlock(&qi->q_lock);
                 cpu_relax();
                 spin_lock(&qi->q_lock);
         }
-out:
-        qi->desc_status[index] = qi->desc_status[wait_index] = QI_DONE;
+
+        qi->desc_status[index] = QI_DONE;
 
         reclaim_free_desc(qi);
         spin_unlock_irqrestore(&qi->q_lock, flags);
 
+        if (rc == -EAGAIN)
+                goto restart;
+
         return rc;
 }
 
@@ -863,6 +903,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr
[PATCH v4 resend 6/6] VT-d: support the device IOTLB
Enable the device IOTLB (i.e. ATS) for both the bare metal and KVM environments.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/intel-iommu.c   |  100 ++++++++++++++++++++++++++++++++++++++++--
 include/linux/intel-iommu.h |    1 +
 2 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index a2cbc01..661a02b 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -128,6 +128,7 @@ static inline void context_set_fault_enable(struct context_entry *context)
 }
 
 #define CONTEXT_TT_MULTI_LEVEL  0
+#define CONTEXT_TT_DEV_IOTLB    1
 
 static inline void context_set_translation_type(struct context_entry *context,
                                                 unsigned long value)
@@ -251,6 +252,7 @@ struct device_domain_info {
         int segment;            /* PCI domain */
         u8 bus;                 /* PCI bus number */
         u8 devfn;               /* PCI devfn number */
+        struct intel_iommu *iommu; /* IOMMU used by this device */
         struct pci_dev *dev;    /* it's NULL for PCIE-to-PCI bridge */
         struct dmar_domain *domain; /* pointer to domain */
 };
@@ -965,6 +967,81 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
         return 0;
 }
 
+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain,
+                        int segment, u8 bus, u8 devfn)
+{
+        int found = 0;
+        unsigned long flags;
+        struct device_domain_info *info;
+        struct intel_iommu *iommu = device_to_iommu(segment, bus, devfn);
+
+        if (!ecap_dev_iotlb_support(iommu->ecap))
+                return NULL;
+
+        if (!iommu->qi)
+                return NULL;
+
+        spin_lock_irqsave(&device_domain_lock, flags);
+        list_for_each_entry(info, &domain->devices, link)
+                if (info->bus == bus && info->devfn == devfn) {
+                        found = 1;
+                        break;
+                }
+        spin_unlock_irqrestore(&device_domain_lock, flags);
+
+        if (!found || !info->dev)
+                return NULL;
+
+        if (!pci_find_ext_capability(info->dev, PCI_EXT_CAP_ID_ATS))
+                return NULL;
+
+        if (!dmar_find_matched_atsr_unit(info->dev))
+                return NULL;
+
+        info->iommu = iommu;
+
+        return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+        if (!info)
+                return;
+
+        pci_enable_ats(info->dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+        if (!info->dev || !pci_ats_enabled(info->dev))
+                return;
+
+        pci_disable_ats(info->dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+                                  u64 addr, unsigned mask)
+{
+        int rc;
+        u16 sid, qdep;
+        unsigned long flags;
+        struct device_domain_info *info;
+
+        spin_lock_irqsave(&device_domain_lock, flags);
+        list_for_each_entry(info, &domain->devices, link) {
+                if (!info->dev || !pci_ats_enabled(info->dev))
+                        continue;
+
+                sid = info->bus << 8 | info->devfn;
+                qdep = pci_ats_queue_depth(info->dev);
+                rc = qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
+                if (rc)
+                        dev_err(&info->dev->dev, "flush IOTLB failed\n");
+        }
+        spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
         u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -988,6 +1065,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
         rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
                                       DMA_TLB_PSI_FLUSH,
                                       non_present_entry_flush);
+        if (!rc && !non_present_entry_flush)
+                iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
+
         return rc;
 }
 
@@ -1329,6 +1409,7 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
         unsigned long ndomains;
         int id;
         int agaw;
+        struct device_domain_info *info;
 
         pr_debug("Set context mapping for %02x:%02x.%d\n",
                 bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1394,7 +1475,9 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
         context_set_domain_id(context, id);
         context_set_address_width(context, iommu->agaw);
         context_set_address_root(context, virt_to_phys(pgd));
-        context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+        info = iommu_support_dev_iotlb(domain, segment, bus, devfn);
+        context_set_translation_type(context,
+                info ? CONTEXT_TT_DEV_IOTLB : CONTEXT_TT_MULTI_LEVEL);
         context_set_fault_enable(context);
         context_set_present(context
Re: KVM x86_64 with SR-IOV..?
Hi,

The VF also works in the host if the VF driver is programmed properly. So it would be easier to develop the VF driver in the host and then verify it in the guest.

BTW, I didn't see SR-IOV enabled in your dmesg; did you select CONFIG_PCI_IOV in the kernel .config?

Thanks,
Yu

On Mon, May 04, 2009 at 06:40:36PM +0800, Nicholas A. Bellinger wrote:
On Mon, 2009-05-04 at 17:49 +0800, Sheng Yang wrote:
On Monday 04 May 2009 17:11:59 Nicholas A. Bellinger wrote:
On Mon, 2009-05-04 at 16:20 +0800, Sheng Yang wrote:
On Monday 04 May 2009 12:36:04 Nicholas A. Bellinger wrote:
On Mon, 2009-05-04 at 10:09 +0800, Sheng Yang wrote:
On Monday 04 May 2009 08:53:07 Nicholas A. Bellinger wrote:
On Sat, 2009-05-02 at 18:22 +0800, Sheng Yang wrote:
On Thu, Apr 30, 2009 at 01:22:54PM -0700, Nicholas A. Bellinger wrote:

Greetings KVM folks,

I am wondering if any information exists for doing SR-IOV on the new VT-d capable chipsets with KVM..? From what I understand the patches for doing this with KVM are floating around, but I have been unable to find any user-level docs for actually making it all go against upstream v2.6.30-rc3 code..

So far I have been doing IOV testing with Xen 3.3 and 3.4.0-pre, and I am really hoping to be able to jump to KVM for single-function and then multi-function SR-IOV. I know that the VM migration stuff for IOV in Xen is up and running, and I assume it is being worked in for KVM instance migration as well..? This part is less important (at least for me :-) than getting a stable SR-IOV setup running under the KVM hypervisor.. Does anyone have any pointers for this..?

Any comments or suggestions are appreciated!

Hi Nicholas,

The patches are not floating around now. As you know, SR-IOV for Linux has been in 2.6.30, so you can use upstream KVM and qemu-kvm (or the recently released kvm-85) with 2.6.30-rc3 as the host kernel. Some time ago there were several SR-IOV related patches for qemu-kvm, and now they have all been checked in.
And for KVM, extra documentation is not necessary, for you can simply assign a VF to a guest like any other device. How to create a VF is specific to each device driver. So just create a VF and then assign it to a KVM guest.

Greetings Sheng,

So, I have been trying the latest kvm-85 release on a v2.6.30-rc3 checkout from linux-2.6.git on a CentOS 5u3 x86_64 install on an Intel IOH-5520 based dual socket Nehalem board. I have enabled DMAR and Interrupt Remapping on my KVM host using v2.6.30-rc3, and from what I can tell, the KVM_CAP_* defines from libkvm are enabled when building kvm-85 after './configure --kerneldir=/usr/src/linux-2.6.git', and the PCI passthrough code is being enabled in kvm-85/qemu/hw/device-assignment.c AFAICT..

From there, I use the freshly installed qemu-x86_64-system binary to start a Debian 5 x86_64 HVM (that previously had been moving network packets under Xen for PCIe passthrough). I see the MSI-X interrupt remapping working on the KVM host for the passed -pcidevice, and the MMIO mappings from the qemu build that I also saw while using Xen/qemu-dm built with PCI passthrough are there as well..

Hi Nicholas,

But while the KVM guest is booting, I see the following exception(s) from qemu-x86_64-system for one of the VFs for a multi-function PCIe device:

BUG: kvm_destroy_phys_mem: invalid parameters (slot=-1)

This one is mostly harmless.

Ok, good to know.. :-)

I try with one of the on-board e1000e ports (02:00.0) and I see the same exception along with some MSI-X exceptions from qemu-x86_64-system in the KVM guest.. However, I am still able to see the e1000e and the other vxge multi-function device with lspci, but I am unable to dhcp or ping with the e1000e, and the VF from the multi-function device fails to register the MSI-X interrupt in the guest..

Did you see the interrupt on the guest and host side?

Ok, I am restarting the e1000e test with a fresh Fedora 11 install and KVM host kernel 2.6.29.1-111.fc11.x86_64.
After unbinding and attaching the e1000e single-function device at 02:00.0 to pci-stub with:

echo 8086 10d3 > /sys/bus/pci/drivers/pci-stub/new_id
echo :02:00.0 > /sys/bus/pci/devices/:02:00.0/driver/unbind
echo :02:00.0 > /sys/bus/pci/drivers/pci-stub/bind

I see the following in the KVM host kernel ring buffer:
[RFC PATCH 1/3] PCI: rewrite Function Level Reset
Changes:
1) remove disable_irq() so the shared IRQ won't be disabled.
2) replace the 1s wait with 100, 200 and 400ms wait intervals for the Pending Transaction.
3) replace mdelay() with msleep().
4) add might_sleep().
5) lock the device to prevent PM suspend from accessing the CSRs during the reset.
6) coding style fixes.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/pci.c   |  166 ++++++++++++++++++++++++++-------------------------
 include/linux/pci.h |    2 +-
 2 files changed, 85 insertions(+), 83 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index af4db4e..46ae997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2008,111 +2008,112 @@ int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask)
 EXPORT_SYMBOL(pci_set_dma_seg_boundary);
 #endif
 
-static int __pcie_flr(struct pci_dev *dev, int probe)
+static int pcie_flr(struct pci_dev *dev, int probe)
 {
-        u16 status;
+        int i;
+        int pos;
         u32 cap;
-        int exppos = pci_find_capability(dev, PCI_CAP_ID_EXP);
+        u16 status;
 
-        if (!exppos)
+        pos = pci_find_capability(dev, PCI_CAP_ID_EXP);
+        if (!pos)
                 return -ENOTTY;
-        pci_read_config_dword(dev, exppos + PCI_EXP_DEVCAP, &cap);
+
+        pci_read_config_dword(dev, pos + PCI_EXP_DEVCAP, &cap);
         if (!(cap & PCI_EXP_DEVCAP_FLR))
                 return -ENOTTY;
 
         if (probe)
                 return 0;
 
-        pci_block_user_cfg_access(dev);
-
         /* Wait for Transaction Pending bit clean */
-        pci_read_config_word(dev, exppos + PCI_EXP_DEVSTA, &status);
-        if (!(status & PCI_EXP_DEVSTA_TRPND))
-                goto transaction_done;
+        for (i = 0; i < 4; i++) {
+                if (i)
+                        msleep((1 << (i - 1)) * 100);
 
-        msleep(100);
-        pci_read_config_word(dev, exppos + PCI_EXP_DEVSTA, &status);
-        if (!(status & PCI_EXP_DEVSTA_TRPND))
-                goto transaction_done;
-
-        dev_info(&dev->dev, "Busy after 100ms while trying to reset; "
-                        "sleeping for 1 second\n");
-        ssleep(1);
-        pci_read_config_word(dev, exppos + PCI_EXP_DEVSTA, &status);
-        if (status & PCI_EXP_DEVSTA_TRPND)
-                dev_info(&dev->dev, "Still busy after 1s; "
-                                "proceeding with reset anyway\n");
-
-transaction_done:
-        pci_write_config_word(dev, exppos + PCI_EXP_DEVCTL,
+                pci_read_config_word(dev, pos + PCI_EXP_DEVSTA, &status);
+                if (!(status & PCI_EXP_DEVSTA_TRPND))
+                        goto clear;
+        }
+
+        dev_err(&dev->dev, "transaction is not cleared; "
+                        "proceeding with reset anyway\n");
+
+clear:
+        pci_write_config_word(dev, pos + PCI_EXP_DEVCTL,
                                 PCI_EXP_DEVCTL_BCR_FLR);
-        mdelay(100);
+        msleep(100);
 
-        pci_unblock_user_cfg_access(dev);
         return 0;
 }
 
-static int __pci_af_flr(struct pci_dev *dev, int probe)
+static int pci_af_flr(struct pci_dev *dev, int probe)
 {
-        int cappos = pci_find_capability(dev, PCI_CAP_ID_AF);
-        u8 status;
+        int i;
+        int pos;
         u8 cap;
+        u8 status;
 
-        if (!cappos)
+        pos = pci_find_capability(dev, PCI_CAP_ID_AF);
+        if (!pos)
                 return -ENOTTY;
-        pci_read_config_byte(dev, cappos + PCI_AF_CAP, &cap);
+
+        pci_read_config_byte(dev, pos + PCI_AF_CAP, &cap);
         if (!(cap & PCI_AF_CAP_TP) || !(cap & PCI_AF_CAP_FLR))
                 return -ENOTTY;
 
         if (probe)
                 return 0;
 
-        pci_block_user_cfg_access(dev);
-
         /* Wait for Transaction Pending bit clean */
-        pci_read_config_byte(dev, cappos + PCI_AF_STATUS, &status);
-        if (!(status & PCI_AF_STATUS_TP))
-                goto transaction_done;
+        for (i = 0; i < 4; i++) {
+                if (i)
+                        msleep((1 << (i - 1)) * 100);
+
+                pci_read_config_byte(dev, pos + PCI_AF_STATUS, &status);
+                if (!(status & PCI_AF_STATUS_TP))
+                        goto clear;
+        }
+
+        dev_err(&dev->dev, "transaction is not cleared; "
+                        "proceeding with reset anyway\n");
 
+clear:
+        pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
         msleep(100);
-        pci_read_config_byte(dev, cappos + PCI_AF_STATUS, &status);
-        if (!(status & PCI_AF_STATUS_TP))
-                goto transaction_done;
-
-        dev_info(&dev->dev, "Busy after 100ms while trying to "
-                        "reset; sleeping for 1 second\n");
-        ssleep(1);
-        pci_read_config_byte(dev, cappos + PCI_AF_STATUS, &status);
-        if (status & PCI_AF_STATUS_TP)
-                dev_info(&dev->dev, "Still busy after 1s; "
-                                "proceeding with reset anyway\n");
-
-transaction_done:
-        pci_write_config_byte(dev, cappos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
-        mdelay(100
[RFC PATCH 3/3] PCI: support Secondary Bus Reset
The PCI-to-PCI Bridge Architecture Specification, rev 1.2, specifies that the Secondary Bus Reset bit can force the assertion of RST# on the secondary interface, which can be used to reset all devices, including subordinates, under this bus. This can be used to reset a function if this function is the only device under this bus.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/pci.c |   31 +++++++++++++++++++++++++++++++
 1 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e459a0b..a77c33a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2115,6 +2115,33 @@ static int pci_pm_flr(struct pci_dev *dev, int probe)
         return 0;
 }
 
+static int pci_secondary_bus_reset(struct pci_dev *dev, int probe)
+{
+        u16 ctrl;
+        struct pci_dev *pdev;
+
+        if (dev->subordinate)
+                return -ENOTTY;
+
+        list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+                if (pdev != dev)
+                        return -ENOTTY;
+
+        if (probe)
+                return 0;
+
+        pci_read_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, &ctrl);
+        ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+        pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
+        msleep(100);
+
+        ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+        pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
+        msleep(100);
+
+        return 0;
+}
+
 static int pci_dev_reset(struct pci_dev *dev, int probe)
 {
         int rc;
@@ -2136,6 +2163,10 @@ static int pci_dev_reset(struct pci_dev *dev, int probe)
                 goto done;
 
         rc = pci_pm_flr(dev, probe);
+        if (rc != -ENOTTY)
+                goto done;
+
+        rc = pci_secondary_bus_reset(dev, probe);
 done:
         up(&dev->dev.sem);
-- 
1.5.6.4
Re: [PATCH v4 0/6] PCI: support the ATS capability
On Sun, Mar 29, 2009 at 09:51:31PM +0800, Matthew Wilcox wrote:
On Thu, Mar 26, 2009 at 04:15:56PM -0700, Jesse Barnes wrote:

2, avoid using pci_find_ext_capability every time when reading ATS Invalidate Queue Depth (Matthew Wilcox)

I asked a question about how that was used, and got back a version which changed how it was done. I still don't have an answer to my question.

VT-d hardware is designed such that the Invalidate Queue Depth is used every time the software prepares an Invalidate Request descriptor. This happens when the device's IOMMU mapping changes (i.e. the device driver calls DMA map/unmap if the device is in use by the host; or when a guest is started/destroyed if the device is assigned to that guest).

Given that DMA map/unmap are used very frequently, I suppose the queue depth should be cached somewhere. It used to be cached in the VT-d private data structure (before v3) because I wasn't sure how IOMMU hardware from other vendors uses the queue depth.

After you commented on the code, I checked the AMD/IBM/Sun IOMMUs: the AMD IOMMU also uses the queue depth for every Invalidate Request descriptor; the IBM/Sun IOMMUs don't appear to support ATS.

So it's reasonable to cache the queue depth in the PCI subsystem since all IOMMUs that support ATS use the queue depth in the same way (very frequently), right?
[PATCH v4 0/6] PCI: support the ATS capability
This patch series implements Address Translation Service support for the Intel IOMMU. A PCIe Endpoint that supports the ATS capability can request DMA address translations from the IOMMU and cache the translations itself. This can alleviate IOMMU TLB pressure and improve hardware performance in an I/O virtualization environment.

ATS is one of the PCI-SIG I/O Virtualization (IOV) Specifications. The spec can be found at: http://www.pcisig.com/specifications/iov/ats/ (it requires membership).

Changelog:
v3 -> v4:
  1, coding style fixes (Grant Grundler)
  2, support the Virtual Function ATS capability
v2 -> v3:
  1, throw error message if VT-d hardware detects invalid descriptor on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS Invalidate Queue Depth (Matthew Wilcox)
v1 -> v2:
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)

Yu Zhao (6):
  PCI: support the ATS capability
  PCI: handle Virtual Function ATS enabling
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c          |  189 +++++++++++++++++++++++++++++++++++++++---
 drivers/pci/intel-iommu.c   |  139 ++++++++++++++++++++++++++++--
 drivers/pci/iov.c           |  155 ++++++++++++++++++++++++++++++++--
 drivers/pci/pci.h           |   39 +++++++++
 include/linux/dmar.h        |    9 ++
 include/linux/intel-iommu.h |   16 +++-
 include/linux/pci.h         |    2 +
 include/linux/pci_regs.h    |   10 +++
 8 files changed, 514 insertions(+), 45 deletions(-)
[PATCH v4 1/6] PCI: support the ATS capability
The PCIe ATS capability enables the Endpoint to request DMA address translations from the IOMMU and cache the translations on the device side, thus alleviating IOMMU TLB pressure and improving hardware performance in an I/O virtualization environment.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/iov.c        |  105 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.h        |   37 ++++++++++++++++
 include/linux/pci.h      |    2 +
 include/linux/pci_regs.h |   10 ++++
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 7227efc..8a9817c 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -5,6 +5,7 @@
  *
  * PCI Express I/O Virtualization (IOV) support.
  *   Single Root IOV 1.0
+ *   Address Translation Service 1.0
  */
 
 #include <linux/pci.h>
@@ -678,3 +679,107 @@ irqreturn_t pci_sriov_migration(struct pci_dev *dev)
         return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_migration);
+
+static int ats_alloc_one(struct pci_dev *dev, int pgshift)
+{
+        int pos;
+        u16 cap;
+        struct pci_ats *ats;
+
+        pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+        if (!pos)
+                return -ENODEV;
+
+        ats = kzalloc(sizeof(*ats), GFP_KERNEL);
+        if (!ats)
+                return -ENOMEM;
+
+        ats->pos = pos;
+        ats->stu = pgshift;
+        pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+        ats->qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+                                            PCI_ATS_MAX_QDEP;
+        dev->ats = ats;
+
+        return 0;
+}
+
+static void ats_free_one(struct pci_dev *dev)
+{
+        kfree(dev->ats);
+        dev->ats = NULL;
+}
+
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @pgshift: the IOMMU page shift
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_ats(struct pci_dev *dev, int pgshift)
+{
+        int rc;
+        u16 ctrl;
+
+        BUG_ON(dev->ats);
+
+        if (pgshift < PCI_ATS_MIN_STU)
+                return -EINVAL;
+
+        rc = ats_alloc_one(dev, pgshift);
+        if (rc)
+                return rc;
+
+        ctrl = PCI_ATS_CTRL_ENABLE;
+        ctrl |= PCI_ATS_CTRL_STU(pgshift - PCI_ATS_MIN_STU);
+        pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+        return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+        u16 ctrl;
+
+        BUG_ON(!dev->ats);
+
+        pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
+        ctrl &= ~PCI_ATS_CTRL_ENABLE;
+        pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+        ats_free_one(dev);
+}
+
+/**
+ * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or negative on failure.
+ *
+ * The ATS spec uses 0 in the Invalidate Queue Depth field to
+ * indicate that the function can accept 32 Invalidate Requests.
+ * But here we use the `real' values (i.e. 1~32) for the Queue
+ * Depth.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+        int pos;
+        u16 cap;
+
+        if (dev->ats)
+                return dev->ats->qdep;
+
+        pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+        if (!pos)
+                return -ENODEV;
+
+        pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+
+        return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+                                       PCI_ATS_MAX_QDEP;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index dd7c63f..9f0db6a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -218,6 +218,13 @@ struct pci_sriov {
         u8 __iomem *mstate;     /* VF Migration State Array */
 };
 
+/* Address Translation Service */
+struct pci_ats {
+        int pos;        /* capability position */
+        int stu;        /* Smallest Translation Unit */
+        int qdep;       /* Invalidate Queue Depth */
+};
+
 #ifdef CONFIG_PCI_IOV
 extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
@@ -225,6 +232,20 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
                                 enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
 extern int pci_iov_bus_range(struct pci_bus *bus);
+
+extern int pci_enable_ats(struct pci_dev *dev, int pgshift);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+        return !!dev->ats;
+}
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -246,6 +267,22 @@ static inline int pci_iov_bus_range(struct pci_bus
[PATCH v4 3/6] VT-d: parse ATSR in DMA Remapping Reporting Structure
Parse the Root Port ATS Capability Reporting Structure in the DMA Remapping Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c          |  112 ++++++++++++++++++++--
 include/linux/dmar.h        |    9 ++++
 include/linux/intel-iommu.h |    1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index 26c536b..106bc45 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -254,6 +254,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
 	}
 	return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+	struct acpi_dmar_atsr *atsr;
+	struct dmar_atsr_unit *atsru;
+
+	atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+	atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+	if (!atsru)
+		return -ENOMEM;
+
+	atsru->hdr = hdr;
+	atsru->include_all = atsr->flags & 0x1;
+
+	list_add(&atsru->list, &dmar_atsr_units);
+
+	return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+	int rc;
+	struct acpi_dmar_atsr *atsr;
+
+	if (atsru->include_all)
+		return 0;
+
+	atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+	rc = dmar_parse_dev_scope((void *)(atsr + 1),
+				  (void *)atsr + atsr->header.length,
+				  &atsru->devices_cnt, &atsru->devices,
+				  atsr->segment);
+	if (rc || !atsru->devices_cnt) {
+		list_del(&atsru->list);
+		kfree(atsru);
+	}
+
+	return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+	int i;
+	struct pci_bus *bus;
+	struct acpi_dmar_atsr *atsr;
+	struct dmar_atsr_unit *atsru;
+
+	list_for_each_entry(atsru, &dmar_atsr_units, list) {
+		atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+		if (atsr->segment == pci_domain_nr(dev->bus))
+			goto found;
+	}
+
+	return 0;
+
+found:
+	for (bus = dev->bus; bus; bus = bus->parent) {
+		struct pci_dev *bridge = bus->self;
+
+		if (!bridge || !bridge->is_pcie ||
+		    bridge->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+			return 0;
+
+		if (bridge->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+			for (i = 0; i < atsru->devices_cnt; i++)
+				if (atsru->devices[i] == bridge)
+					return 1;
+			break;
+		}
+	}
+
+	if (atsru->include_all)
+		return 1;
+
+	return 0;
+}
 #endif

 static void __init
@@ -261,22 +339,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header)
 {
 	struct acpi_dmar_hardware_unit *drhd;
 	struct acpi_dmar_reserved_memory *rmrr;
+	struct acpi_dmar_atsr *atsr;

 	switch (header->type) {
 	case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-		drhd = (struct acpi_dmar_hardware_unit *)header;
+		drhd = container_of(header, struct acpi_dmar_hardware_unit,
+				    header);
 		printk(KERN_INFO PREFIX
-		       "DRHD (flags: 0x%08x)base: 0x%016Lx\n",
-		       drhd->flags, (unsigned long long)drhd->address);
+		       "DRHD base: %#016Lx flags: %#x\n",
+		       (unsigned long long)drhd->address, drhd->flags);
 		break;
 	case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-		rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+		rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+				    header);
 		printk(KERN_INFO PREFIX
-		       "RMRR base: 0x%016Lx end: 0x%016Lx\n",
+		       "RMRR base: %#016Lx end: %#016Lx\n",
 		       (unsigned long long)rmrr->base_address,
 		       (unsigned long long)rmrr->end_address);
 		break;
+	case ACPI_DMAR_TYPE_ATSR:
+		atsr = container_of(header, struct acpi_dmar_atsr, header);
+		printk(KERN_INFO PREFIX "ATSR flags: %#x\n", atsr->flags);
+		break;
 	}
 }

@@ -349,6 +433,11 @@ parse_dmar_table(void)
 			ret = dmar_parse_one_rmrr(entry_header);
 #endif
 			break;
+		case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+			ret = dmar_parse_one_atsr(entry_header);
+#endif
+			break;
 		default:
 			printk(KERN_WARNING PREFIX
 				"Unknown DMAR structure type\n");
@@ -417,11 +506,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
 	{
 		struct
[PATCH v4 2/6] PCI: handle Virtual Function ATS enabling
The SR-IOV spec requires the Smallest Translation Unit and the Invalidate Queue Depth fields in the Virtual Function ATS capability to be hardwired to 0. If a function is a Virtual Function, check and set its Physical Function's STU before enabling the ATS.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   66 +++++++++++++++++++++++++++++++++++++++-------------
 drivers/pci/pci.h |    4 ++-
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 8a9817c..0bf23fc 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -491,10 +491,10 @@ found:
 	if (pdev)
 		iov->dev = pci_dev_get(pdev);
-	else {
+	else
 		iov->dev = dev;
-		mutex_init(&iov->lock);
-	}
+
+	mutex_init(&iov->lock);

 	dev->sriov = iov;
 	dev->is_physfn = 1;
@@ -514,11 +514,11 @@ static void sriov_release(struct pci_dev *dev)
 {
 	BUG_ON(dev->sriov->nr_virtfn);

-	if (dev == dev->sriov->dev)
-		mutex_destroy(&dev->sriov->lock);
-	else
+	if (dev != dev->sriov->dev)
 		pci_dev_put(dev->sriov->dev);

+	mutex_destroy(&dev->sriov->lock);
+
 	kfree(dev->sriov);
 	dev->sriov = NULL;
 }
@@ -722,19 +722,40 @@ int pci_enable_ats(struct pci_dev *dev, int pgshift)
 	int rc;
 	u16 ctrl;

-	BUG_ON(dev->ats);
+	BUG_ON(dev->ats && dev->ats->is_enabled);

 	if (pgshift < PCI_ATS_MIN_STU)
 		return -EINVAL;

-	rc = ats_alloc_one(dev, pgshift);
-	if (rc)
-		return rc;
+	if (dev->is_physfn || dev->is_virtfn) {
+		struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+		mutex_lock(&pdev->sriov->lock);
+		if (pdev->ats)
+			rc = pdev->ats->stu == pgshift ? 0 : -EINVAL;
+		else
+			rc = ats_alloc_one(pdev, pgshift);
+
+		if (!rc)
+			pdev->ats->ref_cnt++;
+		mutex_unlock(&pdev->sriov->lock);
+		if (rc)
+			return rc;
+	}
+
+	if (!dev->is_physfn) {
+		rc = ats_alloc_one(dev, pgshift);
+		if (rc)
+			return rc;
+	}

 	ctrl = PCI_ATS_CTRL_ENABLE;
-	ctrl |= PCI_ATS_CTRL_STU(pgshift - PCI_ATS_MIN_STU);
+	if (!dev->is_virtfn)
+		ctrl |= PCI_ATS_CTRL_STU(pgshift - PCI_ATS_MIN_STU);
 	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);

+	dev->ats->is_enabled = 1;
+
 	return 0;
 }

@@ -746,13 +767,26 @@ void pci_disable_ats(struct pci_dev *dev)
 {
 	u16 ctrl;

-	BUG_ON(!dev->ats);
+	BUG_ON(!dev->ats || !dev->ats->is_enabled);

 	pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
 	ctrl &= ~PCI_ATS_CTRL_ENABLE;
 	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);

-	ats_free_one(dev);
+	dev->ats->is_enabled = 0;
+
+	if (dev->is_physfn || dev->is_virtfn) {
+		struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+		mutex_lock(&pdev->sriov->lock);
+		pdev->ats->ref_cnt--;
+		if (!pdev->ats->ref_cnt)
+			ats_free_one(pdev);
+		mutex_unlock(&pdev->sriov->lock);
+	}
+
+	if (!dev->is_physfn)
+		ats_free_one(dev);
 }

 /**
@@ -764,13 +798,17 @@ void pci_disable_ats(struct pci_dev *dev)
  * The ATS spec uses 0 in the Invalidate Queue Depth field to
  * indicate that the function can accept 32 Invalidate Requests.
  * But here we use the `real' values (i.e. 1~32) for the Queue
- * Depth.
+ * Depth; and 0 indicates the function shares the Queue with
+ * other functions (doesn't exclusively own a Queue).
  */
 int pci_ats_queue_depth(struct pci_dev *dev)
 {
 	int pos;
 	u16 cap;

+	if (dev->is_virtfn)
+		return 0;
+
 	if (dev->ats)
 		return dev->ats->qdep;

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9f0db6a..8ecd185 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -223,6 +223,8 @@ struct pci_ats {
 	int pos;	/* capability position */
 	int stu;	/* Smallest Translation Unit */
 	int qdep;	/* Invalidate Queue Depth */
+	int ref_cnt;	/* Physical Function reference count */
+	int is_enabled:1; /* Enable bit is set */
 };

 #ifdef CONFIG_PCI_IOV
@@ -244,7 +246,7 @@ extern int pci_ats_queue_depth(struct pci_dev *dev);
  */
 static inline int pci_ats_enabled(struct pci_dev *dev)
 {
-	return !!dev->ats;
+	return dev->ats && dev->ats->is_enabled;
 }
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
--
1.5.6.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 6/6] VT-d: support the device IOTLB
Enable the device IOTLB (i.e. ATS) for both the bare metal and KVM environments.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c   |   99 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/intel-iommu.h |    1 +
 2 files changed, 97 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 3145368..799bbe5 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -127,6 +127,7 @@ static inline void context_set_fault_enable(struct context_entry *context)
 }

 #define CONTEXT_TT_MULTI_LEVEL	0
+#define CONTEXT_TT_DEV_IOTLB	1

 static inline void context_set_translation_type(struct context_entry *context,
 						unsigned long value)
@@ -242,6 +243,7 @@ struct device_domain_info {
 	struct list_head global; /* link to global list */
 	u8 bus;			/* PCI bus number */
 	u8 devfn;		/* PCI devfn number */
+	struct intel_iommu *iommu; /* IOMMU used by this device */
 	struct pci_dev *dev;	/* it's NULL for PCIE-to-PCI bridge */
 	struct dmar_domain *domain; /* pointer to domain */
 };
@@ -924,6 +926,80 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 	return 0;
 }

+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+	int found = 0;
+	unsigned long flags;
+	struct device_domain_info *info;
+	struct intel_iommu *iommu = device_to_iommu(bus, devfn);
+
+	if (!ecap_dev_iotlb_support(iommu->ecap))
+		return NULL;
+
+	if (!iommu->qi)
+		return NULL;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link)
+		if (info->bus == bus && info->devfn == devfn) {
+			found = 1;
+			break;
+		}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	if (!found || !info->dev)
+		return NULL;
+
+	if (!pci_find_ext_capability(info->dev, PCI_EXT_CAP_ID_ATS))
+		return NULL;
+
+	if (!dmar_find_matched_atsr_unit(info->dev))
+		return NULL;
+
+	info->iommu = iommu;
+
+	return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+	if (!info)
+		return;
+
+	pci_enable_ats(info->dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+	if (!info->dev || !pci_ats_enabled(info->dev))
+		return;
+
+	pci_disable_ats(info->dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+				  u64 addr, unsigned mask)
+{
+	int rc;
+	u16 sid, qdep;
+	unsigned long flags;
+	struct device_domain_info *info;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link) {
+		if (!info->dev || !pci_ats_enabled(info->dev))
+			continue;
+
+		sid = info->bus << 8 | info->devfn;
+		qdep = pci_ats_queue_depth(info->dev);
+		rc = qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
+		if (rc)
+			dev_err(&info->dev->dev, "flush IOTLB failed\n");
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 	u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -947,6 +1023,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 	rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
 				      DMA_TLB_PSI_FLUSH,
 				      non_present_entry_flush);
+	if (!rc && !non_present_entry_flush)
+		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
+
 	return rc;
 }

@@ -1471,6 +1550,7 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 	unsigned long ndomains;
 	int id;
 	int agaw;
+	struct device_domain_info *info;

 	pr_debug("Set context mapping for %02x:%02x.%d\n",
 		 bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1536,7 +1616,9 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 	context_set_domain_id(context, id);
 	context_set_address_width(context, iommu->agaw);
 	context_set_address_root(context, virt_to_phys(pgd));
-	context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+	info = iommu_support_dev_iotlb(domain, bus, devfn);
+	context_set_translation_type(context,
+		info ? CONTEXT_TT_DEV_IOTLB : CONTEXT_TT_MULTI_LEVEL);
 	context_set_fault_enable(context);
 	context_set_present(context);
 	domain_flush_cache(domain, context
Re: [PATCH v3 0/6] ATS capability support for Intel IOMMU
On Fri, Mar 20, 2009 at 07:15:51PM +0800, David Woodhouse wrote:
On Fri, 2009-03-20 at 10:47 +0800, Zhao, Yu wrote:

If it's possible, I'd like it to go through the PCI tree because the ATS depends on the SR-IOV. This dependency is not reflected in this v3 series since the SR-IOV is not in-tree and I don't want to break the build after people apply the ATS on their tree.

In what way will it depend on SR-IOV?

The SR-IOV spec section 3.7.4 says that the Smallest Translation Unit and the Invalidate Queue Depth fields in the Virtual Function's ATS capability are hardwired to 0. So we need some special handling when enabling the ATS capability for the Virtual Function.

Table 3-26: ATS Capability Register

------------+-----------------------------------------+---------------+--------------
Bit Location| PF and VF Register Differences From ATS | PF Attributes | VF Attributes
------------+-----------------------------------------+---------------+--------------
            | Smallest Translation Unit (STU)         |               |
   20:16    | Hardwired to 0 for VFs.                 |      ATS      |      RO
            | PF value applies to all VFs.            |               |
------------+-----------------------------------------+---------------+--------------
            | Invalidate Queue Depth                  |               |
   28:24    | Hardwired to 0 for VFs.                 |      ATS      |      RO
            | Depth of shared PF input queue.         |               |
------------+-----------------------------------------+---------------+--------------

So Dave, can I get an ack from you and let Jesse pull the IOMMU change to his tree? Or let this ATS go to 2.6.31?

Want to show the latest version of the patches which depend on SR-IOV, and I can ack them?

Sure, thanks!
Re: [PATCH v11 1/8] PCI: initialize and release SR-IOV capability
On Fri, Mar 20, 2009 at 03:53:12AM +0800, Matthew Wilcox wrote:
On Wed, Mar 11, 2009 at 03:25:42PM +0800, Yu Zhao wrote:

+config PCI_IOV
+	bool "PCI IOV support"
+	depends on PCI
+	help
+	  PCI-SIG I/O Virtualization (IOV) Specifications support.
+	  Single Root IOV: allows the creation of virtual PCI devices
+	  that share the physical resources from a real device.
+
+	  When in doubt, say N.

It's certainly shorter than my text, which is nice. But I think it still has too much spec-ese and not enough explanation. How about:

	help
	  I/O Virtualization is a PCI feature supported by some devices
	  which allows them to create virtual devices which share their
	  physical resources.

	  If unsure, say N.

Yes, it's more user-friendly.

+	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+		if (pdev->is_physfn)
+			break;
+	if (list_empty(&dev->bus->devices) || !pdev->is_physfn)
+		pdev = NULL;

This is still wrong. If the 'break' condition is not hit, pdev is pointing to garbage, not to the last pci_dev in the list.

Yes, you are right. I should have thought it over after you commented on it last time. So it looks like we need to make it as:

	ctrl = 0;
	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
		if (pdev->is_physfn)
			goto found;

	pdev = NULL;
	if (pci_ari_enabled(dev->bus))
		ctrl |= PCI_SRIOV_CTRL_ARI;

found:
	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);

...

@@ -270,6 +278,7 @@ struct pci_dev {
 	struct list_head msi_list;
 #endif
 	struct pci_vpd *vpd;
+	struct pci_sriov *sriov;	/* SR-IOV capability related */

Should be ifdeffed?

Yes, will do. Thank you for reviewing it. The patch series was applied to the Xen Domain0 tree 2 days ago, and I'll carry your comments back to the Xen tree too.
[PATCH v12 0/8] PCI: Linux kernel SR-IOV support
Greetings,

Following patches are intended to support SR-IOV capability in the Linux kernel. With these patches, people can turn a PCI device with the capability into multiple ones from the software perspective, which will benefit KVM and serve other purposes such as QoS, security, etc.

SR-IOV specification can be found at:
  http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
(it requires membership.)

Devices that support SR-IOV are available from following vendors:
  http://download.intel.com/design/network/ProdBrf/320025.pdf
  http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
  http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf

The patches to enable the SR-IOV capability of Intel 82576 NIC are available at (a.k.a Physical Function driver):
  http://patchwork.kernel.org/patch/8063/
  http://patchwork.kernel.org/patch/8064/
  http://patchwork.kernel.org/patch/8065/
  http://patchwork.kernel.org/patch/8066/

And the driver for Intel 82576 Virtual Function is available at:
  http://patchwork.kernel.org/patch/11029/
  http://patchwork.kernel.org/patch/11028/

Major changes from v11 to v12:
  1, fix using garbage entry pointer after the list_for_each (Matthew Wilcox)
  2, use #ifdef around SR-IOV structure in the pci_dev (Matthew Wilcox)
  3, enhance the Kconfig help text for the SR-IOV (Matthew Wilcox)
v10 -> v11:
  1, use pci_setup_device() to setup Virtual Function (Matthew Wilcox)
  2, various coding style fixes (Matthew Wilcox)
  3, wording and grammar fixes (Randy Dunlap)
v9 -> v10:
  1, minor fix in pci_restore_iov_state().
  2, respin against the latest tree.
v8 -> v9:
  1, put a might_sleep() into SR-IOV API which sleeps (Andi Kleen)
  2, block user config accesses before clearing VF Enable bit (Matthew Wilcox)

Yu Zhao (8):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: centralize device setup code
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |   27 ++
 Documentation/DocBook/kernel-api.tmpl   |    1 +
 Documentation/PCI/pci-iov-howto.txt     |   99 +++++
 drivers/pci/Kconfig                     |   10 +
 drivers/pci/Makefile                    |    2 +
 drivers/pci/iov.c                       |  680 +++++++++++++++++++++++++++++++
 drivers/pci/pci.c                       |    8 +
 drivers/pci/pci.h                       |   53 +++
 drivers/pci/probe.c                     |   86 +++--
 include/linux/pci.h                     |   34 ++
 include/linux/pci_regs.h                |   33 ++
 11 files changed, 994 insertions(+), 39 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c
[PATCH v12 1/8] PCI: initialize and release SR-IOV capability
If a device has the SR-IOV capability, initialize it (set the ARI Capable Hierarchy in the lowest numbered PF if necessary; calculate the System Page Size for the VF MMIO, and probe the VF Offset, Stride and BARs). A lock for the VF bus allocation is also initialized if a PF is the lowest numbered PF.

Reviewed-by: Matthew Wilcox wi...@linux.intel.com
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/Kconfig      |   10 ++
 drivers/pci/Makefile     |    2 +
 drivers/pci/iov.c        |  182 ++++++++++++++++++++++++++++++++++
 drivers/pci/pci.c        |    7 ++
 drivers/pci/pci.h        |   37 +++++++
 drivers/pci/probe.c      |    4 +
 include/linux/pci.h      |   11 +++
 include/linux/pci_regs.h |   33 +++++++
 8 files changed, 286 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..fdc864f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,13 @@ config HT_IRQ
 	   This allows native hypertransport devices to use interrupts.

 	   If unsure say Y.
+
+config PCI_IOV
+	bool "PCI IOV support"
+	depends on PCI
+	help
+	  I/O Virtualization is a PCI feature supported by some devices
+	  which allows them to create virtual devices which share their
+	  physical resources.
+
+	  If unsure, say N.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba6af16 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,8 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o

 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o

+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 000..66cc414
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,182 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao <yu.z...@intel.com>
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include <linux/pci.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include "pci.h"
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+	int i;
+	int rc;
+	int nres;
+	u32 pgsz;
+	u16 ctrl, total, offset, stride;
+	struct pci_sriov *iov;
+	struct resource *res;
+	struct pci_dev *pdev;
+
+	if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+	    dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
+		return -ENODEV;
+
+	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
+	if (ctrl & PCI_SRIOV_CTRL_VFE) {
+		pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+		ssleep(1);
+	}
+
+	pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
+	if (!total)
+		return 0;
+
+	ctrl = 0;
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+		if (pdev->is_physfn)
+			goto found;
+
+	pdev = NULL;
+	if (pci_ari_enabled(dev->bus))
+		ctrl |= PCI_SRIOV_CTRL_ARI;
+
+found:
+	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+	pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
+	if (!offset || (total > 1 && !stride))
+		return -EIO;
+
+	pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
+	i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+	pgsz &= ~((1 << i) - 1);
+	if (!pgsz)
+		return -EIO;
+
+	pgsz &= ~(pgsz - 1);
+	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+	nres = 0;
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = dev->resource + PCI_IOV_RESOURCES + i;
+		i += __pci_read_base(dev, pci_bar_unknown, res,
+				     pos + PCI_SRIOV_BAR + i * 4);
+		if (!res->flags)
+			continue;
+		if (resource_size(res) & (PAGE_SIZE - 1)) {
+			rc = -EIO;
+			goto failed;
+		}
+		res->end = res->start + resource_size(res) * total - 1;
+		nres++;
+	}
+
+	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+	if (!iov) {
+		rc = -ENOMEM;
+		goto failed;
+	}
+
+	iov->pos = pos;
+	iov->nres = nres;
+	iov->ctrl = ctrl;
+	iov->total = total;
+	iov->offset = offset;
+	iov->stride = stride;
+	iov->pgsz = pgsz;
+	iov->self = dev;
+	pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap);
+	pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link);
+
+	if (pdev)
+		iov->dev = pci_dev_get(pdev);
+	else
[PATCH v12 8/8] PCI: manual for SR-IOV user and driver developer
Reviewed-by: Randy Dunlap rdun...@xenotime.net
Reviewed-by: Matthew Wilcox wi...@linux.intel.com
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/DocBook/kernel-api.tmpl |    1 +
 Documentation/PCI/pci-iov-howto.txt   |   99 +++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index bc962cd..58c1945 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -199,6 +199,7 @@ X!Edrivers/pci/hotplug.c
 -->
 !Edrivers/pci/probe.c
 !Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
      </sect1>
      <sect1><title>PCI Hotplug Support Library</title>
 !Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 000..fc73ef5
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,99 @@
+		PCI Express I/O Virtualization Howto
+		Copyright (C) 2009 Intel Corporation
+		    Yu Zhao <yu.z...@intel.com>
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of VFs can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as a traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). And each VF also has PCI
+Memory Space, which is used to map its register set. The VF device
+driver operates on this register set, so the VF can be functional and
+appear as a real PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) controls the enabling and disabling
+of the capability via the API provided by the SR-IOV core. If the
+hardware has the SR-IOV capability, loading its PF driver will enable
+it and all VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+VFs are treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. A VF
+requires a device driver, the same as a normal PCI device does.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+	'nr_virtfn' is the number of VFs to be enabled.
+
+To disable SR-IOV capability:
+	void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+	irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev,
+			       const struct pci_device_id *id)
+{
+	pci_enable_sriov(dev, NR_VIRTFN);
+
+	...
+
+	return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+	pci_disable_sriov(dev);
+
+	...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+	...
+
+	return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+	...
+
+	return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+	...
+}
+
+static struct pci_driver dev_driver = {
+	.name		= "SR-IOV Physical Function driver",
+	.id_table	= dev_id_table,
+	.probe		= dev_probe,
+	.remove		= __devexit_p(dev_remove),
+	.suspend	= dev_suspend,
+	.resume		= dev_resume,
+	.shutdown	= dev_shutdown,
+};
--
1.5.6.4
Re: [PATCH v11 0/8] PCI: Linux kernel SR-IOV support
Hi Matthew,

Can you please take a look at this new version? I'd like to make sure that all concerns are addressed and I didn't miss something :-)

Thanks,
Yu

On Wed, Mar 11, 2009 at 03:25:41PM +0800, Yu Zhao wrote:

Greetings,

Following patches are intended to support SR-IOV capability in the Linux kernel. With these patches, people can turn a PCI device with the capability into multiple ones from the software perspective, which will benefit KVM and serve other purposes such as QoS, security, etc.

SR-IOV specification can be found at:
  http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
(it requires membership.)

Devices that support SR-IOV are available from following vendors:
  http://download.intel.com/design/network/ProdBrf/320025.pdf
  http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
  http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf

The patches to enable the SR-IOV capability of Intel 82576 NIC are available at (a.k.a Physical Function driver):
  http://patchwork.kernel.org/patch/8063/
  http://patchwork.kernel.org/patch/8064/
  http://patchwork.kernel.org/patch/8065/
  http://patchwork.kernel.org/patch/8066/

And the driver for Intel 82576 Virtual Function is available at:
  http://patchwork.kernel.org/patch/11029/
  http://patchwork.kernel.org/patch/11028/

Major changes from v10 to v11:
  1, use pci_setup_device() to setup Virtual Function (Matthew Wilcox)
  2, various coding style fixes (Matthew Wilcox)
  3, wording and grammar fixes (Randy Dunlap)
v9 -> v10:
  1, minor fix in pci_restore_iov_state().
  2, respin against the latest tree.
v8 -> v9:
  1, put a might_sleep() into SR-IOV API which sleeps (Andi Kleen)
  2, block user config accesses before clearing VF Enable bit (Matthew Wilcox)

Yu Zhao (8):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: centralize device setup code into pci_setup_device()
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |   27 ++
 Documentation/DocBook/kernel-api.tmpl   |    1 +
 Documentation/PCI/pci-iov-howto.txt     |   99 +++++
 drivers/pci/Kconfig                     |   10 +
 drivers/pci/Makefile                    |    2 +
 drivers/pci/iov.c                       |  677 +++++++++++++++++++++++++++++++
 drivers/pci/pci.c                       |    8 +
 drivers/pci/pci.h                       |   53 +++
 drivers/pci/probe.c                     |   86 +++--
 include/linux/pci.h                     |   32 ++
 include/linux/pci_regs.h                |   33 ++
 11 files changed, 989 insertions(+), 39 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c
[PATCH v11 1/8] PCI: initialize and release SR-IOV capability
If a device has the SR-IOV capability, initialize it (set the ARI Capable Hierarchy in the lowest numbered PF if necessary; calculate the System Page Size for the VF MMIO, and probe the VF Offset, Stride and BARs). A lock for the VF bus allocation is also initialized if a PF is the lowest numbered PF.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/Kconfig      |   10 ++
 drivers/pci/Makefile     |    2 +
 drivers/pci/iov.c        |  182 ++++++++++++++++++++++++++++++++++
 drivers/pci/pci.c        |    7 ++
 drivers/pci/pci.h        |   37 +++++++
 drivers/pci/probe.c      |    4 +
 include/linux/pci.h      |    9 ++
 include/linux/pci_regs.h |   33 +++++++
 8 files changed, 284 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..25cf360 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,13 @@ config HT_IRQ
 	   This allows native hypertransport devices to use interrupts.

 	   If unsure say Y.
+
+config PCI_IOV
+	bool "PCI IOV support"
+	depends on PCI
+	help
+	  PCI-SIG I/O Virtualization (IOV) Specifications support.
+	  Single Root IOV: allows the creation of virtual PCI devices
+	  that share the physical resources from a real device.
+
+	  When in doubt, say N.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba6af16 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,8 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o

 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o

+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 000..656216c
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,182 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao <yu.z...@intel.com>
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include <linux/pci.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include "pci.h"
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+	int i;
+	int rc;
+	int nres;
+	u32 pgsz;
+	u16 ctrl, total, offset, stride;
+	struct pci_sriov *iov;
+	struct resource *res;
+	struct pci_dev *pdev;
+
+	if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+	    dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
+		return -ENODEV;
+
+	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
+	if (ctrl & PCI_SRIOV_CTRL_VFE) {
+		pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+		ssleep(1);
+	}
+
+	pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
+	if (!total)
+		return 0;
+
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+		if (pdev->is_physfn)
+			break;
+	if (list_empty(&dev->bus->devices) || !pdev->is_physfn)
+		pdev = NULL;
+
+	ctrl = 0;
+	if (!pdev && pci_ari_enabled(dev->bus))
+		ctrl |= PCI_SRIOV_CTRL_ARI;
+
+	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+	pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
+	if (!offset || (total > 1 && !stride))
+		return -EIO;
+
+	pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
+	i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+	pgsz &= ~((1 << i) - 1);
+	if (!pgsz)
+		return -EIO;
+
+	pgsz &= ~(pgsz - 1);
+	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+	nres = 0;
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = dev->resource + PCI_IOV_RESOURCES + i;
+		i += __pci_read_base(dev, pci_bar_unknown, res,
+				     pos + PCI_SRIOV_BAR + i * 4);
+		if (!res->flags)
+			continue;
+		if (resource_size(res) & (PAGE_SIZE - 1)) {
+			rc = -EIO;
+			goto failed;
+		}
+		res->end = res->start + resource_size(res) * total - 1;
+		nres++;
+	}
+
+	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+	if (!iov) {
+		rc = -ENOMEM;
+		goto failed;
+	}
+
+	iov->pos = pos;
+	iov->nres = nres;
+	iov->ctrl = ctrl;
+	iov->total = total;
+	iov->offset = offset;
+	iov->stride = stride;
+	iov->pgsz = pgsz;
+	iov->self = dev;
+	pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap);
+	pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link);
+
+	if (pdev)
+		iov->dev
[PATCH v11 2/8] PCI: restore saved SR-IOV state
Restore the volatile registers in the SR-IOV capability after the D3-D0 transition. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 29 + drivers/pci/pci.c |1 + drivers/pci/pci.h |4 3 files changed, 34 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 656216c..8df2246 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -129,6 +129,25 @@ static void sriov_release(struct pci_dev *dev) dev-sriov = NULL; } +static void sriov_restore_state(struct pci_dev *dev) +{ + int i; + u16 ctrl; + struct pci_sriov *iov = dev-sriov; + + pci_read_config_word(dev, iov-pos + PCI_SRIOV_CTRL, ctrl); + if (ctrl PCI_SRIOV_CTRL_VFE) + return; + + for (i = PCI_IOV_RESOURCES; i = PCI_IOV_RESOURCE_END; i++) + pci_update_resource(dev, i); + + pci_write_config_dword(dev, iov-pos + PCI_SRIOV_SYS_PGSIZE, iov-pgsz); + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + if (iov-ctrl PCI_SRIOV_CTRL_VFE) + msleep(100); +} + /** * pci_iov_init - initialize the IOV capability * @dev: the PCI device @@ -180,3 +199,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno, return dev-sriov-pos + PCI_SRIOV_BAR + 4 * (resno - PCI_IOV_RESOURCES); } + +/** + * pci_restore_iov_state - restore the state of the IOV capability + * @dev: the PCI device + */ +void pci_restore_iov_state(struct pci_dev *dev) +{ + if (dev-is_physfn) + sriov_restore_state(dev); +} diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 2eba2a5..8e21912 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev) } pci_restore_pcix_state(dev); pci_restore_msi_state(dev); + pci_restore_iov_state(dev); return 0; } diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 196be5e..efd79a2 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev); extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int 
resno, enum pci_bar_type *type); +extern void pci_restore_iov_state(struct pci_dev *dev); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, { return 0; } +static inline void pci_restore_iov_state(struct pci_dev *dev) +{ +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ -- 1.6.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 4/8] PCI: centralize device setup code
Move the device setup stuff into pci_setup_device() which will be used to setup the Virtual Function later. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/pci.h |1 + drivers/pci/probe.c | 79 ++- 2 files changed, 41 insertions(+), 39 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 7abdef6..80ad848 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -178,6 +178,7 @@ enum pci_bar_type { pci_bar_mem64, /* A 64-bit memory BAR */ }; +extern int pci_setup_device(struct pci_dev *dev); extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int reg); extern int pci_resource_bar(struct pci_dev *dev, int resno, diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 4c8abd0..f4ca550 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -674,6 +674,19 @@ static void pci_read_irq(struct pci_dev *dev) dev-irq = irq; } +static void set_pcie_port_type(struct pci_dev *pdev) +{ + int pos; + u16 reg16; + + pos = pci_find_capability(pdev, PCI_CAP_ID_EXP); + if (!pos) + return; + pdev-is_pcie = 1; + pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, reg16); + pdev-pcie_type = (reg16 PCI_EXP_FLAGS_TYPE) 4; +} + #define LEGACY_IO_RESOURCE (IORESOURCE_IO | IORESOURCE_PCI_FIXED) /** @@ -683,12 +696,34 @@ static void pci_read_irq(struct pci_dev *dev) * Initialize the device structure with information about the device's * vendor,class,memory and IO-space addresses,IRQ lines etc. * Called at initialisation of the PCI subsystem and by CardBus services. - * Returns 0 on success and -1 if unknown type of device (not normal, bridge - * or CardBus). + * Returns 0 on success and negative if unknown type of device (not normal, + * bridge or CardBus). 
*/ -static int pci_setup_device(struct pci_dev * dev) +int pci_setup_device(struct pci_dev *dev) { u32 class; + u8 hdr_type; + struct pci_slot *slot; + + if (pci_read_config_byte(dev, PCI_HEADER_TYPE, hdr_type)) + return -EIO; + + dev-sysdata = dev-bus-sysdata; + dev-dev.parent = dev-bus-bridge; + dev-dev.bus = pci_bus_type; + dev-hdr_type = hdr_type 0x7f; + dev-multifunction = !!(hdr_type 0x80); + dev-cfg_size = pci_cfg_space_size(dev); + dev-error_state = pci_channel_io_normal; + set_pcie_port_type(dev); + + list_for_each_entry(slot, dev-bus-slots, list) + if (PCI_SLOT(dev-devfn) == slot-number) + dev-slot = slot; + + /* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer) + set this higher, assuming the system even supports it. */ + dev-dma_mask = 0x; dev_set_name(dev-dev, %04x:%02x:%02x.%d, pci_domain_nr(dev-bus), dev-bus-number, PCI_SLOT(dev-devfn), @@ -708,7 +743,6 @@ static int pci_setup_device(struct pci_dev * dev) /* Early fixups, before probing the BARs */ pci_fixup_device(pci_fixup_early, dev); - class = dev-class 8; switch (dev-hdr_type) {/* header type */ case PCI_HEADER_TYPE_NORMAL:/* standard header */ @@ -770,7 +804,7 @@ static int pci_setup_device(struct pci_dev * dev) default:/* unknown header */ dev_err(dev-dev, unknown header type %02x, ignoring device\n, dev-hdr_type); - return -1; + return -EIO; bad: dev_err(dev-dev, ignoring class %02x (doesn't match header @@ -804,19 +838,6 @@ static void pci_release_dev(struct device *dev) kfree(pci_dev); } -static void set_pcie_port_type(struct pci_dev *pdev) -{ - int pos; - u16 reg16; - - pos = pci_find_capability(pdev, PCI_CAP_ID_EXP); - if (!pos) - return; - pdev-is_pcie = 1; - pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, reg16); - pdev-pcie_type = (reg16 PCI_EXP_FLAGS_TYPE) 4; -} - /** * pci_cfg_space_size - get the configuration space size of the PCI device. 
* @dev: PCI device @@ -892,9 +913,7 @@ EXPORT_SYMBOL(alloc_pci_dev); static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn) { struct pci_dev *dev; - struct pci_slot *slot; u32 l; - u8 hdr_type; int delay = 1; if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)) @@ -921,34 +940,16 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn) } } - if (pci_bus_read_config_byte(bus, devfn, PCI_HEADER_TYPE, hdr_type)) - return NULL; - dev = alloc_pci_dev(); if (!dev) return NULL; dev-bus = bus; - dev-sysdata = bus-sysdata
[PATCH v11 3/8] PCI: reserve bus range for SR-IOV device
Reserve the bus number range used by the Virtual Function when pcibios_assign_all_busses() returns true. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 36 drivers/pci/pci.h |5 + drivers/pci/probe.c |3 +++ 3 files changed, 44 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 8df2246..fb8fab1 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -14,6 +14,18 @@ #include pci.h +static inline u8 virtfn_bus(struct pci_dev *dev, int id) +{ + return dev-bus-number + ((dev-devfn + dev-sriov-offset + + dev-sriov-stride * id) 8); +} + +static inline u8 virtfn_devfn(struct pci_dev *dev, int id) +{ + return (dev-devfn + dev-sriov-offset + + dev-sriov-stride * id) 0xff; +} + static int sriov_init(struct pci_dev *dev, int pos) { int i; @@ -209,3 +221,27 @@ void pci_restore_iov_state(struct pci_dev *dev) if (dev-is_physfn) sriov_restore_state(dev); } + +/** + * pci_iov_bus_range - find bus range used by Virtual Function + * @bus: the PCI bus + * + * Returns max number of buses (exclude current one) used by Virtual + * Functions. + */ +int pci_iov_bus_range(struct pci_bus *bus) +{ + int max = 0; + u8 busnr; + struct pci_dev *dev; + + list_for_each_entry(dev, bus-devices, bus_list) { + if (!dev-is_physfn) + continue; + busnr = virtfn_bus(dev, dev-sriov-total - 1); + if (busnr max) + max = busnr; + } + + return max ? 
max - bus-number : 0; +} diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index efd79a2..7abdef6 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); extern void pci_restore_iov_state(struct pci_dev *dev); +extern int pci_iov_bus_range(struct pci_bus *bus); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, static inline void pci_restore_iov_state(struct pci_dev *dev) { } +static inline int pci_iov_bus_range(struct pci_bus *bus) +{ + return 0; +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 03b6f29..4c8abd0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) for (devfn = 0; devfn 0x100; devfn += 8) pci_scan_slot(bus, devfn); + /* Reserve buses for SR-IOV capability. */ + max += pci_iov_bus_range(bus); + /* * After performing arch-dependent fixup of the bus, look behind * all PCI-to-PCI bridges on this bus. -- 1.6.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 6/8] PCI: handle SR-IOV Virtual Function Migration
Add or remove a Virtual Function after receiving a Migrate In or Out Request. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 119 +++ drivers/pci/pci.h |4 ++ include/linux/pci.h |6 +++ 3 files changed, 129 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 0a3af12..213fb61 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -179,6 +179,97 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset) pci_dev_put(dev); } +static int sriov_migration(struct pci_dev *dev) +{ + u16 status; + struct pci_sriov *iov = dev-sriov; + + if (!iov-nr_virtfn) + return 0; + + if (!(iov-cap PCI_SRIOV_CAP_VFM)) + return 0; + + pci_read_config_word(dev, iov-pos + PCI_SRIOV_STATUS, status); + if (!(status PCI_SRIOV_STATUS_VFM)) + return 0; + + schedule_work(iov-mtask); + + return 1; +} + +static void sriov_migration_task(struct work_struct *work) +{ + int i; + u8 state; + u16 status; + struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask); + + for (i = iov-initial; i iov-nr_virtfn; i++) { + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_MI) { + writeb(PCI_SRIOV_VFM_AV, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 1); + } else if (state == PCI_SRIOV_VFM_MO) { + virtfn_remove(iov-self, i, 1); + writeb(PCI_SRIOV_VFM_UA, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 0); + } + } + + pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); + status = ~PCI_SRIOV_STATUS_VFM; + pci_write_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); +} + +static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn) +{ + int bir; + u32 table; + resource_size_t pa; + struct pci_sriov *iov = dev-sriov; + + if (nr_virtfn = iov-initial) + return 0; + + pci_read_config_dword(dev, iov-pos + PCI_SRIOV_VFM, table); + bir = PCI_SRIOV_VFM_BIR(table); + if (bir 
PCI_STD_RESOURCE_END) + return -EIO; + + table = PCI_SRIOV_VFM_OFFSET(table); + if (table + nr_virtfn pci_resource_len(dev, bir)) + return -EIO; + + pa = pci_resource_start(dev, bir) + table; + iov-mstate = ioremap(pa, nr_virtfn); + if (!iov-mstate) + return -ENOMEM; + + INIT_WORK(iov-mtask, sriov_migration_task); + + iov-ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR; + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + return 0; +} + +static void sriov_disable_migration(struct pci_dev *dev) +{ + struct pci_sriov *iov = dev-sriov; + + iov-ctrl = ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR); + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + cancel_work_sync(iov-mtask); + iounmap(iov-mstate); +} + static int sriov_enable(struct pci_dev *dev, int nr_virtfn) { int rc; @@ -261,6 +352,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) goto failed; } + if (iov-cap PCI_SRIOV_CAP_VFM) { + rc = sriov_enable_migration(dev, nr_virtfn); + if (rc) + goto failed; + } + kobject_uevent(dev-dev.kobj, KOBJ_CHANGE); iov-nr_virtfn = nr_virtfn; @@ -290,6 +387,9 @@ static void sriov_disable(struct pci_dev *dev) if (!iov-nr_virtfn) return; + if (iov-cap PCI_SRIOV_CAP_VFM) + sriov_disable_migration(dev); + for (i = 0; i iov-nr_virtfn; i++) virtfn_remove(dev, i, 0); @@ -559,3 +659,22 @@ void pci_disable_sriov(struct pci_dev *dev) sriov_disable(dev); } EXPORT_SYMBOL_GPL(pci_disable_sriov); + +/** + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration + * @dev: the PCI device + * + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not. + * + * Physical Function driver is responsible to register IRQ handler using + * VF Migration Interrupt Message Number, and call this function when the + * interrupt is generated by the hardware. + */ +irqreturn_t pci_sriov_migration(struct pci_dev *dev) +{ + if (!dev-is_physfn) + return IRQ_NONE; + + return sriov_migration(dev) ? 
IRQ_HANDLED : IRQ_NONE; +} +EXPORT_SYMBOL_GPL(pci_sriov_migration); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 1bdace3
[PATCH v11 7/8] PCI: document SR-IOV sysfs entries
Signed-off-by: Yu Zhao yu.z...@intel.com --- Documentation/ABI/testing/sysfs-bus-pci | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index e638e15..36edf03 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -52,3 +52,30 @@ Description: that some devices may have malformatted data. If the underlying VPD has a writable section then the corresponding section of this file will be writable. + +What: /sys/bus/pci/devices/.../virtfnN +Date: March 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbolic link appears when hardware supports the SR-IOV + capability and the Physical Function driver has enabled it. + The symbolic link points to the PCI device sysfs entry of the + Virtual Function whose index is N (0...MaxVFs-1). + +What: /sys/bus/pci/devices/.../dep_link +Date: March 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbolic link appears when hardware supports the SR-IOV + capability and the Physical Function driver has enabled it, + and this device has vendor specific dependencies with others. + The symbolic link points to the PCI device sysfs entry of + Physical Function this device depends on. + +What: /sys/bus/pci/devices/.../physfn +Date: March 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbolic link appears when a device is a Virtual Function. + The symbolic link points to the PCI device sysfs entry of the + Physical Function this device associates with. -- 1.6.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 8/8] PCI: manual for SR-IOV user and driver developer
Signed-off-by: Yu Zhao yu.z...@intel.com --- Documentation/DocBook/kernel-api.tmpl |1 + Documentation/PCI/pci-iov-howto.txt | 99 + 2 files changed, 100 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index bc962cd..58c1945 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -199,6 +199,7 @@ X!Edrivers/pci/hotplug.c -- !Edrivers/pci/probe.c !Edrivers/pci/rom.c +!Edrivers/pci/iov.c /sect1 sect1titlePCI Hotplug Support Library/title !Edrivers/pci/hotplug/pci_hotplug_core.c diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 000..fc73ef5 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt @@ -0,0 +1,99 @@ + PCI Express I/O Virtualization Howto + Copyright (C) 2009 Intel Corporation + Yu Zhao yu.z...@intel.com + + +1. Overview + +1.1 What is SR-IOV + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as Physical Function (PF) +while the virtual devices are referred to as Virtual Functions (VF). +Allocation of the VF can be dynamically controlled by the PF via +registers encapsulated in the capability. By default, this feature is +not enabled and the PF behaves as traditional PCIe device. Once it's +turned on, each VF's PCI configuration space can be accessed by its own +Bus, Device and Function Number (Routing ID). And each VF also has PCI +Memory Space, which is used to map its register set. VF device driver +operates on the register set so it can be functional and appear as a +real existing PCI device. + +2. User Guide + +2.1 How can I enable SR-IOV capability + +The device driver (PF driver) will control the enabling and disabling +of the capability via API provided by SR-IOV core. 
If the hardware +has SR-IOV capability, loading its PF driver would enable it and all +VFs associated with the PF. + +2.2 How can I use the Virtual Functions + +The VF is treated as hot-plugged PCI devices in the kernel, so they +should be able to work in the same way as real PCI devices. The VF +requires device driver that is same as a normal PCI device's. + +3. Developer Guide + +3.1 SR-IOV API + +To enable SR-IOV capability: + int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); + 'nr_virtfn' is number of VFs to be enabled. + +To disable SR-IOV capability: + void pci_disable_sriov(struct pci_dev *dev); + +To notify SR-IOV core of Virtual Function Migration: + irqreturn_t pci_sriov_migration(struct pci_dev *dev); + +3.2 Usage example + +Following piece of code illustrates the usage of the SR-IOV API. + +static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id) +{ + pci_enable_sriov(dev, NR_VIRTFN); + + ... + + return 0; +} + +static void __devexit dev_remove(struct pci_dev *dev) +{ + pci_disable_sriov(dev); + + ... +} + +static int dev_suspend(struct pci_dev *dev, pm_message_t state) +{ + ... + + return 0; +} + +static int dev_resume(struct pci_dev *dev) +{ + ... + + return 0; +} + +static void dev_shutdown(struct pci_dev *dev) +{ + ... +} + +static struct pci_driver dev_driver = { + .name = SR-IOV Physical Function driver, + .id_table = dev_id_table, + .probe =dev_probe, + .remove = __devexit_p(dev_remove), + .suspend = dev_suspend, + .resume = dev_resume, + .shutdown = dev_shutdown, +}; -- 1.6.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 5/8] PCI: add SR-IOV API for Physical Function driver
Add or remove the Virtual Function when the SR-IOV is enabled or disabled by the device driver. This can happen anytime rather than only at the device probe stage. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 314 +++ drivers/pci/pci.h |2 + include/linux/pci.h | 19 +++- 3 files changed, 334 insertions(+), 1 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index fb8fab1..0a3af12 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -13,6 +13,7 @@ #include linux/delay.h #include pci.h +#define VIRTFN_ID_LEN 16 static inline u8 virtfn_bus(struct pci_dev *dev, int id) { @@ -26,6 +27,284 @@ static inline u8 virtfn_devfn(struct pci_dev *dev, int id) dev-sriov-stride * id) 0xff; } +static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr) +{ + int rc; + struct pci_bus *child; + + if (bus-number == busnr) + return bus; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + if (child) + return child; + + child = pci_add_new_bus(bus, NULL, busnr); + if (!child) + return NULL; + + child-subordinate = busnr; + child-dev.parent = bus-bridge; + rc = pci_bus_add_child(child); + if (rc) { + pci_remove_bus(child); + return NULL; + } + + return child; +} + +static void virtfn_remove_bus(struct pci_bus *bus, int busnr) +{ + struct pci_bus *child; + + if (bus-number == busnr) + return; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + BUG_ON(!child); + + if (list_empty(child-devices)) + pci_remove_bus(child); +} + +static int virtfn_add(struct pci_dev *dev, int id, int reset) +{ + int i; + int rc; + u64 size; + char buf[VIRTFN_ID_LEN]; + struct pci_dev *virtfn; + struct resource *res; + struct pci_sriov *iov = dev-sriov; + + virtfn = alloc_pci_dev(); + if (!virtfn) + return -ENOMEM; + + mutex_lock(iov-dev-sriov-lock); + virtfn-bus = virtfn_add_bus(dev-bus, virtfn_bus(dev, id)); + if (!virtfn-bus) { + kfree(virtfn); + mutex_unlock(iov-dev-sriov-lock); + return -ENOMEM; + } + virtfn-devfn = virtfn_devfn(dev, id); + virtfn-vendor = 
dev-vendor; + pci_read_config_word(dev, iov-pos + PCI_SRIOV_VF_DID, virtfn-device); + pci_setup_device(virtfn); + virtfn-dev.parent = dev-dev.parent; + + for (i = 0; i PCI_SRIOV_NUM_BARS; i++) { + res = dev-resource + PCI_IOV_RESOURCES + i; + if (!res-parent) + continue; + virtfn-resource[i].name = pci_name(virtfn); + virtfn-resource[i].flags = res-flags; + size = resource_size(res); + do_div(size, iov-total); + virtfn-resource[i].start = res-start + size * id; + virtfn-resource[i].end = virtfn-resource[i].start + size - 1; + rc = request_resource(res, virtfn-resource[i]); + BUG_ON(rc); + } + + if (reset) + pci_execute_reset_function(virtfn); + + pci_device_add(virtfn, virtfn-bus); + mutex_unlock(iov-dev-sriov-lock); + + virtfn-physfn = pci_dev_get(dev); + virtfn-is_virtfn = 1; + + rc = pci_bus_add_device(virtfn); + if (rc) + goto failed1; + sprintf(buf, virtfn%u, id); + rc = sysfs_create_link(dev-dev.kobj, virtfn-dev.kobj, buf); + if (rc) + goto failed1; + rc = sysfs_create_link(virtfn-dev.kobj, dev-dev.kobj, physfn); + if (rc) + goto failed2; + + kobject_uevent(virtfn-dev.kobj, KOBJ_CHANGE); + + return 0; + +failed2: + sysfs_remove_link(dev-dev.kobj, buf); +failed1: + pci_dev_put(dev); + mutex_lock(iov-dev-sriov-lock); + pci_remove_bus_device(virtfn); + virtfn_remove_bus(dev-bus, virtfn_bus(dev, id)); + mutex_unlock(iov-dev-sriov-lock); + + return rc; +} + +static void virtfn_remove(struct pci_dev *dev, int id, int reset) +{ + char buf[VIRTFN_ID_LEN]; + struct pci_bus *bus; + struct pci_dev *virtfn; + struct pci_sriov *iov = dev-sriov; + + bus = pci_find_bus(pci_domain_nr(dev-bus), virtfn_bus(dev, id)); + if (!bus) + return; + + virtfn = pci_get_slot(bus, virtfn_devfn(dev, id)); + if (!virtfn) + return; + + pci_dev_put(virtfn); + + if (reset) { + device_release_driver(virtfn-dev); + pci_execute_reset_function(virtfn); + } + + sprintf(buf, virtfn%u, id); + sysfs_remove_link(dev-dev.kobj, buf); + sysfs_remove_link(virtfn-dev.kobj, physfn); + + 
mutex_lock(&iov->dev->sriov->lock);
Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
On Sat, Mar 07, 2009 at 04:08:10AM +0800, Matthew Wilcox wrote: On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote: +config PCI_IOV + bool PCI IOV support + depends on PCI + select PCI_MSI My understanding is that having 'select' of a config symbol that the user can choose is bad. I think we should probably make this 'depends on PCI_MSI'. PCI MSI can also be disabled at runtime (and Fedora do by default). Since SR-IOV really does require MSI, we need to put in a runtime check to see if pci_msi_enabled() is false. Actually the SR-IOV doesn't really depend on the MSI (e.g. hardware doesn't implement interrupt at all), but in most case the SR-IOV needs the MSI. The selection is intended to make life easier. Anyway I'll remove it if people want more flexibility (and possibility to break the PF driver). We don't depend on PCIEPORTBUS (a horribly named symbol). Should we? SR-IOV is only supported for PCI Express machines. I'm not sure of the right answer here, but I thought I should raise the question. I think we don't need PCIe port bus framework. My understanding is it's for those capabilities that want to share resources of the PCIe capability. + default n You don't need this -- the default default is n ;-) + help + PCI-SIG I/O Virtualization (IOV) Specifications support. + Single Root IOV: allows the Physical Function driver to enable + the hardware capability, so the Virtual Function is accessible + via the PCI Configuration Space using its own Bus, Device and + Function Numbers. Each Virtual Function also has the PCI Memory + Space to map the device specific register set. I'm not convinced this is the most helpful we could be to the user who's configuring their own kernel. How about something like this? (Randy, I particularly look to you to make my prose less turgid). help IO Virtualisation is a PCI feature supported by some devices which allows you to create virtual PCI devices and assign them to guest OSes. 
This option needs to be selected in the host or Dom0 kernel, but does not need to be selected in the guest or DomU kernel. If you don't know whether your hardware supports it, you can check by using lspci to look for the SR-IOV capability. If you have no idea what any of that means, it is safe to answer 'N' here. diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 3d07ce2..ba99282 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o +# PCI IOV support +obj-$(CONFIG_PCI_IOV) += iov.o I see you're following the gerneal style in this file, but the comments really add no value. I should send a patch to take out the existing ones. + list_for_each_entry(pdev, dev-bus-devices, bus_list) + if (pdev-sriov) + break; + if (list_empty(dev-bus-devices) || !pdev-sriov) + pdev = NULL; + ctrl = 0; + if (!pdev pci_ari_enabled(dev-bus)) + ctrl |= PCI_SRIOV_CTRL_ARI; + I don't like this loop. At the end of a list_for_each_entry() loop, pdev will not be pointing at a pci_device, it'll be pointing to some offset from dev-bus-devices. So checking pdev-sriov at this point is really, really bad. I would prefer to see something like this: ctrl = 0; list_for_each_entry(pdev, dev-bus-devices, bus_list) { if (pdev-sriov) goto ari_enabled; } if (pci_ari_enabled(dev-bus)) ctrl = PCI_SRIOV_CTRL_ARI; ari_enabled: pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); I guess I should put some comments here. What I want to do is to find the lowest numbered PF (pdev) if it exists. It has ARI Capable Hierarchy bit, as you have figured out, and it also keeps the VF bus lock. The lock is for those VFs who belong to different PFs within a SR-IOV device and reside on different bus (virtual) than PF's. 
When the PF driver enables/disables the SR-IOV of a PF (this may happen anytime, not only at the driver probe stage), the virtual VF bus will be allocated if it hasn't been allocated yet. The lock guards the VF bus allocation between different PFs whose VFs share the VF bus. + if (pdev) + iov-pdev = pci_dev_get(pdev); + else { + iov-pdev = dev; + mutex_init(iov-lock); + } Now I'm confused. Why don't we need to init the mutex if there's another device on the same bus which also has an iov capability? Yes, that's what it means :-) +static void sriov_release(struct pci_dev *dev) +{ + if (dev == dev-sriov-pdev) + mutex_destroy(dev-sriov-lock); + else + pci_dev_put(dev-sriov-pdev); + + kfree(dev-sriov
Re: [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device
On Sat, Mar 07, 2009 at 04:20:24AM +0800, Matthew Wilcox wrote: On Fri, Feb 20, 2009 at 02:54:44PM +0800, Yu Zhao wrote: +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) +{ + u16 bdf; + + bdf = (dev-bus-number 8) + dev-devfn + + dev-sriov-offset + dev-sriov-stride * id; + *busnr = bdf 8; + *devfn = bdf 0xff; +} I find the interface here a bit clunky -- a function returning void while having two OUT parameters. How about this variation on the theme (viewers are encouraged to come up with their own preferred implementations and interfaces): static inline __pure u16 virtfn_bdf(struct pci_dev *dev, int id) { return (dev-bus-number 8) + dev-devfn + dev-sriov-offset + dev-sriov-stride * id; } #define VIRT_BUS(dev, id) (virtfn_bdf(dev, id) 8) #define VIRT_DEVFN(dev, id) (virtfn_bdf(dev, id) 0xff) We rely on GCC to do CSE and not actually invoke virtfn_bdf more than once. Yes, that's a good idea. Will replace that function with macros. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
On Sat, Mar 07, 2009 at 04:37:18AM +0800, Matthew Wilcox wrote: On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote: + virtfn-sysdata = dev-bus-sysdata; + virtfn-dev.parent = dev-dev.parent; + virtfn-dev.bus = dev-dev.bus; + virtfn-devfn = devfn; + virtfn-hdr_type = PCI_HEADER_TYPE_NORMAL; + virtfn-cfg_size = PCI_CFG_SPACE_EXP_SIZE; + virtfn-error_state = pci_channel_io_normal; + virtfn-current_state = PCI_UNKNOWN; + virtfn-is_pcie = 1; + virtfn-pcie_type = PCI_EXP_TYPE_ENDPOINT; + virtfn-dma_mask = 0x; + virtfn-vendor = dev-vendor; + virtfn-subsystem_vendor = dev-subsystem_vendor; + virtfn-class = dev-class; There seems to be a certain amount of commonality between this and pci_scan_device(). Have you considered trying to make a common helper function, or does it not work out well? It's doable. Will enhance the pci_setup_device and use it to setup the VF. + pci_device_add(virtfn, virtfn-bus); Greg is probably going to ding you here for adding the device, then creating the symlinks. I believe it's now best practice to create the symlinks first, so there's no window where userspace can get confused. Yes, but unfortunately we can't create links before adding a device. I double checked device_add(), there is no place for those links to be created before it sends uevent. So for now, we have to trigger another uevent for those links. + mutex_unlock(iov-pdev-sriov-lock); I question the existance of this mutex now. What's it protecting? Aren't we going to be implicitly protected by virtue of the Physical Function device driver being the only one calling this function, and the driver will be calling it from the -probe routine which is not called simultaneously for the same device. The PF driver patches I listed before support dynamical enabling/disabling of the SR-IOV through sysfs interface. So we have to protect the VF bus allocation as I explained before. 
+ virtfn->physfn = pci_dev_get(dev); + + rc = pci_bus_add_device(virtfn); + if (rc) + goto failed1; + sprintf(buf, "%d", id); %u, perhaps? And maybe 'id' should always be unsigned? Just a thought. Yes, will replace %d with %u. + rc = sysfs_create_link(&iov->dev.kobj, &virtfn->dev.kobj, buf); + if (rc) + goto failed1; + rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn"); + if (rc) + goto failed2; I'm glad to see these symlinks documented in later patches! + nres = 0; + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { + res = dev->resource + PCI_SRIOV_RESOURCES + i; + if (!res->parent) + continue; + nres++; + } Can't this be written more simply as: for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { res = dev->resource + PCI_SRIOV_RESOURCES + i; if (res->parent) nres++; } Yes, will do. + if (nres != iov->nres) { + dev_err(&dev->dev, "no enough MMIO for SR-IOV\n"); + return -ENOMEM; + } Randy, can you help us out with better wording here? + dev_err(&dev->dev, "no enough bus range for SR-IOV\n"); and here. + if (iov->link != dev->devfn) { + rc = -ENODEV; + list_for_each_entry(link, &dev->bus->devices, bus_list) { + if (link->sriov && link->devfn == iov->link) + rc = sysfs_create_link(&iov->dev.kobj, + &link->dev.kobj, "dep_link"); I skipped to the end and read patch 7/7 and I still don't understand what dep_link is for. Can you explain please? In particular, how is it different from physfn? It's defined by the spec as: 3.3.8. Function Dependency Link (12h) The programming model for a Device may have vendor specific dependencies between sets of Functions. The Function Dependency Link field is used to describe these dependencies. This field describes dependencies between PFs. VF dependencies are the same as the dependencies of their associated PFs. If a PF is independent from other PFs of a Device, this field shall contain its own Function Number. If a PF is dependent on other PFs of a Device, this field shall contain the Function Number of the next PF in the same Function Dependency List. 
The last PF in a Function Dependency List shall contain the Function Number of the first PF in the Function Dependency List. If PF p and PF q are in the same Function Dependency List, then any SI that is assigned VF p,n shall also be assigned to VF q,n. Thanks, Yu
Re: [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration
On Sat, Mar 07, 2009 at 05:13:41AM +0800, Matthew Wilcox wrote: On Fri, Feb 20, 2009 at 02:54:46PM +0800, Yu Zhao wrote: +static int sriov_migration(struct pci_dev *dev) +{ + u16 status; + struct pci_sriov *iov = dev->sriov; + + if (!iov->nr_virtfn) + return 0; + + if (!(iov->cap & PCI_SRIOV_CAP_VFM)) + return 0; + + pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status); You passed in dev here, you don't need to use iov->self, right? Will do. + if (!(status & PCI_SRIOV_STATUS_VFM)) + return 0; + + schedule_work(&iov->mtask); + + return 1; +} +/** + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration + * @dev: the PCI device + * + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not. + * + * Physical Function driver is responsible to register IRQ handler using + * VF Migration Interrupt Message Number, and call this function when the + * interrupt is generated by the hardware. + */ +irqreturn_t pci_sriov_migration(struct pci_dev *dev) +{ + if (!dev->sriov) + return IRQ_NONE; + + return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE; +} +EXPORT_SYMBOL_GPL(pci_sriov_migration); OK, I think I get it -- you've basically written an interrupt handler for the driver to call from its interrupt handler. Am I right in thinking that the reason the driver needs to do the interrupt handler here is because we don't currently have an interface that looks like: int pci_get_msix_interrupt(struct pci_dev *dev, unsigned vector); ? If so, we should probably add it; I want it for my MSI-X rewrite anyway. Right, we really need this function. But I guess we still have to keep the handler in case the PF only has MSI, right? Thanks, Yu
Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
Thanks a lot, Randy! On Sat, Mar 07, 2009 at 05:48:33AM +0800, Randy Dunlap wrote: Matthew Wilcox wrote: On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote: + if (nres != iov->nres) { + dev_err(&dev->dev, "no enough MMIO for SR-IOV\n"); + return -ENOMEM; + } "not enough MMIO BARs for SR-IOV" or "not enough MMIO resources for SR-IOV" or "too few MMIO BARs for SR-IOV"? Randy, can you help us out with better wording here? + dev_err(&dev->dev, "no enough bus range for SR-IOV\n"); and here. "SR-IOV: bus number too large" or "SR-IOV: bus number out of range" or "SR-IOV: cannot allocate valid bus number"? + if (iov->link != dev->devfn) { + rc = -ENODEV; + list_for_each_entry(link, &dev->bus->devices, bus_list) { + if (link->sriov && link->devfn == iov->link) + rc = sysfs_create_link(&iov->dev.kobj, + &link->dev.kobj, "dep_link");
Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
On Sat, Mar 07, 2009 at 10:34:54AM +0800, Greg KH wrote: On Fri, Mar 06, 2009 at 12:44:11PM -0700, Matthew Wilcox wrote: Physical Function driver patches for Intel 82576 NIC are available: http://patchwork.kernel.org/patch/8063/ http://patchwork.kernel.org/patch/8064/ http://patchwork.kernel.org/patch/8065/ http://patchwork.kernel.org/patch/8066/ I need to review this driver; I haven't done that yet. Has anyone else? The driver was rejected by the upstream developers, who said it would never be accepted. Sorry I didn't make it clear. These Physical Function driver patches are new ones that have been accepted by David Miller (net-next-2.6). The old ones I sent last time were for demonstration purposes, and won't be in any upstream trees. Thanks, Yu
Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
On Sat, Mar 07, 2009 at 10:38:45AM +0800, Greg KH wrote: On Fri, Mar 06, 2009 at 01:08:10PM -0700, Matthew Wilcox wrote: On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote: + list_for_each_entry(pdev, &dev->bus->devices, bus_list) + if (pdev->sriov) + break; + if (list_empty(&dev->bus->devices) || !pdev->sriov) + pdev = NULL; + ctrl = 0; + if (!pdev && pci_ari_enabled(dev->bus)) + ctrl |= PCI_SRIOV_CTRL_ARI; + I don't like this loop. At the end of a list_for_each_entry() loop, pdev will not be pointing at a pci_device, it'll be pointing to some offset from dev->bus->devices. So checking pdev->sriov at this point is really, really bad. I would prefer to see something like this: ctrl = 0; list_for_each_entry(pdev, &dev->bus->devices, bus_list) { if (pdev->sriov) goto ari_enabled; } if (pci_ari_enabled(dev->bus)) ctrl = PCI_SRIOV_CTRL_ARI; ari_enabled: pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); No, please use bus_for_each_dev() instead, or bus_find_device(); don't walk the bus list by hand. I'm kind of surprised that even builds. Hm, in looking at the 2.6.29-rc kernels, I notice it will not even build at all; you are now forced to use those functions, which is good. The devices haven't been added at this time, so we can't use bus_for_each_dev(). I guess that's why `bus->devices' exists, and actually pci_bus_add_devices() walks the bus list the same way to retrieve the devices and add them. Thanks, Yu
Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
On Tue, Mar 10, 2009 at 03:39:01AM +0800, Greg KH wrote: On Mon, Mar 09, 2009 at 04:25:05PM +0800, Yu Zhao wrote: + pci_device_add(virtfn, virtfn->bus); Greg is probably going to ding you here for adding the device, then creating the symlinks. I believe it's now best practice to create the symlinks first, so there's no window where userspace can get confused. Yes, but unfortunately we can't create links before adding a device. I double checked device_add(); there is no place for those links to be created before it sends the uevent. So for now, we have to trigger another uevent for those links. What exactly are you trying to do with a symlink here that you need to do it this way? I vaguely remember you mentioning this in the past, but I thought you had dropped the symlinks after our conversation about this very problem. I'd like to create some symlinks to reflect the relationship between a Physical Function and its associated Virtual Functions. The Physical Function is like a master device that controls the allocation of its Virtual Functions and owns the device physical resource. The Virtual Functions are like slave devices of the Physical Function. For example, suppose 01:00.0 is a Physical Function and 02:00.0 is a Virtual Function associated with 01:00.0. Then the symlinks (virtfnN and physfn) would look like: $ ls -l /sys/bus/pci/devices/0000:01:00.0/ ... virtfn0 -> ../0000:02:00.0 ... virtfn1 -> ../0000:02:00.1 ... virtfn2 -> ../0000:02:00.2 ... $ ls -l /sys/bus/pci/devices/0000:02:00.0/ ... physfn -> ../0000:01:00.0 ... This is very useful for userspace applications; both KVM and Xen need to know this kind of relationship so they can request permission from a Physical Function before using its associated Virtual Functions. Thanks, Yu
Re: [PATCH v3 0/6] ATS capability support for Intel IOMMU
On Sun, Feb 15, 2009 at 06:59:10AM +0800, Grant Grundler wrote: On Thu, Feb 12, 2009 at 08:50:32PM +0800, Yu Zhao wrote: This patch series implements Address Translation Service support for the Intel IOMMU. ATS makes the PCI Endpoint able to request the DMA address translation from the IOMMU and cache the translation in the Endpoint, thus alleviating IOMMU pressure and improving the hardware performance in the I/O virtualization environment. Changelog: v2 -> v3 1, throw error message if VT-d hardware detects invalid descriptor on Queued Invalidation interface (David Woodhouse) 2, avoid using pci_find_ext_capability every time when reading ATS Invalidate Queue Depth (Matthew Wilcox) Changelog: v1 -> v2 added 'static' prefix to a local LIST_HEAD (Andrew Morton) Yu Zhao (6): PCI: support the ATS capability VT-d: parse ATSR in DMA Remapping Reporting Structure VT-d: add queue invalidation fault status support VT-d: add device IOTLB invalidation support VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps VT-d: support the device IOTLB drivers/pci/dmar.c | 230 ++ Yu, Can you please add something to Documentation/PCI/pci.txt? New API I'm seeing are: +extern int pci_enable_ats(struct pci_dev *dev, int ps); +extern void pci_disable_ats(struct pci_dev *dev); +extern int pci_ats_queue_depth(struct pci_dev *dev); Yes, I'll document these new APIs. Do these also need to be EXPORT_SYMBOL_GPL() as well? Or are drivers never expected to call the above? PCI device drivers shouldn't use these APIs; only the IOMMU driver (which can't be a module) would use them. Anyway, it's a good idea to export them :-) Thanks, Yu
Re: [PATCH v3 6/6] VT-d: support the device IOTLB
On Sun, Feb 15, 2009 at 07:20:52AM +0800, Grant Grundler wrote: On Thu, Feb 12, 2009 at 08:50:38PM +0800, Yu Zhao wrote: Support device IOTLB (i.e. ATS) for both native and KVM environments. + +static void iommu_enable_dev_iotlb(struct device_domain_info *info) +{ + pci_enable_ats(info->dev, VTD_PAGE_SHIFT); +} Why is a static function defined that calls a global function? There would be some extra steps to do before VT-d enables ATS in the future, so this wrapper makes the code expandable later. + +static void iommu_disable_dev_iotlb(struct device_domain_info *info) +{ + if (info->dev && pci_ats_enabled(info->dev)) + pci_disable_ats(info->dev); +} ditto. pci_disable_ats() should be able to handle the case when info->dev is NULL and will know if ATS is enabled. The info->dev could be NULL only because the VT-d code makes it so. AMD and IBM IOMMUs may not have this requirement. If we make pci_disable_ats() accept a NULL pci_dev, it would fail to catch some errors like using pci_disable_ats() without calling pci_enable_ats() first. I think both of these functions can be dropped and just directly call pci_*_ats(). + +static void iommu_flush_dev_iotlb(struct dmar_domain *domain, + u64 addr, unsigned mask) +{ + int rc; + u16 sid, qdep; + unsigned long flags; + struct device_domain_info *info; + + spin_lock_irqsave(&device_domain_lock, flags); + list_for_each_entry(info, &domain->devices, link) { Would it be possible to define a single domain for each PCI device? Or does a domain represent an IOMMU? Sorry, I forgot...I'm sure someone has mentioned this in the past. A domain represents one translation mapping. For devices used by the host, there is one domain per device. Devices assigned to a guest share one domain. I want to point out list_for_each_entry() is effectively a nested loop. iommu_flush_dev_iotlb() will get called a lot from flush_unmaps(). Perhaps do the lookup once there and pass that as a parameter? I don't know if that is feasible. 
But if this is a very frequently used code path, every CPU cycle counts. iommu_flush_dev_iotlb() is only used to flush the devices used in the host, which means there is always one entry on the list. + if (!info->dev || !pci_ats_enabled(info->dev)) + continue; + + sid = info->bus << 8 | info->devfn; + qdep = pci_ats_queue_depth(info->dev); re Matthew Wilcox's comment - looks like caching ats_queue_depth is appropriate. Yes, it's cached as of v3. + rc = qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask); + if (rc) + printk(KERN_ERR "IOMMU: flush device IOTLB failed\n"); Can this be a dev_printk please? Yes, will replace it with dev_err(). Perhaps in general review the use of printk so when errors are reported, users will know which devices might be affected by the failure. If more than a few printk's should be converted to dev_printk(), I'd be happy if that were a separate patch (submitted later). pr_debug("Set context mapping for %02x:%02x.%d\n", bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); @@ -1534,7 +1608,11 @@ static int domain_context_mapping_one(struct dmar_domain *domain, context_set_domain_id(context, id); context_set_address_width(context, iommu->agaw); context_set_address_root(context, virt_to_phys(pgd)); - context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL); + info = iommu_support_dev_iotlb(domain, bus, devfn); + if (info) + context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB); + else + context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL); Would it be ok to rewrite this as: + context_set_translation_type(context, + info ? CONTEXT_TT_DEV_IOTLB : CONTEXT_TT_MULTI_LEVEL); Yes, this one looks better. 
context_set_fault_enable(context); context_set_present(context); domain_flush_cache(domain, context, sizeof(*context)); @@ -1546,6 +1624,8 @@ static int domain_context_mapping_one(struct dmar_domain *domain, iommu_flush_write_buffer(iommu); else iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_DSI_FLUSH, 0); Adding a blank line here would make this more readable. (AFAIK, not required by coding style, just my opinion.) Yes, I prefer a blank line here too; somehow I missed it. + if (info) + iommu_enable_dev_iotlb(info); Could iommu_enable_dev_iotlb() (or pci_enable_ats()) check if info is NULL? Then this would just be a simple function call. And it would be consistent with the usage of iommu_disable_dev_iotlb(). Yes, good idea. Thanks a lot for reviewing it! Yu
Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
On Tue, Feb 24, 2009 at 06:47:38PM +0800, Avi Kivity wrote: Yu Zhao wrote: Greetings, Following patches are intended to support SR-IOV capability in the Linux kernel. With these patches, people can turn a PCI device with the capability into multiple ones from a software perspective, which will benefit KVM and achieve other purposes such as QoS, security, etc. Do those patches allow using a VF on the host (in other words, does the kernel emulate config space accesses)? Yes, if a VF's driver is loaded in the host, the VF works the same way as a normal PCI device.
[PATCH v10 3/7] PCI: reserve bus range for SR-IOV device
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 34 ++ drivers/pci/pci.h | 5 + drivers/pci/probe.c | 3 +++ 3 files changed, 42 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 3bca8f8..0b80437 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -14,6 +14,16 @@ #include "pci.h" +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) +{ + u16 bdf; + + bdf = (dev->bus->number << 8) + dev->devfn + + dev->sriov->offset + dev->sriov->stride * id; + *busnr = bdf >> 8; + *devfn = bdf & 0xff; +} + static int sriov_init(struct pci_dev *dev, int pos) { int i; @@ -208,3 +218,27 @@ void pci_restore_iov_state(struct pci_dev *dev) if (dev->sriov) sriov_restore_state(dev); } + +/** + * pci_iov_bus_range - find bus range used by Virtual Function + * @bus: the PCI bus + * + * Returns max number of buses (exclude current one) used by Virtual + * Functions. + */ +int pci_iov_bus_range(struct pci_bus *bus) +{ + int max = 0; + u8 busnr, devfn; + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + if (!dev->sriov) + continue; + virtfn_bdf(dev, dev->sriov->total - 1, &busnr, &devfn); + if (busnr > max) + max = busnr; + } + + return max ? max - bus->number : 0; +} diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index b24c9e2..2cf32f5 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); extern void pci_restore_iov_state(struct pci_dev *dev); +extern int pci_iov_bus_range(struct pci_bus *bus); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, static inline void pci_restore_iov_state(struct pci_dev *dev) { } +static inline int pci_iov_bus_range(struct pci_bus *bus) +{ + return 0; +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 03b6f29..4c8abd0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) for (devfn = 0; devfn < 0x100; devfn += 8) pci_scan_slot(bus, devfn); + /* Reserve buses for SR-IOV capability. */ + max += pci_iov_bus_range(bus); + /* * After performing arch-dependent fixup of the bus, look behind * all PCI-to-PCI bridges on this bus. -- 1.6.1
[PATCH v10 0/7] PCI: Linux kernel SR-IOV support
Greetings, Following patches are intended to support SR-IOV capability in the Linux kernel. With these patches, people can turn a PCI device with the capability into multiple ones from a software perspective, which will benefit KVM and achieve other purposes such as QoS, security, etc. SR-IOV specification can be found at: http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf (it requires membership.) Devices that support SR-IOV are available from following vendors: http://download.intel.com/design/network/ProdBrf/320025.pdf http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf Physical Function driver patches for Intel 82576 NIC are available: http://patchwork.kernel.org/patch/8063/ http://patchwork.kernel.org/patch/8064/ http://patchwork.kernel.org/patch/8065/ http://patchwork.kernel.org/patch/8066/ Major changes from v9 to v10: 1, minor fix in pci_restore_iov_state(). 2, respin against the latest tree. Yu Zhao (7): PCI: initialize and release SR-IOV capability PCI: restore saved SR-IOV state PCI: reserve bus range for SR-IOV device PCI: add SR-IOV API for Physical Function driver PCI: handle SR-IOV Virtual Function Migration PCI: document SR-IOV sysfs entries PCI: manual for SR-IOV user and driver developer Documentation/ABI/testing/sysfs-bus-pci | 27 ++ Documentation/DocBook/kernel-api.tmpl | 1 + Documentation/PCI/pci-iov-howto.txt | 99 + drivers/pci/Kconfig | 13 + drivers/pci/Makefile | 3 + drivers/pci/iov.c | 711 +++ drivers/pci/pci.c | 8 + drivers/pci/pci.h | 53 +++ drivers/pci/probe.c | 7 + include/linux/pci.h | 28 ++ include/linux/pci_regs.h | 33 ++ 11 files changed, 983 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt create mode 100644 drivers/pci/iov.c
[PATCH v10 2/7] PCI: restore saved SR-IOV state
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 29 + drivers/pci/pci.c | 1 + drivers/pci/pci.h | 4 3 files changed, 34 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index e6736d4..3bca8f8 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -128,6 +128,25 @@ static void sriov_release(struct pci_dev *dev) dev->sriov = NULL; } +static void sriov_restore_state(struct pci_dev *dev) +{ + int i; + u16 ctrl; + struct pci_sriov *iov = dev->sriov; + + pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &ctrl); + if (ctrl & PCI_SRIOV_CTRL_VFE) + return; + + for (i = PCI_SRIOV_RESOURCES; i <= PCI_SRIOV_RESOURCE_END; i++) + pci_update_resource(dev, i); + + pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz); + pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl); + if (iov->ctrl & PCI_SRIOV_CTRL_VFE) + msleep(100); +} + /** * pci_iov_init - initialize the IOV capability * @dev: the PCI device @@ -179,3 +198,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno, return dev->sriov->pos + PCI_SRIOV_BAR + 4 * (resno - PCI_SRIOV_RESOURCES); } + +/** + * pci_restore_iov_state - restore the state of the IOV capability + * @dev: the PCI device + */ +void pci_restore_iov_state(struct pci_dev *dev) +{ + if (dev->sriov) + sriov_restore_state(dev); +} diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 2eba2a5..8e21912 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev) } pci_restore_pcix_state(dev); pci_restore_msi_state(dev); + pci_restore_iov_state(dev); return 0; } diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 451db74..b24c9e2 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev); extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); +extern void pci_restore_iov_state(struct pci_dev *dev); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, { return 0; } +static inline void pci_restore_iov_state(struct pci_dev *dev) +{ +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ -- 1.6.1
[PATCH v10 1/7] PCI: initialize and release SR-IOV capability
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/Kconfig | 13 drivers/pci/Makefile | 3 + drivers/pci/iov.c | 181 ++ drivers/pci/pci.c | 7 ++ drivers/pci/pci.h | 37 ++ drivers/pci/probe.c | 4 + include/linux/pci.h | 8 ++ include/linux/pci_regs.h | 33 + 8 files changed, 286 insertions(+), 0 deletions(-) create mode 100644 drivers/pci/iov.c diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 2a4501d..e8ea3e8 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -59,3 +59,16 @@ config HT_IRQ This allows native hypertransport devices to use interrupts. If unsure say Y. + +config PCI_IOV + bool "PCI IOV support" + depends on PCI + select PCI_MSI + default n + help + PCI-SIG I/O Virtualization (IOV) Specifications support. + Single Root IOV: allows the Physical Function driver to enable + the hardware capability, so the Virtual Function is accessible + via the PCI Configuration Space using its own Bus, Device and + Function Numbers. Each Virtual Function also has the PCI Memory + Space to map the device specific register set. diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 3d07ce2..ba99282 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o +# PCI IOV support +obj-$(CONFIG_PCI_IOV) += iov.o + # # Some architectures use the generic PCI setup functions # diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c new file mode 100644 index 000..e6736d4 --- /dev/null +++ b/drivers/pci/iov.c @@ -0,0 +1,181 @@ +/* + * drivers/pci/iov.c + * + * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com + * + * PCI Express I/O Virtualization (IOV) support. + * Single Root IOV 1.0 + */ + +#include <linux/pci.h> +#include <linux/mutex.h> +#include <linux/string.h> +#include <linux/delay.h> +#include "pci.h" + + +static int sriov_init(struct pci_dev *dev, int pos) +{ + int i; + int rc; + int nres; + u32 pgsz; + u16 ctrl, total, offset, stride; + struct pci_sriov *iov; + struct resource *res; + struct pci_dev *pdev; + + if (dev->pcie_type != PCI_EXP_TYPE_RC_END && + dev->pcie_type != PCI_EXP_TYPE_ENDPOINT) + return -ENODEV; + + pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl); + if (ctrl & PCI_SRIOV_CTRL_VFE) { + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0); + ssleep(1); + } + + pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total); + if (!total) + return 0; + + list_for_each_entry(pdev, &dev->bus->devices, bus_list) + if (pdev->sriov) + break; + if (list_empty(&dev->bus->devices) || !pdev->sriov) + pdev = NULL; + + ctrl = 0; + if (!pdev && pci_ari_enabled(dev->bus)) + ctrl |= PCI_SRIOV_CTRL_ARI; + + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride); + if (!offset || (total > 1 && !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz); + i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0; + pgsz &= ~((1 << i) - 1); + if (!pgsz) + return -EIO; + + pgsz &= ~(pgsz - 1); + pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz); + + nres = 0; + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { + res = dev->resource + PCI_SRIOV_RESOURCES + i; + i += __pci_read_base(dev, pci_bar_unknown, res, + pos + PCI_SRIOV_BAR + i * 4); + if (!res->flags) + continue; + if (resource_size(res) & (PAGE_SIZE - 1)) { + rc = -EIO; + goto failed; + } + res->end = res->start + resource_size(res) * total - 1; + nres++; + } + + iov = kzalloc(sizeof(*iov), GFP_KERNEL); + if (!iov) { + rc = -ENOMEM; + goto failed; + } + + iov->pos = pos; + iov->nres = nres; + iov->ctrl = ctrl; + iov->total = total; + iov->offset = offset; + iov->stride = stride; + iov->pgsz = pgsz; + iov->self = dev; + pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); + pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); + + if (pdev) + iov->pdev = pci_dev_get(pdev); + else { + iov->pdev = dev
[PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 348 +++ drivers/pci/pci.h | 3 + include/linux/pci.h | 14 ++ 3 files changed, 365 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 0b80437..8096fc9 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -13,6 +13,8 @@ #include <linux/delay.h> #include "pci.h" +#define VIRTFN_ID_LEN 8 + static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) { @@ -24,6 +26,319 @@ static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) *devfn = bdf & 0xff; } +static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr) +{ + int rc; + struct pci_bus *child; + + if (bus->number == busnr) + return bus; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + if (child) + return child; + + child = pci_add_new_bus(bus, NULL, busnr); + if (!child) + return NULL; + + child->subordinate = busnr; + child->dev.parent = bus->bridge; + rc = pci_bus_add_child(child); + if (rc) { + pci_remove_bus(child); + return NULL; + } + + return child; +} + +static void virtfn_remove_bus(struct pci_bus *bus, int busnr) +{ + struct pci_bus *child; + + if (bus->number == busnr) + return; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + BUG_ON(!child); + + if (list_empty(&child->devices)) + pci_remove_bus(child); +} + +static int virtfn_add(struct pci_dev *dev, int id, int reset) +{ + int i; + int rc; + u64 size; + u8 busnr, devfn; + char buf[VIRTFN_ID_LEN]; + struct pci_dev *virtfn; + struct resource *res; + struct pci_sriov *iov = dev->sriov; + + virtfn = alloc_pci_dev(); + if (!virtfn) + return -ENOMEM; + + virtfn_bdf(dev, id, &busnr, &devfn); + mutex_lock(&iov->pdev->sriov->lock); + virtfn->bus = virtfn_add_bus(dev->bus, busnr); + if (!virtfn->bus) { + kfree(virtfn); + mutex_unlock(&iov->pdev->sriov->lock); + return -ENOMEM; + } + + virtfn->sysdata = dev->bus->sysdata; + virtfn->dev.parent = dev->dev.parent; + virtfn->dev.bus = dev->dev.bus; + virtfn->devfn = devfn; + virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL; + virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE; + virtfn->error_state = pci_channel_io_normal; + virtfn->current_state = PCI_UNKNOWN; + virtfn->is_pcie = 1; + virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT; + virtfn->dma_mask = 0x; + virtfn->vendor = dev->vendor; + virtfn->subsystem_vendor = dev->subsystem_vendor; + virtfn->class = dev->class; + pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device); + pci_read_config_byte(virtfn, PCI_REVISION_ID, &virtfn->revision); + pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID, + &virtfn->subsystem_device); + + dev_set_name(&virtfn->dev, "%04x:%02x:%02x.%d", + pci_domain_nr(virtfn->bus), busnr, + PCI_SLOT(devfn), PCI_FUNC(devfn)); + + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { + res = dev->resource + PCI_SRIOV_RESOURCES + i; + if (!res->parent) + continue; + virtfn->resource[i].name = pci_name(virtfn); + virtfn->resource[i].flags = res->flags; + size = resource_size(res); + do_div(size, iov->total); + virtfn->resource[i].start = res->start + size * id; + virtfn->resource[i].end = virtfn->resource[i].start + size - 1; + rc = request_resource(res, &virtfn->resource[i]); + BUG_ON(rc); + } + + if (reset) + pci_execute_reset_function(virtfn); + + pci_device_add(virtfn, virtfn->bus); + mutex_unlock(&iov->pdev->sriov->lock); + + virtfn->physfn = pci_dev_get(dev); + + rc = pci_bus_add_device(virtfn); + if (rc) + goto failed1; + sprintf(buf, "%d", id); + rc = sysfs_create_link(&iov->dev.kobj, &virtfn->dev.kobj, buf); + if (rc) + goto failed1; + rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn"); + if (rc) + goto failed2; + + kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE); + + return 0; + +failed2: + sysfs_remove_link(&iov->dev.kobj, buf); +failed1: + pci_dev_put(dev); + mutex_lock(&iov->pdev->sriov->lock); + pci_remove_bus_device(virtfn); + virtfn_remove_bus(dev->bus, busnr); + mutex_unlock(&iov->pdev->sriov->lock); + + return rc; +} + +static void virtfn_remove(struct pci_dev *dev, int id, int reset) +{ + u8 busnr, devfn; + char buf[VIRTFN_ID_LEN]; + struct
[PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 119 +++ drivers/pci/pci.h |4 ++ include/linux/pci.h |6 +++ 3 files changed, 129 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 8096fc9..063fe74 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -206,6 +206,97 @@ static void sriov_release_dev(struct device *dev) iov-nr_virtfn = 0; } +static int sriov_migration(struct pci_dev *dev) +{ + u16 status; + struct pci_sriov *iov = dev-sriov; + + if (!iov-nr_virtfn) + return 0; + + if (!(iov-cap PCI_SRIOV_CAP_VFM)) + return 0; + + pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); + if (!(status PCI_SRIOV_STATUS_VFM)) + return 0; + + schedule_work(iov-mtask); + + return 1; +} + +static void sriov_migration_task(struct work_struct *work) +{ + int i; + u8 state; + u16 status; + struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask); + + for (i = iov-initial; i iov-nr_virtfn; i++) { + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_MI) { + writeb(PCI_SRIOV_VFM_AV, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 1); + } else if (state == PCI_SRIOV_VFM_MO) { + virtfn_remove(iov-self, i, 1); + writeb(PCI_SRIOV_VFM_UA, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 0); + } + } + + pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); + status = ~PCI_SRIOV_STATUS_VFM; + pci_write_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); +} + +static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn) +{ + int bir; + u32 table; + resource_size_t pa; + struct pci_sriov *iov = dev-sriov; + + if (nr_virtfn = iov-initial) + return 0; + + pci_read_config_dword(dev, iov-pos + PCI_SRIOV_VFM, table); + bir = PCI_SRIOV_VFM_BIR(table); + if (bir PCI_STD_RESOURCE_END) + return -EIO; + + table = PCI_SRIOV_VFM_OFFSET(table); + if (table + nr_virtfn 
pci_resource_len(dev, bir)) + return -EIO; + + pa = pci_resource_start(dev, bir) + table; + iov-mstate = ioremap(pa, nr_virtfn); + if (!iov-mstate) + return -ENOMEM; + + INIT_WORK(iov-mtask, sriov_migration_task); + + iov-ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR; + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + return 0; +} + +static void sriov_disable_migration(struct pci_dev *dev) +{ + struct pci_sriov *iov = dev-sriov; + + iov-ctrl = ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR); + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + cancel_work_sync(iov-mtask); + iounmap(iov-mstate); +} + static int sriov_enable(struct pci_dev *dev, int nr_virtfn) { int rc; @@ -294,6 +385,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) goto failed2; } + if (iov-cap PCI_SRIOV_CAP_VFM) { + rc = sriov_enable_migration(dev, nr_virtfn); + if (rc) + goto failed2; + } + kobject_uevent(dev-dev.kobj, KOBJ_CHANGE); iov-nr_virtfn = nr_virtfn; @@ -325,6 +422,9 @@ static void sriov_disable(struct pci_dev *dev) if (!iov-nr_virtfn) return; + if (iov-cap PCI_SRIOV_CAP_VFM) + sriov_disable_migration(dev); + for (i = 0; i iov-nr_virtfn; i++) virtfn_remove(dev, i, 0); @@ -590,3 +690,22 @@ void pci_disable_sriov(struct pci_dev *dev) sriov_disable(dev); } EXPORT_SYMBOL_GPL(pci_disable_sriov); + +/** + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration + * @dev: the PCI device + * + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not. + * + * Physical Function driver is responsible to register IRQ handler using + * VF Migration Interrupt Message Number, and call this function when the + * interrupt is generated by the hardware. + */ +irqreturn_t pci_sriov_migration(struct pci_dev *dev) +{ + if (!dev-sriov) + return IRQ_NONE; + + return sriov_migration(dev) ? 
IRQ_HANDLED : IRQ_NONE; +} +EXPORT_SYMBOL_GPL(pci_sriov_migration); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 9bbf868..6764f02 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -1,6 +1,8 @@ #ifndef
[PATCH v10 6/7] PCI: document SR-IOV sysfs entries
Signed-off-by: Yu Zhao yu.z...@intel.com --- Documentation/ABI/testing/sysfs-bus-pci | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index ceddcff..84dc100 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -9,3 +9,30 @@ Description: that some devices may have malformatted data. If the underlying VPD has a writable section then the corresponding section of this file will be writable. + +What: /sys/bus/pci/devices/.../virtfn/N +Date: February 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbol link appears when hardware supports SR-IOV + capability and Physical Function driver has enabled it. + The symbol link points to the PCI device sysfs entry of + Virtual Function whose index is N (0...MaxVFs-1). + +What: /sys/bus/pci/devices/.../virtfn/dep_link +Date: February 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbol link appears when hardware supports SR-IOV + capability and Physical Function driver has enabled it, + and this device has vendor specific dependencies with + others. The symbol link points to the PCI device sysfs + entry of Physical Function this device depends on. + +What: /sys/bus/pci/devices/.../physfn +Date: February 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbol link appears when a device is Virtual Function. + The symbol link points to the PCI device sysfs entry of + Physical Function this device associates with. -- 1.6.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer
Signed-off-by: Yu Zhao yu.z...@intel.com --- Documentation/DocBook/kernel-api.tmpl |1 + Documentation/PCI/pci-iov-howto.txt | 99 + 2 files changed, 100 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 5818ff7..506e611 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c -- !Edrivers/pci/probe.c !Edrivers/pci/rom.c +!Edrivers/pci/iov.c /sect1 sect1titlePCI Hotplug Support Library/title !Edrivers/pci/hotplug/pci_hotplug_core.c diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 000..fc73ef5 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt @@ -0,0 +1,99 @@ + PCI Express I/O Virtualization Howto + Copyright (C) 2009 Intel Corporation + Yu Zhao yu.z...@intel.com + + +1. Overview + +1.1 What is SR-IOV + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as Physical Function (PF) +while the virtual devices are referred to as Virtual Functions (VF). +Allocation of the VF can be dynamically controlled by the PF via +registers encapsulated in the capability. By default, this feature is +not enabled and the PF behaves as traditional PCIe device. Once it's +turned on, each VF's PCI configuration space can be accessed by its own +Bus, Device and Function Number (Routing ID). And each VF also has PCI +Memory Space, which is used to map its register set. VF device driver +operates on the register set so it can be functional and appear as a +real existing PCI device. + +2. User Guide + +2.1 How can I enable SR-IOV capability + +The device driver (PF driver) will control the enabling and disabling +of the capability via API provided by SR-IOV core. 
If the hardware +has the SR-IOV capability, loading its PF driver will enable it and all +VFs associated with the PF. + +2.2 How can I use the Virtual Functions + +VFs are treated as hot-plugged PCI devices in the kernel, so they +work in the same way as real PCI devices. A VF requires a device +driver, just as a normal PCI device does. + +3. Developer Guide + +3.1 SR-IOV API + +To enable the SR-IOV capability: + int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); + 'nr_virtfn' is the number of VFs to be enabled. + +To disable the SR-IOV capability: + void pci_disable_sriov(struct pci_dev *dev); + +To notify the SR-IOV core of Virtual Function Migration: + irqreturn_t pci_sriov_migration(struct pci_dev *dev); + +3.2 Usage example + +The following piece of code illustrates the usage of the SR-IOV API. + +static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id) +{ + pci_enable_sriov(dev, NR_VIRTFN); + + ... + + return 0; +} + +static void __devexit dev_remove(struct pci_dev *dev) +{ + pci_disable_sriov(dev); + + ... +} + +static int dev_suspend(struct pci_dev *dev, pm_message_t state) +{ + ... + + return 0; +} + +static int dev_resume(struct pci_dev *dev) +{ + ... + + return 0; +} + +static void dev_shutdown(struct pci_dev *dev) +{ + ... +} + +static struct pci_driver dev_driver = { + .name = "SR-IOV Physical Function driver", + .id_table = dev_id_table, + .probe = dev_probe, + .remove = __devexit_p(dev_remove), + .suspend = dev_suspend, + .resume = dev_resume, + .shutdown = dev_shutdown, +}; -- 1.6.1
[PATCH] VT-d: enable DMAR on 32-bit kernel
From: David Woodhouse dw...@infradead.org If we fix a few highmem-related thinkos and a couple of printk format warnings, the Intel IOMMU driver works fine in a 32-bit kernel. -- Fixed end address roundup problem in dma_pte_clear_range(). Tested both 32 and 32 PAE modes on Intel X58 and Q35 platforms. Signed-off-by: Yu Zhao yu.z...@intel.com --- arch/x86/Kconfig |2 +- drivers/pci/intel-iommu.c | 24 +++- 2 files changed, 12 insertions(+), 14 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 9c39095..9e9ac5c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1794,7 +1794,7 @@ config PCI_MMCONFIG config DMAR bool Support for DMA Remapping Devices (EXPERIMENTAL) - depends on X86_64 PCI_MSI ACPI EXPERIMENTAL + depends on PCI_MSI ACPI EXPERIMENTAL help DMA remapping (DMAR) devices support enables independent address translations for Direct Memory Access (DMA) from devices. diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index f4b7c79..03bc0e5 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -687,15 +687,17 @@ static void dma_pte_clear_one(struct dmar_domain *domain, u64 addr) static void dma_pte_clear_range(struct dmar_domain *domain, u64 start, u64 end) { int addr_width = agaw_to_width(domain-agaw); + int npages; start = (((u64)1) addr_width) - 1; end = (((u64)1) addr_width) - 1; /* in case it's partial page */ start = PAGE_ALIGN(start); end = PAGE_MASK; + npages = (end - start) / VTD_PAGE_SIZE; /* we don't need lock here, nobody else touches the iova range */ - while (start end) { + while (npages--) { dma_pte_clear_one(domain, start); start += VTD_PAGE_SIZE; } @@ -2277,7 +2279,7 @@ static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr, error: if (iova) __free_iova(domain-iovad, iova); - printk(KERN_ERRDevice %s request: %...@%llx dir %d --- failed\n, + printk(KERN_ERRDevice %s request: %...@%llx dir %d --- failed\n, pci_name(pdev), size, (unsigned long long)paddr, dir); return 0; 
} @@ -2373,7 +2375,7 @@ void intel_unmap_single(struct device *dev, dma_addr_t dev_addr, size_t size, start_addr = iova-pfn_lo PAGE_SHIFT; size = aligned_size((u64)dev_addr, size); - pr_debug(Device %s unmapping: %...@%llx\n, + pr_debug(Device %s unmapping: %...@%llx\n, pci_name(pdev), size, (unsigned long long)start_addr); /* clear the whole page */ @@ -2431,8 +2433,6 @@ void intel_free_coherent(struct device *hwdev, size_t size, void *vaddr, free_pages((unsigned long)vaddr, order); } -#define SG_ENT_VIRT_ADDRESS(sg)(sg_virt((sg))) - void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, int nelems, int dir) { @@ -2442,7 +2442,7 @@ void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, unsigned long start_addr; struct iova *iova; size_t size = 0; - void *addr; + phys_addr_t addr; struct scatterlist *sg; struct intel_iommu *iommu; @@ -2458,7 +2458,7 @@ void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, if (!iova) return; for_each_sg(sglist, sg, nelems, i) { - addr = SG_ENT_VIRT_ADDRESS(sg); + addr = page_to_phys(sg_page(sg)) + sg-offset; size += aligned_size((u64)addr, sg-length); } @@ -2485,7 +2485,7 @@ static int intel_nontranslate_map_sg(struct device *hddev, for_each_sg(sglist, sg, nelems, i) { BUG_ON(!sg_page(sg)); - sg-dma_address = virt_to_bus(SG_ENT_VIRT_ADDRESS(sg)); + sg-dma_address = page_to_phys(sg_page(sg)) + sg-offset; sg-dma_length = sg-length; } return nelems; @@ -2494,7 +2494,7 @@ static int intel_nontranslate_map_sg(struct device *hddev, int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int nelems, int dir) { - void *addr; + phys_addr_t addr; int i; struct pci_dev *pdev = to_pci_dev(hwdev); struct dmar_domain *domain; @@ -2518,8 +2518,7 @@ int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int nelems, iommu = domain_get_iommu(domain); for_each_sg(sglist, sg, nelems, i) { - addr = SG_ENT_VIRT_ADDRESS(sg); - addr = (void *)virt_to_phys(addr); + addr = 
page_to_phys(sg_page(sg)) + sg-offset; size += aligned_size((u64)addr, sg-length); } @@ -2542,8 +2541,7 @@ int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int nelems, start_addr = iova-pfn_lo PAGE_SHIFT; offset = 0
[PATCH v9 0/7] PCI: Linux kernel SR-IOV support
Greetings, The following patches are intended to support the SR-IOV capability in the Linux kernel. With these patches, a PCI device that has the capability can be turned into multiple devices from the software perspective, which will benefit KVM and serve other purposes such as QoS and security. The SR-IOV specification can be found at: http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf (it requires membership.) Devices that support SR-IOV are available from the following vendors: http://download.intel.com/design/network/ProdBrf/320025.pdf http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf A Physical Function driver for the Intel 82576 NIC (based on drivers/net/igb/) will come soon. Major changes from v8 to v9: 1, put a might_sleep() into the SR-IOV API calls that sleep (Andi Kleen) 2, block user config accesses before clearing the VF Enable bit (Matthew Wilcox) Yu Zhao (7): PCI: initialize and release SR-IOV capability PCI: restore saved SR-IOV state PCI: reserve bus range for SR-IOV device PCI: add SR-IOV API for Physical Function driver PCI: handle SR-IOV Virtual Function Migration PCI: document SR-IOV sysfs entries PCI: manual for SR-IOV user and driver developer Documentation/ABI/testing/sysfs-bus-pci | 27 ++ Documentation/DocBook/kernel-api.tmpl | 1 + Documentation/PCI/pci-iov-howto.txt | 99 + drivers/pci/Kconfig | 13 + drivers/pci/Makefile | 3 + drivers/pci/iov.c | 707 +++ drivers/pci/pci.c | 8 + drivers/pci/pci.h | 53 +++ drivers/pci/probe.c | 7 + include/linux/pci.h | 28 ++ include/linux/pci_regs.h | 33 ++ 11 files changed, 979 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt create mode 100644 drivers/pci/iov.c
[PATCH v9 2/7] PCI: restore saved SR-IOV state
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 25 + drivers/pci/pci.c | 1 + drivers/pci/pci.h | 4 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index e6736d4..1cc879b 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -128,6 +128,21 @@ static void sriov_release(struct pci_dev *dev) dev->sriov = NULL; } +static void sriov_restore_state(struct pci_dev *dev) +{ + u16 ctrl; + struct pci_sriov *iov = dev->sriov; + + pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &ctrl); + if (ctrl & PCI_SRIOV_CTRL_VFE) + return; + + pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz); + pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl); + if (iov->ctrl & PCI_SRIOV_CTRL_VFE) + msleep(100); +} + /** * pci_iov_init - initialize the IOV capability * @dev: the PCI device @@ -179,3 +194,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno, return dev->sriov->pos + PCI_SRIOV_BAR + 4 * (resno - PCI_SRIOV_RESOURCES); } + +/** + * pci_restore_iov_state - restore the state of the IOV capability + * @dev: the PCI device + */ +void pci_restore_iov_state(struct pci_dev *dev) +{ + if (dev->sriov) + sriov_restore_state(dev); +} diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index c4f14f3..f791dcf 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev) } pci_restore_pcix_state(dev); pci_restore_msi_state(dev); + pci_restore_iov_state(dev); return 0; } diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index d2dc6b7..9d76737 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev); extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); +extern void pci_restore_iov_state(struct pci_dev *dev); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, { return 0; } +static inline void pci_restore_iov_state(struct pci_dev *dev) +{ +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ -- 1.5.6.4
[PATCH v9 1/7] PCI: initialize and release SR-IOV capability
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/Kconfig | 13 drivers/pci/Makefile |3 + drivers/pci/iov.c| 181 ++ drivers/pci/pci.c|7 ++ drivers/pci/pci.h| 37 ++ drivers/pci/probe.c |4 + include/linux/pci.h |8 ++ include/linux/pci_regs.h | 33 + 8 files changed, 286 insertions(+), 0 deletions(-) create mode 100644 drivers/pci/iov.c diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 2a4501d..e8ea3e8 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -59,3 +59,16 @@ config HT_IRQ This allows native hypertransport devices to use interrupts. If unsure say Y. + +config PCI_IOV + bool PCI IOV support + depends on PCI + select PCI_MSI + default n + help + PCI-SIG I/O Virtualization (IOV) Specifications support. + Single Root IOV: allows the Physical Function driver to enable + the hardware capability, so the Virtual Function is accessible + via the PCI Configuration Space using its own Bus, Device and + Function Numbers. Each Virtual Function also has the PCI Memory + Space to map the device specific register set. diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 3d07ce2..ba99282 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o +# PCI IOV support +obj-$(CONFIG_PCI_IOV) += iov.o + # # Some architectures use the generic PCI setup functions # diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c new file mode 100644 index 000..e6736d4 --- /dev/null +++ b/drivers/pci/iov.c @@ -0,0 +1,181 @@ +/* + * drivers/pci/iov.c + * + * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com + * + * PCI Express I/O Virtualization (IOV) support. 
+ * Single Root IOV 1.0 + */ + +#include linux/pci.h +#include linux/mutex.h +#include linux/string.h +#include linux/delay.h +#include pci.h + + +static int sriov_init(struct pci_dev *dev, int pos) +{ + int i; + int rc; + int nres; + u32 pgsz; + u16 ctrl, total, offset, stride; + struct pci_sriov *iov; + struct resource *res; + struct pci_dev *pdev; + + if (dev-pcie_type != PCI_EXP_TYPE_RC_END + dev-pcie_type != PCI_EXP_TYPE_ENDPOINT) + return -ENODEV; + + pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); + if (ctrl PCI_SRIOV_CTRL_VFE) { + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0); + ssleep(1); + } + + pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, total); + if (!total) + return 0; + + list_for_each_entry(pdev, dev-bus-devices, bus_list) + if (pdev-sriov) + break; + if (list_empty(dev-bus-devices) || !pdev-sriov) + pdev = NULL; + + ctrl = 0; + if (!pdev pci_ari_enabled(dev-bus)) + ctrl |= PCI_SRIOV_CTRL_ARI; + + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, offset); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, stride); + if (!offset || (total 1 !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, pgsz); + i = PAGE_SHIFT 12 ? 
PAGE_SHIFT - 12 : 0; + pgsz = ~((1 i) - 1); + if (!pgsz) + return -EIO; + + pgsz = ~(pgsz - 1); + pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz); + + nres = 0; + for (i = 0; i PCI_SRIOV_NUM_BARS; i++) { + res = dev-resource + PCI_SRIOV_RESOURCES + i; + i += __pci_read_base(dev, pci_bar_unknown, res, +pos + PCI_SRIOV_BAR + i * 4); + if (!res-flags) + continue; + if (resource_size(res) (PAGE_SIZE - 1)) { + rc = -EIO; + goto failed; + } + res-end = res-start + resource_size(res) * total - 1; + nres++; + } + + iov = kzalloc(sizeof(*iov), GFP_KERNEL); + if (!iov) { + rc = -ENOMEM; + goto failed; + } + + iov-pos = pos; + iov-nres = nres; + iov-ctrl = ctrl; + iov-total = total; + iov-offset = offset; + iov-stride = stride; + iov-pgsz = pgsz; + iov-self = dev; + pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, iov-cap); + pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, iov-link); + + if (pdev) + iov-pdev = pci_dev_get(pdev); + else { + iov-pdev = dev
[PATCH v9 3/7] PCI: reserve bus range for SR-IOV device
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 34 ++ drivers/pci/pci.h | 5 + drivers/pci/probe.c | 3 +++ 3 files changed, 42 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 1cc879b..c89fcb1 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -14,6 +14,16 @@ #include "pci.h" +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) +{ + u16 bdf; + + bdf = (dev->bus->number << 8) + dev->devfn + + dev->sriov->offset + dev->sriov->stride * id; + *busnr = bdf >> 8; + *devfn = bdf & 0xff; +} + static int sriov_init(struct pci_dev *dev, int pos) { int i; @@ -204,3 +214,27 @@ void pci_restore_iov_state(struct pci_dev *dev) if (dev->sriov) sriov_restore_state(dev); } + +/** + * pci_iov_bus_range - find bus range used by Virtual Function + * @bus: the PCI bus + * + * Returns max number of buses (exclude current one) used by Virtual + * Functions. + */ +int pci_iov_bus_range(struct pci_bus *bus) +{ + int max = 0; + u8 busnr, devfn; + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + if (!dev->sriov) + continue; + virtfn_bdf(dev, dev->sriov->total - 1, &busnr, &devfn); + if (busnr > max) + max = busnr; + } + + return max ? max - bus->number : 0; +} diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 9d76737..fdfc476 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); extern void pci_restore_iov_state(struct pci_dev *dev); +extern int pci_iov_bus_range(struct pci_bus *bus); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, static inline void pci_restore_iov_state(struct pci_dev *dev) { } +static inline int pci_iov_bus_range(struct pci_bus *bus) +{ + return 0; +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 03b6f29..4c8abd0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) for (devfn = 0; devfn < 0x100; devfn += 8) pci_scan_slot(bus, devfn); + /* Reserve buses for SR-IOV capability. */ + max += pci_iov_bus_range(bus); + /* * After performing arch-dependent fixup of the bus, look behind * all PCI-to-PCI bridges on this bus. -- 1.5.6.4
[PATCH v9 5/7] PCI: handle SR-IOV Virtual Function Migration
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 119 +++ drivers/pci/pci.h |4 ++ include/linux/pci.h |6 +++ 3 files changed, 129 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index e4e2dac..127f643 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -206,6 +206,97 @@ static void sriov_release_dev(struct device *dev) iov-nr_virtfn = 0; } +static int sriov_migration(struct pci_dev *dev) +{ + u16 status; + struct pci_sriov *iov = dev-sriov; + + if (!iov-nr_virtfn) + return 0; + + if (!(iov-cap PCI_SRIOV_CAP_VFM)) + return 0; + + pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); + if (!(status PCI_SRIOV_STATUS_VFM)) + return 0; + + schedule_work(iov-mtask); + + return 1; +} + +static void sriov_migration_task(struct work_struct *work) +{ + int i; + u8 state; + u16 status; + struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask); + + for (i = iov-initial; i iov-nr_virtfn; i++) { + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_MI) { + writeb(PCI_SRIOV_VFM_AV, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 1); + } else if (state == PCI_SRIOV_VFM_MO) { + virtfn_remove(iov-self, i, 1); + writeb(PCI_SRIOV_VFM_UA, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 0); + } + } + + pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); + status = ~PCI_SRIOV_STATUS_VFM; + pci_write_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); +} + +static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn) +{ + int bir; + u32 table; + resource_size_t pa; + struct pci_sriov *iov = dev-sriov; + + if (nr_virtfn = iov-initial) + return 0; + + pci_read_config_dword(dev, iov-pos + PCI_SRIOV_VFM, table); + bir = PCI_SRIOV_VFM_BIR(table); + if (bir PCI_STD_RESOURCE_END) + return -EIO; + + table = PCI_SRIOV_VFM_OFFSET(table); + if (table + nr_virtfn 
pci_resource_len(dev, bir)) + return -EIO; + + pa = pci_resource_start(dev, bir) + table; + iov-mstate = ioremap(pa, nr_virtfn); + if (!iov-mstate) + return -ENOMEM; + + INIT_WORK(iov-mtask, sriov_migration_task); + + iov-ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR; + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + return 0; +} + +static void sriov_disable_migration(struct pci_dev *dev) +{ + struct pci_sriov *iov = dev-sriov; + + iov-ctrl = ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR); + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + cancel_work_sync(iov-mtask); + iounmap(iov-mstate); +} + static int sriov_enable(struct pci_dev *dev, int nr_virtfn) { int rc; @@ -294,6 +385,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) goto failed2; } + if (iov-cap PCI_SRIOV_CAP_VFM) { + rc = sriov_enable_migration(dev, nr_virtfn); + if (rc) + goto failed2; + } + kobject_uevent(dev-dev.kobj, KOBJ_CHANGE); iov-nr_virtfn = nr_virtfn; @@ -325,6 +422,9 @@ static void sriov_disable(struct pci_dev *dev) if (!iov-nr_virtfn) return; + if (iov-cap PCI_SRIOV_CAP_VFM) + sriov_disable_migration(dev); + for (i = 0; i iov-nr_virtfn; i++) virtfn_remove(dev, i, 0); @@ -586,3 +686,22 @@ void pci_disable_sriov(struct pci_dev *dev) sriov_disable(dev); } EXPORT_SYMBOL_GPL(pci_disable_sriov); + +/** + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration + * @dev: the PCI device + * + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not. + * + * Physical Function driver is responsible to register IRQ handler using + * VF Migration Interrupt Message Number, and call this function when the + * interrupt is generated by the hardware. + */ +irqreturn_t pci_sriov_migration(struct pci_dev *dev) +{ + if (!dev-sriov) + return IRQ_NONE; + + return sriov_migration(dev) ? 
IRQ_HANDLED : IRQ_NONE; +} +EXPORT_SYMBOL_GPL(pci_sriov_migration); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 328a611..51bebb2 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -1,6 +1,8 @@ #ifndef
[PATCH v9 4/7] PCI: add SR-IOV API for Physical Function driver
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 348 +++ drivers/pci/pci.h |3 + include/linux/pci.h | 14 ++ 3 files changed, 365 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index c89fcb1..e4e2dac 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -13,6 +13,8 @@ #include linux/delay.h #include pci.h +#define VIRTFN_ID_LEN 8 + static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) { @@ -24,6 +26,319 @@ static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) *devfn = bdf 0xff; } +static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr) +{ + int rc; + struct pci_bus *child; + + if (bus-number == busnr) + return bus; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + if (child) + return child; + + child = pci_add_new_bus(bus, NULL, busnr); + if (!child) + return NULL; + + child-subordinate = busnr; + child-dev.parent = bus-bridge; + rc = pci_bus_add_child(child); + if (rc) { + pci_remove_bus(child); + return NULL; + } + + return child; +} + +static void virtfn_remove_bus(struct pci_bus *bus, int busnr) +{ + struct pci_bus *child; + + if (bus-number == busnr) + return; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + BUG_ON(!child); + + if (list_empty(child-devices)) + pci_remove_bus(child); +} + +static int virtfn_add(struct pci_dev *dev, int id, int reset) +{ + int i; + int rc; + u64 size; + u8 busnr, devfn; + char buf[VIRTFN_ID_LEN]; + struct pci_dev *virtfn; + struct resource *res; + struct pci_sriov *iov = dev-sriov; + + virtfn = alloc_pci_dev(); + if (!virtfn) + return -ENOMEM; + + virtfn_bdf(dev, id, busnr, devfn); + mutex_lock(iov-pdev-sriov-lock); + virtfn-bus = virtfn_add_bus(dev-bus, busnr); + if (!virtfn-bus) { + kfree(virtfn); + mutex_unlock(iov-pdev-sriov-lock); + return -ENOMEM; + } + + virtfn-sysdata = dev-bus-sysdata; + virtfn-dev.parent = dev-dev.parent; + virtfn-dev.bus = dev-dev.bus; + virtfn-devfn = devfn; + 
virtfn-hdr_type = PCI_HEADER_TYPE_NORMAL; + virtfn-cfg_size = PCI_CFG_SPACE_EXP_SIZE; + virtfn-error_state = pci_channel_io_normal; + virtfn-current_state = PCI_UNKNOWN; + virtfn-is_pcie = 1; + virtfn-pcie_type = PCI_EXP_TYPE_ENDPOINT; + virtfn-dma_mask = 0x; + virtfn-vendor = dev-vendor; + virtfn-subsystem_vendor = dev-subsystem_vendor; + virtfn-class = dev-class; + pci_read_config_word(dev, iov-pos + PCI_SRIOV_VF_DID, virtfn-device); + pci_read_config_byte(virtfn, PCI_REVISION_ID, virtfn-revision); + pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID, +virtfn-subsystem_device); + + dev_set_name(virtfn-dev, %04x:%02x:%02x.%d, +pci_domain_nr(virtfn-bus), busnr, +PCI_SLOT(devfn), PCI_FUNC(devfn)); + + for (i = 0; i PCI_SRIOV_NUM_BARS; i++) { + res = dev-resource + PCI_SRIOV_RESOURCES + i; + if (!res-parent) + continue; + virtfn-resource[i].name = pci_name(virtfn); + virtfn-resource[i].flags = res-flags; + size = resource_size(res); + do_div(size, iov-total); + virtfn-resource[i].start = res-start + size * id; + virtfn-resource[i].end = virtfn-resource[i].start + size - 1; + rc = request_resource(res, virtfn-resource[i]); + BUG_ON(rc); + } + + if (reset) + pci_execute_reset_function(virtfn); + + pci_device_add(virtfn, virtfn-bus); + mutex_unlock(iov-pdev-sriov-lock); + + virtfn-physfn = pci_dev_get(dev); + + rc = pci_bus_add_device(virtfn); + if (rc) + goto failed1; + sprintf(buf, %d, id); + rc = sysfs_create_link(iov-dev.kobj, virtfn-dev.kobj, buf); + if (rc) + goto failed1; + rc = sysfs_create_link(virtfn-dev.kobj, dev-dev.kobj, physfn); + if (rc) + goto failed2; + + kobject_uevent(virtfn-dev.kobj, KOBJ_CHANGE); + + return 0; + +failed2: + sysfs_remove_link(iov-dev.kobj, buf); +failed1: + pci_dev_put(dev); + mutex_lock(iov-pdev-sriov-lock); + pci_remove_bus_device(virtfn); + virtfn_remove_bus(dev-bus, busnr); + mutex_unlock(iov-pdev-sriov-lock); + + return rc; +} + +static void virtfn_remove(struct pci_dev *dev, int id, int reset) +{ + u8 busnr, devfn; + char 
buf[VIRTFN_ID_LEN]; + struct
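Two pieces of arithmetic in virtfn_add()/virtfn_bdf() above are easy to model in isolation: deriving a VF's Routing ID from the SR-IOV capability's First VF Offset and VF Stride fields, and carving each VF BAR out of the PF's VF BAR aperture. The following is a userspace sketch — helper names and struct layout are illustrative, not the kernel code itself; only the formulas mirror the patch and the SR-IOV spec:

```c
#include <assert.h>
#include <stdint.h>

/* Model of virtfn_bdf(): the VF's 16-bit bus/devfn is the PF's Routing ID
 * plus First VF Offset plus VF Stride * id, per the SR-IOV spec. */
static void vf_bdf(uint8_t pf_bus, uint8_t pf_devfn,
                   uint16_t offset, uint16_t stride, int id,
                   uint8_t *busnr, uint8_t *devfn)
{
    uint16_t bdf = (uint16_t)(((uint16_t)pf_bus << 8) + pf_devfn +
                              offset + stride * id);

    *busnr = bdf >> 8;    /* upper byte: VF bus number */
    *devfn = bdf & 0xff;  /* lower byte: VF devfn */
}

/* Model of the BAR slicing in virtfn_add(): the PF's VF BAR resource is
 * split into 'total' equal slices and VF 'id' gets slice number 'id'. */
struct vf_bar { uint64_t start, end; };

static struct vf_bar vf_bar_slice(uint64_t res_start, uint64_t res_size,
                                  int total, int id)
{
    uint64_t size = res_size / total;  /* per-VF slice size */
    struct vf_bar bar = { res_start + size * id, 0 };

    bar.end = bar.start + size - 1;
    return bar;
}
```

Note how a large enough offset pushes the VF onto a different bus number than its PF, which is why virtfn_add_bus() may have to create a child bus.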
[PATCH v9 6/7] PCI: document SR-IOV sysfs entries
Signed-off-by: Yu Zhao yu.z...@intel.com --- Documentation/ABI/testing/sysfs-bus-pci | 27 +++ 1 file changed, 27 insertions(+), 0 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index ceddcff..84dc100 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -9,3 +9,30 @@ Description: that some devices may have malformatted data. If the underlying VPD has a writable section then the corresponding section of this file will be writable. + +What: /sys/bus/pci/devices/.../virtfn/N +Date: February 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbolic link appears when the hardware supports the + SR-IOV capability and the Physical Function driver has + enabled it. The link points to the PCI device sysfs entry + of the Virtual Function whose index is N (0...MaxVFs-1). + +What: /sys/bus/pci/devices/.../virtfn/dep_link +Date: February 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbolic link appears when the hardware supports the + SR-IOV capability, the Physical Function driver has enabled + it, and this device has vendor-specific dependencies on + others. The link points to the PCI device sysfs entry of + the Physical Function this device depends on. + +What: /sys/bus/pci/devices/.../physfn +Date: February 2009 +Contact: Yu Zhao yu.z...@intel.com +Description: + This symbolic link appears when a device is a Virtual + Function. The link points to the PCI device sysfs entry of + the Physical Function this device is associated with. -- 1.5.6.4
[PATCH v9 7/7] PCI: manual for SR-IOV user and driver developer
Signed-off-by: Yu Zhao yu.z...@intel.com --- Documentation/DocBook/kernel-api.tmpl |1 + Documentation/PCI/pci-iov-howto.txt | 99 + 2 files changed, 100 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 5818ff7..506e611 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c -- !Edrivers/pci/probe.c !Edrivers/pci/rom.c +!Edrivers/pci/iov.c </sect1> <sect1><title>PCI Hotplug Support Library</title> !Edrivers/pci/hotplug/pci_hotplug_core.c diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 000..fc73ef5 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt @@ -0,0 +1,99 @@ + PCI Express I/O Virtualization Howto + Copyright (C) 2009 Intel Corporation + Yu Zhao yu.z...@intel.com + + +1. Overview + +1.1 What is SR-IOV + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as the Physical Function (PF), +while the virtual devices are referred to as Virtual Functions (VFs). +Allocation of VFs can be dynamically controlled by the PF via +registers encapsulated in the capability. By default, this feature is +not enabled and the PF behaves as a traditional PCIe device. Once it is +turned on, each VF's PCI configuration space can be accessed by its own +Bus, Device and Function Number (Routing ID). Each VF also has PCI +Memory Space, which is used to map its register set. The VF device driver +operates on this register set, so the VF can be functional and appear as a +real PCI device. + +2. User Guide + +2.1 How can I enable the SR-IOV capability + +The device driver (PF driver) controls the enabling and disabling +of the capability via the API provided by the SR-IOV core. If the hardware +has the SR-IOV capability, loading its PF driver enables it and all +VFs associated with the PF. + +2.2 How can I use the Virtual Functions + +VFs are treated as hot-plugged PCI devices in the kernel, so they +should be able to work in the same way as real PCI devices. A VF +requires a device driver, the same as a normal PCI device does. + +3. Developer Guide + +3.1 SR-IOV API + +To enable the SR-IOV capability: + int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); + 'nr_virtfn' is the number of VFs to be enabled. + +To disable the SR-IOV capability: + void pci_disable_sriov(struct pci_dev *dev); + +To notify the SR-IOV core of Virtual Function Migration: + irqreturn_t pci_sriov_migration(struct pci_dev *dev); + +3.2 Usage example + +The following piece of code illustrates the usage of the SR-IOV API. + +static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id) +{ + pci_enable_sriov(dev, NR_VIRTFN); + + ... + + return 0; +} + +static void __devexit dev_remove(struct pci_dev *dev) +{ + pci_disable_sriov(dev); + + ... +} + +static int dev_suspend(struct pci_dev *dev, pm_message_t state) +{ + ... + + return 0; +} + +static int dev_resume(struct pci_dev *dev) +{ + ... + + return 0; +} + +static void dev_shutdown(struct pci_dev *dev) +{ + ... +} + +static struct pci_driver dev_driver = { + .name = SR-IOV Physical Function driver, + .id_table = dev_id_table, + .probe = dev_probe, + .remove = __devexit_p(dev_remove), + .suspend = dev_suspend, + .resume = dev_resume, + .shutdown = dev_shutdown, +}; -- 1.5.6.4
Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability
On Sat, Feb 14, 2009 at 12:56:44AM +0800, Andi Kleen wrote: Yu Zhao yu.z...@intel.com writes: + + +static int sriov_init(struct pci_dev *dev, int pos) +{ + int i; + int rc; + int nres; + u32 pgsz; + u16 ctrl, total, offset, stride; + struct pci_sriov *iov; + struct resource *res; + struct pci_dev *pdev; + + if (dev->pcie_type != PCI_EXP_TYPE_RC_END && + dev->pcie_type != PCI_EXP_TYPE_ENDPOINT) + return -ENODEV; + It would be a good idea to put a might_sleep() here just in case the msleep happens below and drivers call it incorrectly. Yes, will do. + pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl); + if (ctrl & PCI_SRIOV_CTRL_VFE) { + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0); + msleep(100); That's really long. Hopefully that's really needed. It's needed according to the SR-IOV spec; however, these lines clear the VF Enable bit only if the BIOS or something else has set it, so it doesn't always run into this. + + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride); + if (!offset || (total > 1 && !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz); + i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0; + pgsz &= ~((1 << i) - 1); + if (!pgsz) + return -EIO; All the error paths don't seem to undo the config space writes. How will the devices behave with half-initialized context? Since the VF Enable bit is cleared before the initialization, setting the other SR-IOV registers won't change the state of the device. So it should be OK even without undoing these writes, as long as the VF Enable bit is not set. Thanks, Yu
Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability
On Sat, Feb 14, 2009 at 01:49:59AM +0800, Matthew Wilcox wrote: On Fri, Feb 13, 2009 at 05:56:44PM +0100, Andi Kleen wrote: + pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl); + if (ctrl & PCI_SRIOV_CTRL_VFE) { + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0); + msleep(100); That's really long. Hopefully that's really needed. Yes and no. The spec says: To allow components to perform internal initialization, system software must wait for at least 100 ms after changing the VF Enable bit from a 0 to a 1, before it is permitted to issue Configuration Requests to the VFs which are enabled by that VF Enable bit. So we don't have to wait here, but we do have to wait before exposing all these virtual functions to the rest of the system. Should we add more complexity, perhaps spawn a thread to do it asynchronously, or add 0.1 seconds to device initialisation? A question without an easy answer, IMO. This clears the VF Enable bit only if the BIOS has set it, so it doesn't always happen. Actually the `msleep(100)' should be `ssleep(1)' here, according to the spec you quoted below. I remembered the waiting time incorrectly as 100 ms, which is the requirement for setting the VF Enable bit rather than clearing it. + + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride); + if (!offset || (total > 1 && !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz); + i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0; + pgsz &= ~((1 << i) - 1); + if (!pgsz) + return -EIO; All the error paths don't seem to undo the config space writes. How will the devices behave with half-initialized context? I think we should clear the VF_ENABLE bit. That action is also fraught with danger: The VF Enable bit hasn't been set yet :-) Actually the spec forbids software from writing those registers (NumVFs, Supported Page Size, etc.) when the enable bit is set. If software Clears VF Enable, software must allow 1 second after VF Enable is Cleared before reading any field in the SR-IOV Extended Capability or the VF Migration State Array (see Section 3.3.15.1). Another msleep(1000) here? Not pretty, but what else can we do? Not to mention the danger of something else innocently using lspci - to read a field in the extended capability -- I suspect we also need to block user config accesses before clearing this bit. Yes, we should block user config access.
[PATCH v3 0/6] ATS capability support for Intel IOMMU
This patch series implements Address Translation Service support for the Intel IOMMU. ATS allows a PCIe Endpoint to request DMA address translations from the IOMMU and cache them in the Endpoint, thus alleviating IOMMU TLB pressure and improving hardware performance in the I/O virtualization environment. Changelog: v2 -> v3 1, throw an error message if the VT-d hardware detects an invalid descriptor on the Queued Invalidation interface (David Woodhouse) 2, avoid using pci_find_ext_capability every time when reading the ATS Invalidate Queue Depth (Matthew Wilcox) Changelog: v1 -> v2 added 'static' prefix to a local LIST_HEAD (Andrew Morton) Yu Zhao (6): PCI: support the ATS capability VT-d: parse ATSR in DMA Remapping Reporting Structure VT-d: add queue invalidation fault status support VT-d: add device IOTLB invalidation support VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps VT-d: support the device IOTLB drivers/pci/dmar.c | 230 ++ drivers/pci/intel-iommu.c| 135 - drivers/pci/intr_remapping.c | 21 ++-- drivers/pci/pci.c| 72 + include/linux/dmar.h |9 ++ include/linux/intel-iommu.h | 19 +++- include/linux/pci.h | 16 +++ include/linux/pci_regs.h | 10 ++ 8 files changed, 457 insertions(+), 55 deletions(-)
[PATCH v3 1/6] PCI: support the ATS capability
The ATS spec can be found at http://www.pcisig.com/specifications/iov/ats/ (it requires membership). Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/pci.c| 72 ++ include/linux/pci.h | 16 ++ include/linux/pci_regs.h | 10 ++ 3 files changed, 98 insertions(+), 0 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index e3efe6b..87018ab 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1462,6 +1462,78 @@ void pci_enable_ari(struct pci_dev *dev) } /** + * pci_enable_ats - enable the ATS capability + * @dev: the PCI device + * @ps: the IOMMU page shift + * + * Returns 0 on success, or a negative value on error. + */ +int pci_enable_ats(struct pci_dev *dev, int ps) +{ + int pos; + u16 ctrl; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS); + if (!pos) + return -ENODEV; + + if (ps PCI_ATS_MIN_STU) + return -EINVAL; + + ctrl = PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU) | PCI_ATS_CTRL_ENABLE; + pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl); + + dev-ats_enabled = 1; + + return 0; +} + +/** + * pci_disable_ats - disable the ATS capability + * @dev: the PCI device + */ +void pci_disable_ats(struct pci_dev *dev) +{ + int pos; + u16 ctrl; + + if (!dev-ats_enabled) + return; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS); + if (!pos) + return; + + pci_read_config_word(dev, pos + PCI_ATS_CTRL, ctrl); + ctrl = ~PCI_ATS_CTRL_ENABLE; + pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl); +} + +/** + * pci_ats_queue_depth - query ATS Invalidate Queue Depth + * @dev: the PCI device + * + * Returns the queue depth on success, or 0 on error. + */ +int pci_ats_queue_depth(struct pci_dev *dev) +{ + int pos; + u16 cap; + + if (dev-ats_qdep) + return dev-ats_qdep; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS); + if (!pos) + return 0; + + pci_read_config_word(dev, pos + PCI_ATS_CAP, cap); + dev-ats_qdep = PCI_ATS_CAP_QDEP(cap) ? 
PCI_ATS_CAP_QDEP(cap) : +PCI_ATS_MAX_QDEP; + return dev-ats_qdep; +} + +/** * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge * @dev: the PCI device * @pin: the INTx pin (1=INTA, 2=INTB, 3=INTD, 4=INTD) diff --git a/include/linux/pci.h b/include/linux/pci.h index 7bd624b..cab680b 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -254,6 +254,7 @@ struct pci_dev { unsigned intmsi_enabled:1; unsigned intmsix_enabled:1; unsigned intari_enabled:1; /* ARI forwarding */ + unsigned intats_enabled:1; /* Address Translation Service */ unsigned intis_managed:1; unsigned intis_pcie:1; unsigned intstate_saved:1; @@ -270,6 +271,7 @@ struct pci_dev { struct list_head msi_list; #endif struct pci_vpd *vpd; + int ats_qdep; /* ATS Invalidate Queue Depth */ }; extern struct pci_dev *alloc_pci_dev(void); @@ -1194,5 +1196,19 @@ int pci_ext_cfg_avail(struct pci_dev *dev); void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar); +extern int pci_enable_ats(struct pci_dev *dev, int ps); +extern void pci_disable_ats(struct pci_dev *dev); +extern int pci_ats_queue_depth(struct pci_dev *dev); +/** + * pci_ats_enabled - query the ATS status + * @dev: the PCI device + * + * Returns 1 if ATS capability is enabled, or 0 if not. 
+ */ +static inline int pci_ats_enabled(struct pci_dev *dev) +{ + return dev-ats_enabled; +} + #endif /* __KERNEL__ */ #endif /* LINUX_PCI_H */ diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h index 027815b..3858b4f 100644 --- a/include/linux/pci_regs.h +++ b/include/linux/pci_regs.h @@ -498,6 +498,7 @@ #define PCI_EXT_CAP_ID_DSN 3 #define PCI_EXT_CAP_ID_PWR 4 #define PCI_EXT_CAP_ID_ARI 14 +#define PCI_EXT_CAP_ID_ATS 15 /* Advanced Error Reporting */ #define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */ @@ -615,4 +616,13 @@ #define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */ #define PCI_ARI_CTRL_FG(x)(((x) 4) 7) /* Function Group */ +/* Address Translation Service */ +#define PCI_ATS_CAP0x04/* ATS Capability Register */ +#define PCI_ATS_CAP_QDEP(x) ((x) 0x1f)/* Invalidate Queue Depth */ +#define PCI_ATS_MAX_QDEP 32 /* Max Invalidate Queue Depth */ +#define PCI_ATS_CTRL 0x06/* ATS Control Register */ +#define PCI_ATS_CTRL_ENABLE 0x8000 /* ATS Enable */ +#define PCI_ATS_CTRL_STU(x) ((x) 0x1f)/* Smallest Translation Unit */ +#define PCI_ATS_MIN_STU 12 /* shift of minimum STU block */ + #endif /* LINUX_PCI_REGS_H */ -- 1.6.1 -- To unsubscribe from
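The register arithmetic in pci_enable_ats() and pci_ats_queue_depth() above can be checked in isolation: the Smallest Translation Unit field stores the page shift minus 12, and an Invalidate Queue Depth field of 0 means the maximum depth of 32. A userspace sketch using the proposed pci_regs.h constants (the helper names are illustrative, not part of the patch):

```c
#include <assert.h>

/* Constants mirror the pci_regs.h additions in the patch above. */
#define PCI_ATS_CAP_QDEP(x)  ((x) & 0x1f)  /* Invalidate Queue Depth */
#define PCI_ATS_MAX_QDEP     32            /* max Invalidate Queue Depth */
#define PCI_ATS_CTRL_ENABLE  0x8000        /* ATS Enable */
#define PCI_ATS_CTRL_STU(x)  ((x) & 0x1f)  /* Smallest Translation Unit */
#define PCI_ATS_MIN_STU      12            /* shift of minimum STU block */

/* Value pci_enable_ats() would write to the ATS Control Register for a
 * given IOMMU page shift (caller ensures page_shift >= PCI_ATS_MIN_STU). */
static unsigned short ats_ctrl_value(int page_shift)
{
    return PCI_ATS_CTRL_STU(page_shift - PCI_ATS_MIN_STU) |
           PCI_ATS_CTRL_ENABLE;
}

/* Decode the queue depth as pci_ats_queue_depth() does: 0 encodes 32. */
static int ats_queue_depth(unsigned short cap)
{
    return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) : PCI_ATS_MAX_QDEP;
}
```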
[PATCH v3 3/6] VT-d: add queue invalidation fault status support
Check fault register after submitting an queue invalidation request. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 63 -- drivers/pci/intr_remapping.c | 21 -- include/linux/intel-iommu.h |4 ++- 3 files changed, 63 insertions(+), 25 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index bd37b3c..66dda07 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -671,19 +671,53 @@ static inline void reclaim_free_desc(struct q_inval *qi) } } +static int qi_check_fault(struct intel_iommu *iommu, int index) +{ + u32 fault; + int head; + struct q_inval *qi = iommu-qi; + int wait_index = (index + 1) % QI_LENGTH; + + fault = readl(iommu-reg + DMAR_FSTS_REG); + + /* +* If IQE happens, the head points to the descriptor associated +* with the error. No new descriptors are fetched until the IQE +* is cleared. +*/ + if (fault DMA_FSTS_IQE) { + head = readl(iommu-reg + DMAR_IQH_REG); + if ((head DMAR_IQ_OFFSET) == index) { + printk(KERN_ERR VT-d detected invalid descriptor: + low=%llx, high=%llx\n, + (unsigned long long)qi-desc[index].low, + (unsigned long long)qi-desc[index].high); + memcpy(qi-desc[index], qi-desc[wait_index], + sizeof(struct qi_desc)); + __iommu_flush_cache(iommu, qi-desc[index], + sizeof(struct qi_desc)); + writel(DMA_FSTS_IQE, iommu-reg + DMAR_FSTS_REG); + return -EINVAL; + } + } + + return 0; +} + /* * Submit the queued invalidation descriptor to the remapping * hardware unit and wait for its completion. 
*/ -void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) +int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) { + int rc = 0; struct q_inval *qi = iommu-qi; struct qi_desc *hw, wait_desc; int wait_index, index; unsigned long flags; if (!qi) - return; + return 0; hw = qi-desc; @@ -701,7 +735,8 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) hw[index] = *desc; - wait_desc.low = QI_IWD_STATUS_DATA(2) | QI_IWD_STATUS_WRITE | QI_IWD_TYPE; + wait_desc.low = QI_IWD_STATUS_DATA(QI_DONE) | + QI_IWD_STATUS_WRITE | QI_IWD_TYPE; wait_desc.high = virt_to_phys(qi-desc_status[wait_index]); hw[wait_index] = wait_desc; @@ -712,13 +747,11 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) qi-free_head = (qi-free_head + 2) % QI_LENGTH; qi-free_cnt -= 2; - spin_lock(iommu-register_lock); /* * update the HW tail register indicating the presence of * new descriptors. */ - writel(qi-free_head 4, iommu-reg + DMAR_IQT_REG); - spin_unlock(iommu-register_lock); + writel(qi-free_head DMAR_IQ_OFFSET, iommu-reg + DMAR_IQT_REG); while (qi-desc_status[wait_index] != QI_DONE) { /* @@ -728,6 +761,10 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) * a deadlock where the interrupt context can wait indefinitely * for free slots in the queue. 
*/ + rc = qi_check_fault(iommu, index); + if (rc) + break; + spin_unlock(qi-q_lock); cpu_relax(); spin_lock(qi-q_lock); @@ -737,6 +774,8 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) reclaim_free_desc(qi); spin_unlock_irqrestore(qi-q_lock, flags); + + return rc; } /* @@ -749,13 +788,13 @@ void qi_global_iec(struct intel_iommu *iommu) desc.low = QI_IEC_TYPE; desc.high = 0; + /* should never fail */ qi_submit_sync(desc, iommu); } int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, u64 type, int non_present_entry_flush) { - struct qi_desc desc; if (non_present_entry_flush) { @@ -769,10 +808,7 @@ int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, | QI_CC_GRAN(type) | QI_CC_TYPE; desc.high = 0; - qi_submit_sync(desc, iommu); - - return 0; - + return qi_submit_sync(desc, iommu); } int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, @@ -802,10 +838,7 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, desc.high = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih) | QI_IOTLB_AM(size_order
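The ring arithmetic in qi_submit_sync() is easy to get wrong, so a toy model helps: each submission occupies two slots — the request descriptor at 'index' and its wait descriptor immediately after it — and every index wraps modulo the queue length. QI_LENGTH is assumed to be 256 here, as in the kernel's queued-invalidation code:

```c
#include <assert.h>

#define QI_LENGTH 256  /* assumed queue size, matching intel-iommu */

/* The wait descriptor sits in the slot right after the request. */
static int qi_wait_index(int index)
{
    return (index + 1) % QI_LENGTH;
}

/* Each submission consumes two slots, so the free head advances by two. */
static int qi_next_free_head(int free_head)
{
    return (free_head + 2) % QI_LENGTH;
}
```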
[PATCH v3 4/6] VT-d: add device IOTLB invalidation support
Support device IOTLB invalidation to flush the translation cached in the Endpoint. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 63 -- include/linux/intel-iommu.h | 13 - 2 files changed, 72 insertions(+), 4 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index 66dda07..93b38e7 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -664,7 +664,8 @@ void free_iommu(struct intel_iommu *iommu) */ static inline void reclaim_free_desc(struct q_inval *qi) { - while (qi-desc_status[qi-free_tail] == QI_DONE) { + while (qi-desc_status[qi-free_tail] == QI_DONE || + qi-desc_status[qi-free_tail] == QI_ABORT) { qi-desc_status[qi-free_tail] = QI_FREE; qi-free_tail = (qi-free_tail + 1) % QI_LENGTH; qi-free_cnt++; @@ -674,10 +675,13 @@ static inline void reclaim_free_desc(struct q_inval *qi) static int qi_check_fault(struct intel_iommu *iommu, int index) { u32 fault; - int head; + int head, tail; struct q_inval *qi = iommu-qi; int wait_index = (index + 1) % QI_LENGTH; + if (qi-desc_status[wait_index] == QI_ABORT) + return -EAGAIN; + fault = readl(iommu-reg + DMAR_FSTS_REG); /* @@ -701,6 +705,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int index) } } + /* +* If ITE happens, all pending wait_desc commands are aborted. +* No new descriptors are fetched until the ITE is cleared. 
+*/ + if (fault DMA_FSTS_ITE) { + head = readl(iommu-reg + DMAR_IQH_REG); + head = ((head DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH; + head |= 1; + tail = readl(iommu-reg + DMAR_IQT_REG); + tail = ((tail DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH; + + writel(DMA_FSTS_ITE, iommu-reg + DMAR_FSTS_REG); + + do { + if (qi-desc_status[head] == QI_IN_USE) + qi-desc_status[head] = QI_ABORT; + head = (head - 2 + QI_LENGTH) % QI_LENGTH; + } while (head != tail); + + if (qi-desc_status[wait_index] == QI_ABORT) + return -EAGAIN; + } + + if (fault DMA_FSTS_ICE) + writel(DMA_FSTS_ICE, iommu-reg + DMAR_FSTS_REG); + return 0; } @@ -710,7 +740,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index) */ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) { - int rc = 0; + int rc; struct q_inval *qi = iommu-qi; struct qi_desc *hw, wait_desc; int wait_index, index; @@ -721,6 +751,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) hw = qi-desc; +restart: + rc = 0; + spin_lock_irqsave(qi-q_lock, flags); while (qi-free_cnt 3) { spin_unlock_irqrestore(qi-q_lock, flags); @@ -775,6 +808,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) reclaim_free_desc(qi); spin_unlock_irqrestore(qi-q_lock, flags); + if (rc == -EAGAIN) + goto restart; + return rc; } @@ -841,6 +877,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, return qi_submit_sync(desc, iommu); } +int qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep, + u64 addr, unsigned mask) +{ + struct qi_desc desc; + + if (mask) { + BUG_ON(addr ((1 (VTD_PAGE_SHIFT + mask)) - 1)); + addr |= (1 (VTD_PAGE_SHIFT + mask - 1)) - 1; + desc.high = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE; + } else + desc.high = QI_DEV_IOTLB_ADDR(addr); + + if (qdep = QI_DEV_IOTLB_MAX_INVS) + qdep = 0; + + desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) | + QI_DIOTLB_TYPE; + + return qi_submit_sync(desc, iommu); +} + /* * Enable Queued Invalidation 
interface. This is a must to support * interrupt-remapping. Also used by DMA-remapping, which replaces diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 0a220c9..d82bdac 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -196,6 +196,8 @@ static inline void dmar_writeq(void __iomem *addr, u64 val) #define DMA_FSTS_PPF ((u32)2) #define DMA_FSTS_PFO ((u32)1) #define DMA_FSTS_IQE (1 4) +#define DMA_FSTS_ICE (1 5) +#define DMA_FSTS_ITE (1 6) #define dma_fsts_fault_record_index(s) (((s) 8) 0xff) /* FRCD_REG, 32 bits access */ @@ -224,7 +226,8 @@ do { \ enum { QI_FREE, QI_IN_USE, - QI_DONE + QI_DONE, + QI_ABORT }; #define
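The size encoding built by qi_flush_dev_iotlb() above follows the VT-d device-IOTLB invalidate descriptor format: for a flush of 2^mask pages, bit (VTD_PAGE_SHIFT + mask - 1) of the address and every bit below it are set to ones. A minimal userspace model, with VTD_PAGE_SHIFT assumed to be 12:

```c
#include <assert.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT 12  /* assumed 4 KiB VT-d page size */

/* Encode the address/size field of a device-IOTLB invalidate descriptor:
 * mask == 0 flushes a single page; otherwise the low bits up to
 * (VTD_PAGE_SHIFT + mask - 1) are forced to one to encode 2^mask pages. */
static uint64_t dev_iotlb_addr(uint64_t addr, unsigned mask)
{
    if (mask)
        addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1;
    return addr;
}
```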
[PATCH v3 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
Make iommu_flush_iotlb_psi() and flush_unmaps() easier to read. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/intel-iommu.c | 46 +--- 1 files changed, 22 insertions(+), 24 deletions(-) diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index f4b7c79..5fdbed3 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -925,30 +925,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, u64 addr, unsigned int pages, int non_present_entry_flush) { - unsigned int mask; + int rc; + unsigned int mask = ilog2(__roundup_pow_of_two(pages)); BUG_ON(addr (~VTD_PAGE_MASK)); BUG_ON(pages == 0); - /* Fallback to domain selective flush if no PSI support */ - if (!cap_pgsel_inv(iommu-cap)) - return iommu-flush.flush_iotlb(iommu, did, 0, 0, - DMA_TLB_DSI_FLUSH, - non_present_entry_flush); - /* +* Fallback to domain selective flush if no PSI support or the size is +* too big. * PSI requires page size to be 2 ^ x, and the base address is naturally * aligned to the size */ - mask = ilog2(__roundup_pow_of_two(pages)); - /* Fallback to domain selective flush if size is too big */ - if (mask cap_max_amask_val(iommu-cap)) - return iommu-flush.flush_iotlb(iommu, did, 0, 0, - DMA_TLB_DSI_FLUSH, non_present_entry_flush); - - return iommu-flush.flush_iotlb(iommu, did, addr, mask, - DMA_TLB_PSI_FLUSH, - non_present_entry_flush); + if (!cap_pgsel_inv(iommu-cap) || mask cap_max_amask_val(iommu-cap)) + rc = iommu-flush.flush_iotlb(iommu, did, 0, 0, + DMA_TLB_DSI_FLUSH, + non_present_entry_flush); + else + rc = iommu-flush.flush_iotlb(iommu, did, addr, mask, + DMA_TLB_PSI_FLUSH, + non_present_entry_flush); + return rc; } static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu) @@ -2301,15 +2298,16 @@ static void flush_unmaps(void) if (!iommu) continue; - if (deferred_flush[i].next) { - iommu-flush.flush_iotlb(iommu, 0, 0, 0, -DMA_TLB_GLOBAL_FLUSH, 0); - for 
(j = 0; j deferred_flush[i].next; j++) { - __free_iova(deferred_flush[i].domain[j]-iovad, - deferred_flush[i].iova[j]); - } - deferred_flush[i].next = 0; + if (!deferred_flush[i].next) + continue; + + iommu-flush.flush_iotlb(iommu, 0, 0, 0, +DMA_TLB_GLOBAL_FLUSH, 0); + for (j = 0; j deferred_flush[i].next; j++) { + __free_iova(deferred_flush[i].domain[j]-iovad, + deferred_flush[i].iova[j]); } + deferred_flush[i].next = 0; } list_size = 0; -- 1.6.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
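The cleanup above hinges on one computation: mask = ilog2(__roundup_pow_of_two(pages)), with a fallback to a domain-selective flush when PSI is unsupported or the mask exceeds the hardware maximum. A sketch of that decision (a userspace model, not the kernel implementation):

```c
#include <assert.h>

/* ilog2(__roundup_pow_of_two(pages)): smallest mask with 2^mask >= pages. */
static unsigned int psi_mask(unsigned int pages)
{
    unsigned int mask = 0;

    while ((1u << mask) < pages)
        mask++;
    return mask;
}

/* Page-selective flush is used only if the hardware supports it and the
 * mask fits within its maximum address mask value. */
static int use_psi(int has_pgsel_inv, unsigned int mask, unsigned int max_amask)
{
    return has_pgsel_inv && mask <= max_amask;
}
```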
[PATCH v3 6/6] VT-d: support the device IOTLB
Support device IOTLB (i.e. ATS) for both native and KVM environments. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/intel-iommu.c | 95 +- include/linux/intel-iommu.h |1 + 2 files changed, 93 insertions(+), 3 deletions(-) diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index 5fdbed3..fe09e7a 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -125,6 +125,7 @@ static inline void context_set_fault_enable(struct context_entry *context) } #define CONTEXT_TT_MULTI_LEVEL 0 +#define CONTEXT_TT_DEV_IOTLB 1 static inline void context_set_translation_type(struct context_entry *context, unsigned long value) @@ -240,6 +241,7 @@ struct device_domain_info { struct list_head global; /* link to global list */ u8 bus; /* PCI bus numer */ u8 devfn; /* PCI devfn number */ + struct intel_iommu *iommu; /* IOMMU used by this device */ struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */ struct dmar_domain *domain; /* pointer to domain */ }; @@ -922,6 +924,74 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, return 0; } +static struct device_domain_info * +iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn) +{ + int found = 0; + unsigned long flags; + struct device_domain_info *info; + struct intel_iommu *iommu = device_to_iommu(bus, devfn); + + if (!ecap_dev_iotlb_support(iommu-ecap)) + return NULL; + + if (!iommu-qi) + return NULL; + + spin_lock_irqsave(device_domain_lock, flags); + list_for_each_entry(info, domain-devices, link) + if (info-dev info-bus == bus info-devfn == devfn) { + found = 1; + break; + } + spin_unlock_irqrestore(device_domain_lock, flags); + + if (!found) + return NULL; + + if (!dmar_find_matched_atsr_unit(info-dev)) + return NULL; + + info-iommu = iommu; + if (!pci_ats_queue_depth(info-dev)) + return NULL; + + return info; +} + +static void iommu_enable_dev_iotlb(struct device_domain_info *info) +{ + pci_enable_ats(info-dev, VTD_PAGE_SHIFT); +} + +static void 
iommu_disable_dev_iotlb(struct device_domain_info *info) +{ + if (info-dev pci_ats_enabled(info-dev)) + pci_disable_ats(info-dev); +} + +static void iommu_flush_dev_iotlb(struct dmar_domain *domain, + u64 addr, unsigned mask) +{ + int rc; + u16 sid, qdep; + unsigned long flags; + struct device_domain_info *info; + + spin_lock_irqsave(device_domain_lock, flags); + list_for_each_entry(info, domain-devices, link) { + if (!info-dev || !pci_ats_enabled(info-dev)) + continue; + + sid = info-bus 8 | info-devfn; + qdep = pci_ats_queue_depth(info-dev); + rc = qi_flush_dev_iotlb(info-iommu, sid, qdep, addr, mask); + if (rc) + printk(KERN_ERR IOMMU: flush device IOTLB failed\n); + } + spin_unlock_irqrestore(device_domain_lock, flags); +} + static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, u64 addr, unsigned int pages, int non_present_entry_flush) { @@ -945,6 +1015,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, rc = iommu-flush.flush_iotlb(iommu, did, addr, mask, DMA_TLB_PSI_FLUSH, non_present_entry_flush); + if (!rc !non_present_entry_flush) + iommu_flush_dev_iotlb(iommu-domains[did], addr, mask); + return rc; } @@ -1469,6 +1542,7 @@ static int domain_context_mapping_one(struct dmar_domain *domain, unsigned long ndomains; int id; int agaw; + struct device_domain_info *info; pr_debug(Set context mapping for %02x:%02x.%d\n, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); @@ -1534,7 +1608,11 @@ static int domain_context_mapping_one(struct dmar_domain *domain, context_set_domain_id(context, id); context_set_address_width(context, iommu-agaw); context_set_address_root(context, virt_to_phys(pgd)); - context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL); + info = iommu_support_dev_iotlb(domain, bus, devfn); + if (info) + context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB); + else + context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL); context_set_fault_enable(context); context_set_present(context); 
domain_flush_cache(domain, context, sizeof(*context)); @@ -1546,6
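iommu_flush_dev_iotlb() above targets each endpoint by its PCI source-id (Requester ID): bus number in the high byte, devfn in the low byte — the same 16-bit value the hardware uses to tag DMA requests. As a one-line model:

```c
#include <assert.h>
#include <stdint.h>

/* sid = bus << 8 | devfn, as computed in iommu_flush_dev_iotlb(). */
static uint16_t source_id(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)((bus << 8) | devfn);
}
```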
[PATCH v3 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure
Parse the Root Port ATS Capability Reporting Structure in DMA Remapping Reporting Structure ACPI table. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 112 -- include/linux/dmar.h|9 include/linux/intel-iommu.h |1 + 3 files changed, 116 insertions(+), 6 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index f5a662a..bd37b3c 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -254,6 +254,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru) } return ret; } + +static LIST_HEAD(dmar_atsr_units); + +static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr) +{ + struct acpi_dmar_atsr *atsr; + struct dmar_atsr_unit *atsru; + + atsr = container_of(hdr, struct acpi_dmar_atsr, header); + atsru = kzalloc(sizeof(*atsru), GFP_KERNEL); + if (!atsru) + return -ENOMEM; + + atsru-hdr = hdr; + atsru-include_all = atsr-flags 0x1; + + list_add(atsru-list, dmar_atsr_units); + + return 0; +} + +static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru) +{ + int rc; + struct acpi_dmar_atsr *atsr; + + if (atsru-include_all) + return 0; + + atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header); + rc = dmar_parse_dev_scope((void *)(atsr + 1), + (void *)atsr + atsr-header.length, + atsru-devices_cnt, atsru-devices, + atsr-segment); + if (rc || !atsru-devices_cnt) { + list_del(atsru-list); + kfree(atsru); + } + + return rc; +} + +int dmar_find_matched_atsr_unit(struct pci_dev *dev) +{ + int i; + struct pci_bus *bus; + struct acpi_dmar_atsr *atsr; + struct dmar_atsr_unit *atsru; + + list_for_each_entry(atsru, dmar_atsr_units, list) { + atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header); + if (atsr-segment == pci_domain_nr(dev-bus)) + goto found; + } + + return 0; + +found: + for (bus = dev-bus; bus; bus = bus-parent) { + struct pci_dev *bridge = bus-self; + + if (!bridge || !bridge-is_pcie || + bridge-pcie_type == PCI_EXP_TYPE_PCI_BRIDGE) + return 0; + + if (bridge-pcie_type == PCI_EXP_TYPE_ROOT_PORT) { + for (i = 0; i 
atsru-devices_cnt; i++) + if (atsru-devices[i] == bridge) + return 1; + break; + } + } + + if (atsru-include_all) + return 1; + + return 0; +} #endif static void __init @@ -261,22 +339,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header) { struct acpi_dmar_hardware_unit *drhd; struct acpi_dmar_reserved_memory *rmrr; + struct acpi_dmar_atsr *atsr; switch (header-type) { case ACPI_DMAR_TYPE_HARDWARE_UNIT: - drhd = (struct acpi_dmar_hardware_unit *)header; + drhd = container_of(header, struct acpi_dmar_hardware_unit, + header); printk (KERN_INFO PREFIX - DRHD (flags: 0x%08x)base: 0x%016Lx\n, - drhd-flags, (unsigned long long)drhd-address); + DRHD base: %#016Lx flags: %#x\n, + (unsigned long long)drhd-address, drhd-flags); break; case ACPI_DMAR_TYPE_RESERVED_MEMORY: - rmrr = (struct acpi_dmar_reserved_memory *)header; - + rmrr = container_of(header, struct acpi_dmar_reserved_memory, + header); printk (KERN_INFO PREFIX - RMRR base: 0x%016Lx end: 0x%016Lx\n, + RMRR base: %#016Lx end: %#016Lx\n, (unsigned long long)rmrr-base_address, (unsigned long long)rmrr-end_address); break; + case ACPI_DMAR_TYPE_ATSR: + atsr = container_of(header, struct acpi_dmar_atsr, header); + printk(KERN_INFO PREFIX ATSR flags: %#x\n, atsr-flags); + break; } } @@ -341,6 +425,11 @@ parse_dmar_table(void) ret = dmar_parse_one_rmrr(entry_header); #endif break; + case ACPI_DMAR_TYPE_ATSR: +#ifdef CONFIG_DMAR + ret = dmar_parse_one_atsr(entry_header); +#endif + break; default: printk(KERN_WARNING PREFIX Unknown DMAR structure type\n); @@ -409,11 +498,19 @@ int __init dmar_dev_scope_init(void) #ifdef CONFIG_DMAR { struct
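Editor's note: the core of dmar_find_matched_atsr_unit() above — walk up from the device to its Root Port, then check whether that Root Port is in the ATSR unit's device scope, falling back to the INCLUDE_ALL flag — can be modelled in plain userspace C. The mock structs below stand in for struct pci_dev and struct dmar_atsr_unit; all names are illustrative, not the kernel's:

```c
#include <stddef.h>

/* Mock of the upstream-bridge chain a pci_dev sits under. */
struct mock_dev {
	struct mock_dev *parent;	/* upstream bridge, NULL at the top */
	int is_root_port;		/* PCI_EXP_TYPE_ROOT_PORT analogue */
};

/* Mock of one ATSR unit: a device scope plus the ALL_PORTS flag. */
struct mock_atsr {
	int include_all;		/* bit 0 of the ATSR flags field */
	struct mock_dev **devices;	/* Root Ports in the device scope */
	int devices_cnt;
};

/* Returns 1 if 'dev' sits under a Root Port covered by this ATSR unit
 * (or the unit covers all ports in the segment), 0 otherwise. */
static int atsr_matches(const struct mock_atsr *atsru, const struct mock_dev *dev)
{
	const struct mock_dev *bridge;
	int i;

	for (bridge = dev->parent; bridge; bridge = bridge->parent) {
		if (bridge->is_root_port) {
			for (i = 0; i < atsru->devices_cnt; i++)
				if (atsru->devices[i] == bridge)
					return 1;
			break;	/* reached the Root Port; stop walking */
		}
	}

	return atsru->include_all;
}
```

(The real function also bails out early on legacy PCI bridges and matches the PCI segment first; those checks are omitted here for brevity.)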
[PATCH v8 0/7] PCI: Linux kernel SR-IOV support
Greetings,

The following patches are intended to support the SR-IOV capability in the Linux kernel. With these patches, a PCI device that has the capability can be turned into multiple devices from the software perspective, which benefits KVM and serves other purposes such as QoS and security.

The SR-IOV specification can be found at: http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf (it requires membership.)

Devices that support SR-IOV are available from the following vendors:
http://download.intel.com/design/network/ProdBrf/320025.pdf
http://www.neterion.com/products/x3100.html

A Physical Function driver for the Intel 82576 NIC (based on drivers/net/igb/) will come in a few weeks.

Major changes from v7 to v8:
1, simplified the API for the PF driver
2, split the code and respun it against the latest tree

Yu Zhao (7):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |  27 ++
 Documentation/DocBook/kernel-api.tmpl   |   1 +
 Documentation/PCI/pci-iov-howto.txt     | 106 +
 drivers/pci/Kconfig                     |  13 +
 drivers/pci/Makefile                    |   3 +
 drivers/pci/iov.c                       | 692 +++
 drivers/pci/pci.c                       |   8 +
 drivers/pci/pci.h                       |  53 +++
 drivers/pci/probe.c                     |   7 +
 include/linux/pci.h                     |  28 ++
 include/linux/pci_regs.h                |  33 ++
 11 files changed, 971 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 1/7] PCI: initialize and release SR-IOV capability
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/Kconfig | 13 drivers/pci/Makefile |3 + drivers/pci/iov.c| 178 ++ drivers/pci/pci.c|7 ++ drivers/pci/pci.h| 37 ++ drivers/pci/probe.c |4 + include/linux/pci.h |8 ++ include/linux/pci_regs.h | 33 + 8 files changed, 283 insertions(+), 0 deletions(-) create mode 100644 drivers/pci/iov.c diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 2a4501d..2d0ca01 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -59,3 +59,16 @@ config HT_IRQ This allows native hypertransport devices to use interrupts. If unsure say Y. + +config PCI_IOV + bool PCI IOV support + depends on PCI + select PCI_MSI + default n + help + PCI-SIG I/O Virtualization (IOV) Specifications support. + Single Root IOV: allows the Physical Function driver to enable + the hardware capability, so the Virtual Function is accessible + via the PCI Configuration Space using its own Bus, Device and + Function Numbers. Each Virtual Function also has the PCI Memory + Space to map the device specific register set. diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 3d07ce2..ba99282 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o +# PCI IOV support +obj-$(CONFIG_PCI_IOV) += iov.o + # # Some architectures use the generic PCI setup functions # diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c new file mode 100644 index 000..9a1fabd --- /dev/null +++ b/drivers/pci/iov.c @@ -0,0 +1,178 @@ +/* + * drivers/pci/iov.c + * + * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com + * + * PCI Express I/O Virtualization (IOV) support. 
+ * Single Root IOV 1.0 + */ + +#include linux/pci.h +#include pci.h + + +static int sriov_init(struct pci_dev *dev, int pos) +{ + int i; + int rc; + int nres; + u32 pgsz; + u16 ctrl, total, offset, stride; + struct pci_sriov *iov; + struct resource *res; + struct pci_dev *pdev; + + if (dev-pcie_type != PCI_EXP_TYPE_RC_END + dev-pcie_type != PCI_EXP_TYPE_ENDPOINT) + return -ENODEV; + + pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); + if (ctrl PCI_SRIOV_CTRL_VFE) { + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0); + msleep(100); + } + + pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, total); + if (!total) + return 0; + + list_for_each_entry(pdev, dev-bus-devices, bus_list) + if (pdev-sriov) + break; + if (list_empty(dev-bus-devices) || !pdev-sriov) + pdev = NULL; + + ctrl = 0; + if (!pdev pci_ari_enabled(dev-bus)) + ctrl |= PCI_SRIOV_CTRL_ARI; + + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl); + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, offset); + pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, stride); + if (!offset || (total 1 !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, pgsz); + i = PAGE_SHIFT 12 ? 
PAGE_SHIFT - 12 : 0; + pgsz = ~((1 i) - 1); + if (!pgsz) + return -EIO; + + pgsz = ~(pgsz - 1); + pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz); + + nres = 0; + for (i = 0; i PCI_SRIOV_NUM_BARS; i++) { + res = dev-resource + PCI_SRIOV_RESOURCES + i; + i += __pci_read_base(dev, pci_bar_unknown, res, +pos + PCI_SRIOV_BAR + i * 4); + if (!res-flags) + continue; + if (resource_size(res) (PAGE_SIZE - 1)) { + rc = -EIO; + goto failed; + } + res-end = res-start + resource_size(res) * total - 1; + nres++; + } + + iov = kzalloc(sizeof(*iov), GFP_KERNEL); + if (!iov) { + rc = -ENOMEM; + goto failed; + } + + iov-pos = pos; + iov-nres = nres; + iov-ctrl = ctrl; + iov-total = total; + iov-offset = offset; + iov-stride = stride; + iov-pgsz = pgsz; + iov-self = dev; + pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, iov-cap); + pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, iov-link); + + if (pdev) + iov-pdev = pci_dev_get(pdev); + else { + iov-pdev = dev; + mutex_init(iov-lock); + } + + dev-sriov = iov
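Editor's note: the System Page Size selection in sriov_init() above is worth unpacking. The SR-IOV capability reports supported page sizes as a bitmask in units of 4KB (bit n means 4KB * 2^n); the code masks off sizes smaller than the system page size, then isolates the lowest remaining set bit. A userspace sketch of that arithmetic (function name is illustrative):

```c
#include <stdint.h>

/* Pick the SR-IOV System Page Size: the smallest supported page size
 * that is at least the system page size (1 << page_shift).
 * Returns the selected bit, or 0 if the hardware cannot match. */
static uint32_t sriov_sys_pgsz(uint32_t supported, unsigned int page_shift)
{
	unsigned int i = page_shift > 12 ? page_shift - 12 : 0;

	/* Drop page sizes smaller than the system page size. */
	supported &= ~((1u << i) - 1);
	if (!supported)
		return 0;	/* corresponds to the -EIO path above */

	/* Isolate the lowest set bit: the smallest acceptable size. */
	return supported & ~(supported - 1);
}
```

For example, with a 64KB system page (page_shift 16) and a supported mask of 0x553, the bits for 4KB and 8KB are masked off and the 64KB bit (0x10) is selected.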
[PATCH v8 3/7] PCI: reserve bus range for SR-IOV device
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 34 ++ drivers/pci/pci.h |5 + drivers/pci/probe.c |3 +++ 3 files changed, 42 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index bd389b4..1cf13be 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -11,6 +11,16 @@ #include pci.h +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) +{ + u16 bdf; + + bdf = (dev-bus-number 8) + dev-devfn + + dev-sriov-offset + dev-sriov-stride * id; + *busnr = bdf 8; + *devfn = bdf 0xff; +} + static int sriov_init(struct pci_dev *dev, int pos) { int i; @@ -201,3 +211,27 @@ void pci_restore_iov_state(struct pci_dev *dev) if (dev-sriov) sriov_restore_state(dev); } + +/** + * pci_iov_bus_range - find bus range used by Virtual Function + * @bus: the PCI bus + * + * Returns max number of buses (exclude current one) used by Virtual + * Functions. + */ +int pci_iov_bus_range(struct pci_bus *bus) +{ + int max = 0; + u8 busnr, devfn; + struct pci_dev *dev; + + list_for_each_entry(dev, bus-devices, bus_list) { + if (!dev-sriov) + continue; + virtfn_bdf(dev, dev-sriov-total - 1, busnr, devfn); + if (busnr max) + max = busnr; + } + + return max ? 
max - bus-number : 0; +} diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 9d76737..fdfc476 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); extern void pci_restore_iov_state(struct pci_dev *dev); +extern int pci_iov_bus_range(struct pci_bus *bus); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, static inline void pci_restore_iov_state(struct pci_dev *dev) { } +static inline int pci_iov_bus_range(struct pci_bus *bus) +{ + return 0; +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 03b6f29..4c8abd0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) for (devfn = 0; devfn 0x100; devfn += 8) pci_scan_slot(bus, devfn); + /* Reserve buses for SR-IOV capability. */ + max += pci_iov_bus_range(bus); + /* * After performing arch-dependent fixup of the bus, look behind * all PCI-to-PCI bridges on this bus. -- 1.5.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
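Editor's note: the bus-range reservation above rests on the Routing ID arithmetic in virtfn_bdf(): a VF's bus/devfn is the PF's 16-bit Routing ID plus First VF Offset plus VF Stride times the VF index, so a large VF count can spill onto buses after the PF's. A userspace model of that arithmetic (function names are illustrative):

```c
#include <stdint.h>

/* 16-bit Routing ID of VF 'id', per the SR-IOV offset/stride rule. */
static uint16_t vf_routing_id(uint8_t pf_bus, uint8_t pf_devfn,
			      uint16_t offset, uint16_t stride, int id)
{
	return (uint16_t)(((uint16_t)pf_bus << 8) + pf_devfn +
			  offset + stride * id);
}

/* Model of pci_iov_bus_range(): how many buses beyond the PF's own
 * bus are needed to reach the last VF ('total' VFs configured). */
static int vf_extra_buses(uint8_t pf_bus, uint8_t pf_devfn,
			  uint16_t offset, uint16_t stride, int total)
{
	uint16_t last = vf_routing_id(pf_bus, pf_devfn, offset, stride,
				      total - 1);
	uint8_t busnr = last >> 8;

	return busnr > pf_bus ? busnr - pf_bus : 0;
}
```

For instance, a PF at 00:00.0 with offset 1 and stride 1 needs two extra buses to hold 512 VFs, which is exactly what pci_scan_child_bus() reserves via pci_iov_bus_range().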
[PATCH v8 2/7] PCI: restore saved SR-IOV state
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 25 + drivers/pci/pci.c |1 + drivers/pci/pci.h |4 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 9a1fabd..bd389b4 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -125,6 +125,21 @@ static void sriov_release(struct pci_dev *dev) dev-sriov = NULL; } +static void sriov_restore_state(struct pci_dev *dev) +{ + u16 ctrl; + struct pci_sriov *iov = dev-sriov; + + pci_read_config_word(dev, iov-pos + PCI_SRIOV_CTRL, ctrl); + if (ctrl PCI_SRIOV_CTRL_VFE) + return; + + pci_write_config_dword(dev, iov-pos + PCI_SRIOV_SYS_PGSIZE, iov-pgsz); + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + if (iov-ctrl PCI_SRIOV_CTRL_VFE) + msleep(100); +} + /** * pci_iov_init - initialize the IOV capability * @dev: the PCI device @@ -176,3 +191,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno, return dev-sriov-pos + PCI_SRIOV_BAR + 4 * (resno - PCI_SRIOV_RESOURCES); } + +/** + * pci_restore_iov_state - restore the state of the IOV capability + * @dev: the PCI device + */ +void pci_restore_iov_state(struct pci_dev *dev) +{ + if (dev-sriov) + sriov_restore_state(dev); +} diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index c4f14f3..f791dcf 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev) } pci_restore_pcix_state(dev); pci_restore_msi_state(dev); + pci_restore_iov_state(dev); return 0; } diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index d2dc6b7..9d76737 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev); extern void pci_iov_release(struct pci_dev *dev); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); +extern void pci_restore_iov_state(struct pci_dev *dev); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -230,6 +231,9 @@ static inline int 
pci_iov_resource_bar(struct pci_dev *dev, int resno, { return 0; } +static inline void pci_restore_iov_state(struct pci_dev *dev) +{ +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ -- 1.5.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 7/7] PCI: manual for SR-IOV user and driver developer
Signed-off-by: Yu Zhao yu.z...@intel.com --- Documentation/DocBook/kernel-api.tmpl |1 + Documentation/PCI/pci-iov-howto.txt | 106 + 2 files changed, 107 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 5818ff7..506e611 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c -- !Edrivers/pci/probe.c !Edrivers/pci/rom.c +!Edrivers/pci/iov.c /sect1 sect1titlePCI Hotplug Support Library/title !Edrivers/pci/hotplug/pci_hotplug_core.c diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 000..9029369 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt @@ -0,0 +1,106 @@ + PCI Express I/O Virtualization Howto + Copyright (C) 2009 Intel Corporation + Yu Zhao yu.z...@intel.com + + +1. Overview + +1.1 What is SR-IOV + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as Physical Function (PF) +while the virtual devices are referred to as Virtual Functions (VF). +Allocation of the VF can be dynamically controlled by the PF via +registers encapsulated in the capability. By default, this feature is +not enabled and the PF behaves as traditional PCIe device. Once it's +turned on, each VF's PCI configuration space can be accessed by its own +Bus, Device and Function Number (Routing ID). And each VF also has PCI +Memory Space, which is used to map its register set. VF device driver +operates on the register set so it can be functional and appear as a +real existing PCI device. + +2. User Guide + +2.1 How can I enable SR-IOV capability + +The device driver (PF driver) will control the enabling and disabling +of the capability via API provided by SR-IOV core. 
If the hardware
+has the SR-IOV capability, loading its PF driver will enable it and all
+VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+VFs are treated as hot-plugged PCI devices in the kernel, so they
+should work in the same way as real PCI devices. A VF requires a
+device driver, the same as a normal PCI device does.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable the SR-IOV capability:
+	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+'nr_virtfn' is the number of VFs to be enabled.
+
+To disable the SR-IOV capability:
+	void pci_disable_sriov(struct pci_dev *dev);
+
+To notify the SR-IOV core of Virtual Function Migration:
+	irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+
+	dev->current_state = PCI_D0;
+
+	pci_enable_sriov(dev, NR_VIRTFN);
+
+	...
+
+	return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+	pci_disable_sriov(dev);
+
+	...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+	...
+
+	return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+	pci_restore_state(dev);
+
+	...
+
+	return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+	...
+}
+
+static struct pci_driver dev_driver = {
+	.name = "SR-IOV Physical Function driver",
+	.id_table = dev_id_table,
+	.probe = dev_probe,
+	.remove = __devexit_p(dev_remove),
+#ifdef CONFIG_PM
+	.suspend = dev_suspend,
+	.resume = dev_resume,
+#endif
+	.shutdown = dev_shutdown,
+};
-- 
1.5.6.4
[PATCH v8 6/7] PCI: document SR-IOV sysfs entries
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/ABI/testing/sysfs-bus-pci | 27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index ceddcff..84dc100 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -9,3 +9,30 @@ Description:
 		that some devices may have malformatted data.  If the
 		underlying VPD has a writable section then the
 		corresponding section of this file will be writable.
+
+What:		/sys/bus/pci/devices/.../virtfn/N
+Date:		February 2009
+Contact:	Yu Zhao yu.z...@intel.com
+Description:
+		This symbolic link appears when hardware supports the SR-IOV
+		capability and the Physical Function driver has enabled it.
+		The link points to the PCI device sysfs entry of the
+		Virtual Function whose index is N (0...MaxVFs-1).
+
+What:		/sys/bus/pci/devices/.../virtfn/dep_link
+Date:		February 2009
+Contact:	Yu Zhao yu.z...@intel.com
+Description:
+		This symbolic link appears when hardware supports the SR-IOV
+		capability and the Physical Function driver has enabled it,
+		and this device has vendor-specific dependencies on other
+		devices. The link points to the PCI device sysfs entry of
+		the Physical Function this device depends on.
+
+What:		/sys/bus/pci/devices/.../physfn
+Date:		February 2009
+Contact:	Yu Zhao yu.z...@intel.com
+Description:
+		This symbolic link appears when the device is a Virtual
+		Function. The link points to the PCI device sysfs entry of
+		the Physical Function this device is associated with.
-- 
1.5.6.4
[PATCH v8 5/7] PCI: handle SR-IOV Virtual Function Migration
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 119 +++ drivers/pci/pci.h |4 ++ include/linux/pci.h |6 +++ 3 files changed, 129 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index d576160..d622167 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -203,6 +203,97 @@ static void sriov_release_dev(struct device *dev) iov-nr_virtfn = 0; } +static int sriov_migration(struct pci_dev *dev) +{ + u16 status; + struct pci_sriov *iov = dev-sriov; + + if (!iov-nr_virtfn) + return 0; + + if (!(iov-cap PCI_SRIOV_CAP_VFM)) + return 0; + + pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); + if (!(status PCI_SRIOV_STATUS_VFM)) + return 0; + + schedule_work(iov-mtask); + + return 1; +} + +static void sriov_migration_task(struct work_struct *work) +{ + int i; + u8 state; + u16 status; + struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask); + + for (i = iov-initial; i iov-nr_virtfn; i++) { + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_MI) { + writeb(PCI_SRIOV_VFM_AV, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 1); + } else if (state == PCI_SRIOV_VFM_MO) { + virtfn_remove(iov-self, i, 1); + writeb(PCI_SRIOV_VFM_UA, iov-mstate + i); + state = readb(iov-mstate + i); + if (state == PCI_SRIOV_VFM_AV) + virtfn_add(iov-self, i, 0); + } + } + + pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); + status = ~PCI_SRIOV_STATUS_VFM; + pci_write_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status); +} + +static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn) +{ + int bir; + u32 table; + resource_size_t pa; + struct pci_sriov *iov = dev-sriov; + + if (nr_virtfn = iov-initial) + return 0; + + pci_read_config_dword(dev, iov-pos + PCI_SRIOV_VFM, table); + bir = PCI_SRIOV_VFM_BIR(table); + if (bir PCI_STD_RESOURCE_END) + return -EIO; + + table = PCI_SRIOV_VFM_OFFSET(table); + if (table + nr_virtfn 
pci_resource_len(dev, bir)) + return -EIO; + + pa = pci_resource_start(dev, bir) + table; + iov-mstate = ioremap(pa, nr_virtfn); + if (!iov-mstate) + return -ENOMEM; + + INIT_WORK(iov-mtask, sriov_migration_task); + + iov-ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR; + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + return 0; +} + +static void sriov_disable_migration(struct pci_dev *dev) +{ + struct pci_sriov *iov = dev-sriov; + + iov-ctrl = ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR); + pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl); + + cancel_work_sync(iov-mtask); + iounmap(iov-mstate); +} + static int sriov_enable(struct pci_dev *dev, int nr_virtfn) { int rc; @@ -287,6 +378,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) goto failed2; } + if (iov-cap PCI_SRIOV_CAP_VFM) { + rc = sriov_enable_migration(dev, nr_virtfn); + if (rc) + goto failed2; + } + kobject_uevent(dev-dev.kobj, KOBJ_CHANGE); iov-nr_virtfn = nr_virtfn; @@ -316,6 +413,9 @@ static void sriov_disable(struct pci_dev *dev) if (!iov-nr_virtfn) return; + if (iov-cap PCI_SRIOV_CAP_VFM) + sriov_disable_migration(dev); + for (i = 0; i iov-nr_virtfn; i++) virtfn_remove(dev, i, 0); @@ -571,3 +671,22 @@ void pci_disable_sriov(struct pci_dev *dev) sriov_disable(dev); } EXPORT_SYMBOL_GPL(pci_disable_sriov); + +/** + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration + * @dev: the PCI device + * + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not. + * + * Physical Function driver is responsible to register IRQ handler using + * VF Migration Interrupt Message Number, and call this function when the + * interrupt is generated by the hardware. + */ +irqreturn_t pci_sriov_migration(struct pci_dev *dev) +{ + if (!dev-sriov) + return IRQ_NONE; + + return sriov_migration(dev) ? 
IRQ_HANDLED : IRQ_NONE; +} +EXPORT_SYMBOL_GPL(pci_sriov_migration); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 328a611..51bebb2 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -1,6 +1,8 @@ #ifndef
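Editor's note: the heart of sriov_migration_task() above is a small per-VF state machine driven by the VF Migration State Array: a migrate-in request (MI) is acknowledged by writing "available" (AV) and adding the VF, a migrate-out request (MO) by removing the VF and writing "unavailable" (UA). A userspace model of one step of that machine (the enum values are local assumptions standing in for the PCI_SRIOV_VFM_* constants, and the read-back-after-write check of the real code is omitted):

```c
#include <stdint.h>

/* Illustrative stand-ins for the PCI_SRIOV_VFM_* state encodings. */
enum { VFM_UA = 0, VFM_MI = 1, VFM_MO = 2, VFM_AV = 3 };

/* Process one VF's migration state byte the way the work handler would.
 * Returns +1 if the VF should be added, -1 if it should be removed,
 * 0 if nothing is pending; updates the state byte in place. */
static int vfm_step(uint8_t *state)
{
	if (*state == VFM_MI) {
		*state = VFM_AV;	/* acknowledge: VF is now available */
		return 1;		/* -> virtfn_add() */
	}
	if (*state == VFM_MO) {
		*state = VFM_UA;	/* VF has migrated out: unavailable */
		return -1;		/* -> virtfn_remove() */
	}
	return 0;
}
```

After sweeping all VFs this way, the real handler clears the VF Migration Status bit in the SR-IOV Status register so the next interrupt can be latched.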
[PATCH v8 4/7] PCI: add SR-IOV API for Physical Function driver
Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/iov.c | 336 +++ drivers/pci/pci.h |3 + include/linux/pci.h | 14 ++ 3 files changed, 353 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 1cf13be..d576160 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -10,6 +10,8 @@ #include linux/pci.h #include pci.h +#define VIRTFN_ID_LEN 8 + static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) { @@ -21,6 +23,311 @@ static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn) *devfn = bdf 0xff; } +static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr) +{ + int rc; + struct pci_bus *child; + + if (bus-number == busnr) + return bus; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + if (child) + return child; + + child = pci_add_new_bus(bus, NULL, busnr); + if (!child) + return NULL; + + child-subordinate = busnr; + child-dev.parent = bus-bridge; + rc = pci_bus_add_child(child); + if (rc) { + pci_remove_bus(child); + return NULL; + } + + return child; +} + +static void virtfn_remove_bus(struct pci_bus *bus, int busnr) +{ + struct pci_bus *child; + + if (bus-number == busnr) + return; + + child = pci_find_bus(pci_domain_nr(bus), busnr); + BUG_ON(!child); + + if (list_empty(child-devices)) + pci_remove_bus(child); +} + +static int virtfn_add(struct pci_dev *dev, int id, int reset) +{ + int i; + int rc; + u64 size; + u8 busnr, devfn; + char buf[VIRTFN_ID_LEN]; + struct pci_dev *virtfn; + struct resource *res; + struct pci_sriov *iov = dev-sriov; + + virtfn = alloc_pci_dev(); + if (!virtfn) + return -ENOMEM; + + virtfn_bdf(dev, id, busnr, devfn); + mutex_lock(iov-pdev-sriov-lock); + virtfn-bus = virtfn_add_bus(dev-bus, busnr); + if (!virtfn-bus) { + kfree(virtfn); + mutex_unlock(iov-pdev-sriov-lock); + return -ENOMEM; + } + + virtfn-sysdata = dev-bus-sysdata; + virtfn-dev.parent = dev-dev.parent; + virtfn-dev.bus = dev-dev.bus; + virtfn-devfn = devfn; + 
virtfn-hdr_type = PCI_HEADER_TYPE_NORMAL; + virtfn-cfg_size = PCI_CFG_SPACE_EXP_SIZE; + virtfn-error_state = pci_channel_io_normal; + virtfn-current_state = PCI_UNKNOWN; + virtfn-is_pcie = 1; + virtfn-pcie_type = PCI_EXP_TYPE_ENDPOINT; + virtfn-dma_mask = 0x; + virtfn-vendor = dev-vendor; + virtfn-subsystem_vendor = dev-subsystem_vendor; + virtfn-class = dev-class; + pci_read_config_word(dev, iov-pos + PCI_SRIOV_VF_DID, virtfn-device); + pci_read_config_byte(virtfn, PCI_REVISION_ID, virtfn-revision); + pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID, +virtfn-subsystem_device); + + dev_set_name(virtfn-dev, %04x:%02x:%02x.%d, +pci_domain_nr(virtfn-bus), busnr, +PCI_SLOT(devfn), PCI_FUNC(devfn)); + + for (i = 0; i PCI_SRIOV_NUM_BARS; i++) { + res = dev-resource + PCI_SRIOV_RESOURCES + i; + if (!res-parent) + continue; + virtfn-resource[i].name = pci_name(virtfn); + virtfn-resource[i].flags = res-flags; + size = resource_size(res); + do_div(size, iov-total); + virtfn-resource[i].start = res-start + size * id; + virtfn-resource[i].end = virtfn-resource[i].start + size - 1; + rc = request_resource(res, virtfn-resource[i]); + BUG_ON(rc); + } + + if (reset) + pci_execute_reset_function(virtfn); + + pci_device_add(virtfn, virtfn-bus); + mutex_unlock(iov-pdev-sriov-lock); + + virtfn-physfn = pci_dev_get(dev); + + rc = pci_bus_add_device(virtfn); + if (rc) + goto failed1; + sprintf(buf, %d, id); + rc = sysfs_create_link(iov-dev.kobj, virtfn-dev.kobj, buf); + if (rc) + goto failed1; + rc = sysfs_create_link(virtfn-dev.kobj, dev-dev.kobj, physfn); + if (rc) + goto failed2; + + kobject_uevent(virtfn-dev.kobj, KOBJ_CHANGE); + + return 0; + +failed2: + sysfs_remove_link(iov-dev.kobj, buf); +failed1: + pci_dev_put(dev); + mutex_lock(iov-pdev-sriov-lock); + pci_remove_bus_device(virtfn); + virtfn_remove_bus(dev-bus, busnr); + mutex_unlock(iov-pdev-sriov-lock); + + return rc; +} + +static void virtfn_remove(struct pci_dev *dev, int id, int reset) +{ + u8 busnr, devfn; + char 
buf[VIRTFN_ID_LEN]; + struct
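Editor's note: the resource loop in virtfn_add() above carves each PF SR-IOV BAR into 'total' equal slices and hands slice 'id' to the VF. The slicing arithmetic can be sketched in userspace C (names are illustrative; the real code uses struct resource and do_div):

```c
#include <stdint.h>

struct range {
	uint64_t start, end;	/* inclusive, like struct resource */
};

/* Compute VF 'id''s slice of a PF SR-IOV BAR that spans
 * [pf_start, pf_end] and covers 'total' VFs. */
static struct range vf_bar_slice(uint64_t pf_start, uint64_t pf_end,
				 int total, int id)
{
	uint64_t size = (pf_end - pf_start + 1) / total;
	struct range r;

	r.start = pf_start + size * id;
	r.end = r.start + size - 1;
	return r;
}
```

So with a 4KB-per-VF BAR at 0x1000 sized for 4 VFs, VF 2 gets [0x1800, 0x1bff]; request_resource() in the real code then claims that window under the PF's resource.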
[PATCH v2 0/6] ATS capability support for Intel IOMMU
This patch series implements Address Translation Services (ATS) support for the Intel IOMMU. ATS allows a PCIe Endpoint to request DMA address translations from the IOMMU and cache them in the Endpoint itself, which alleviates IOMMU TLB pressure and improves hardware performance in an I/O virtualization environment.

Changelog: v1 -> v2
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)

Yu Zhao (6):
  PCI: support the ATS capability
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add queue invalidation fault status support
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c           | 226 ++
 drivers/pci/intel-iommu.c    | 137 +-
 drivers/pci/intr_remapping.c |  21 +++--
 drivers/pci/pci.c            |  68 +
 include/linux/dmar.h         |   9 ++
 include/linux/intel-iommu.h  |  19 +++-
 include/linux/pci.h          |  15 +++
 include/linux/pci_regs.h     |  10 ++
 8 files changed, 450 insertions(+), 55 deletions(-)
[PATCH v2 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure
Parse the Root Port ATS Capability Reporting Structure in DMA Remapping Reporting Structure ACPI table. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 112 -- include/linux/dmar.h|9 include/linux/intel-iommu.h |1 + 3 files changed, 116 insertions(+), 6 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index f5a662a..bd37b3c 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -254,6 +254,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru) } return ret; } + +static LIST_HEAD(dmar_atsr_units); + +static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr) +{ + struct acpi_dmar_atsr *atsr; + struct dmar_atsr_unit *atsru; + + atsr = container_of(hdr, struct acpi_dmar_atsr, header); + atsru = kzalloc(sizeof(*atsru), GFP_KERNEL); + if (!atsru) + return -ENOMEM; + + atsru-hdr = hdr; + atsru-include_all = atsr-flags 0x1; + + list_add(atsru-list, dmar_atsr_units); + + return 0; +} + +static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru) +{ + int rc; + struct acpi_dmar_atsr *atsr; + + if (atsru-include_all) + return 0; + + atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header); + rc = dmar_parse_dev_scope((void *)(atsr + 1), + (void *)atsr + atsr-header.length, + atsru-devices_cnt, atsru-devices, + atsr-segment); + if (rc || !atsru-devices_cnt) { + list_del(atsru-list); + kfree(atsru); + } + + return rc; +} + +int dmar_find_matched_atsr_unit(struct pci_dev *dev) +{ + int i; + struct pci_bus *bus; + struct acpi_dmar_atsr *atsr; + struct dmar_atsr_unit *atsru; + + list_for_each_entry(atsru, dmar_atsr_units, list) { + atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header); + if (atsr-segment == pci_domain_nr(dev-bus)) + goto found; + } + + return 0; + +found: + for (bus = dev-bus; bus; bus = bus-parent) { + struct pci_dev *bridge = bus-self; + + if (!bridge || !bridge-is_pcie || + bridge-pcie_type == PCI_EXP_TYPE_PCI_BRIDGE) + return 0; + + if (bridge-pcie_type == PCI_EXP_TYPE_ROOT_PORT) { + for (i = 0; i 
atsru-devices_cnt; i++) + if (atsru-devices[i] == bridge) + return 1; + break; + } + } + + if (atsru-include_all) + return 1; + + return 0; +} #endif static void __init @@ -261,22 +339,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header) { struct acpi_dmar_hardware_unit *drhd; struct acpi_dmar_reserved_memory *rmrr; + struct acpi_dmar_atsr *atsr; switch (header-type) { case ACPI_DMAR_TYPE_HARDWARE_UNIT: - drhd = (struct acpi_dmar_hardware_unit *)header; + drhd = container_of(header, struct acpi_dmar_hardware_unit, + header); printk (KERN_INFO PREFIX - DRHD (flags: 0x%08x)base: 0x%016Lx\n, - drhd-flags, (unsigned long long)drhd-address); + DRHD base: %#016Lx flags: %#x\n, + (unsigned long long)drhd-address, drhd-flags); break; case ACPI_DMAR_TYPE_RESERVED_MEMORY: - rmrr = (struct acpi_dmar_reserved_memory *)header; - + rmrr = container_of(header, struct acpi_dmar_reserved_memory, + header); printk (KERN_INFO PREFIX - RMRR base: 0x%016Lx end: 0x%016Lx\n, + RMRR base: %#016Lx end: %#016Lx\n, (unsigned long long)rmrr-base_address, (unsigned long long)rmrr-end_address); break; + case ACPI_DMAR_TYPE_ATSR: + atsr = container_of(header, struct acpi_dmar_atsr, header); + printk(KERN_INFO PREFIX ATSR flags: %#x\n, atsr-flags); + break; } } @@ -341,6 +425,11 @@ parse_dmar_table(void) ret = dmar_parse_one_rmrr(entry_header); #endif break; + case ACPI_DMAR_TYPE_ATSR: +#ifdef CONFIG_DMAR + ret = dmar_parse_one_atsr(entry_header); +#endif + break; default: printk(KERN_WARNING PREFIX Unknown DMAR structure type\n); @@ -409,11 +498,19 @@ int __init dmar_dev_scope_init(void) #ifdef CONFIG_DMAR { struct
[PATCH v2 3/6] VT-d: add queue invalidation fault status support
Check fault register after submitting an queue invalidation request. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 59 +++-- drivers/pci/intr_remapping.c | 21 -- include/linux/intel-iommu.h |4 ++- 3 files changed, 59 insertions(+), 25 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index bd37b3c..0c87ebd 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -671,19 +671,49 @@ static inline void reclaim_free_desc(struct q_inval *qi) } } +static int qi_check_fault(struct intel_iommu *iommu, int index) +{ + u32 fault; + int head; + struct q_inval *qi = iommu-qi; + int wait_index = (index + 1) % QI_LENGTH; + + fault = readl(iommu-reg + DMAR_FSTS_REG); + + /* +* If IQE happens, the head points to the descriptor associated +* with the error. No new descriptors are fetched until the IQE +* is cleared. +*/ + if (fault DMA_FSTS_IQE) { + head = readl(iommu-reg + DMAR_IQH_REG); + if ((head DMAR_IQ_OFFSET) == index) { + memcpy(qi-desc[index], qi-desc[wait_index], + sizeof(struct qi_desc)); + __iommu_flush_cache(iommu, qi-desc[index], + sizeof(struct qi_desc)); + writel(DMA_FSTS_IQE, iommu-reg + DMAR_FSTS_REG); + return -EINVAL; + } + } + + return 0; +} + /* * Submit the queued invalidation descriptor to the remapping * hardware unit and wait for its completion. 
*/ -void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) +int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) { + int rc = 0; struct q_inval *qi = iommu-qi; struct qi_desc *hw, wait_desc; int wait_index, index; unsigned long flags; if (!qi) - return; + return 0; hw = qi-desc; @@ -701,7 +731,8 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) hw[index] = *desc; - wait_desc.low = QI_IWD_STATUS_DATA(2) | QI_IWD_STATUS_WRITE | QI_IWD_TYPE; + wait_desc.low = QI_IWD_STATUS_DATA(QI_DONE) | + QI_IWD_STATUS_WRITE | QI_IWD_TYPE; wait_desc.high = virt_to_phys(qi-desc_status[wait_index]); hw[wait_index] = wait_desc; @@ -712,13 +743,11 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) qi-free_head = (qi-free_head + 2) % QI_LENGTH; qi-free_cnt -= 2; - spin_lock(iommu-register_lock); /* * update the HW tail register indicating the presence of * new descriptors. */ - writel(qi-free_head 4, iommu-reg + DMAR_IQT_REG); - spin_unlock(iommu-register_lock); + writel(qi-free_head DMAR_IQ_OFFSET, iommu-reg + DMAR_IQT_REG); while (qi-desc_status[wait_index] != QI_DONE) { /* @@ -728,6 +757,10 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) * a deadlock where the interrupt context can wait indefinitely * for free slots in the queue. 
*/ + rc = qi_check_fault(iommu, index); + if (rc) + break; + spin_unlock(qi-q_lock); cpu_relax(); spin_lock(qi-q_lock); @@ -737,6 +770,8 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) reclaim_free_desc(qi); spin_unlock_irqrestore(qi-q_lock, flags); + + return rc; } /* @@ -749,13 +784,13 @@ void qi_global_iec(struct intel_iommu *iommu) desc.low = QI_IEC_TYPE; desc.high = 0; + /* should never fail */ qi_submit_sync(desc, iommu); } int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, u64 type, int non_present_entry_flush) { - struct qi_desc desc; if (non_present_entry_flush) { @@ -769,10 +804,7 @@ int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, | QI_CC_GRAN(type) | QI_CC_TYPE; desc.high = 0; - qi_submit_sync(desc, iommu); - - return 0; - + return qi_submit_sync(desc, iommu); } int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, @@ -802,10 +834,7 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, desc.high = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih) | QI_IOTLB_AM(size_order); - qi_submit_sync(desc, iommu); - - return 0; - + return qi_submit_sync(desc, iommu); } /* diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c index f78371b..45effc5 100644 --- a/drivers/pci/intr_remapping.c +++ b/drivers/pci/intr_remapping.c
[PATCH v2 6/6] VT-d: support the device IOTLB
Support device IOTLB (i.e. ATS) for both native and KVM environments. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/intel-iommu.c | 97 +- include/linux/intel-iommu.h |1 + 2 files changed, 95 insertions(+), 3 deletions(-) diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index df92764..fb84d82 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -125,6 +125,7 @@ static inline void context_set_fault_enable(struct context_entry *context) } #define CONTEXT_TT_MULTI_LEVEL 0 +#define CONTEXT_TT_DEV_IOTLB 1 static inline void context_set_translation_type(struct context_entry *context, unsigned long value) @@ -240,6 +241,8 @@ struct device_domain_info { struct list_head global; /* link to global list */ u8 bus; /* PCI bus numer */ u8 devfn; /* PCI devfn number */ + int qdep; /* invalidate queue depth */ + struct intel_iommu *iommu; /* IOMMU used by this device */ struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */ struct dmar_domain *domain; /* pointer to domain */ }; @@ -914,6 +917,75 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, return 0; } +static struct device_domain_info * +iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn) +{ + int found = 0; + unsigned long flags; + struct device_domain_info *info; + struct intel_iommu *iommu = device_to_iommu(bus, devfn); + + if (!ecap_dev_iotlb_support(iommu-ecap)) + return NULL; + + if (!iommu-qi) + return NULL; + + spin_lock_irqsave(device_domain_lock, flags); + list_for_each_entry(info, domain-devices, link) + if (info-dev info-bus == bus info-devfn == devfn) { + found = 1; + break; + } + spin_unlock_irqrestore(device_domain_lock, flags); + + if (!found) + return NULL; + + if (!dmar_find_matched_atsr_unit(info-dev)) + return NULL; + + info-iommu = iommu; + info-qdep = pci_ats_qdep(info-dev); + if (!info-qdep) + return NULL; + + return info; +} + +static void iommu_enable_dev_iotlb(struct device_domain_info *info) +{ + 
pci_enable_ats(info-dev, VTD_PAGE_SHIFT); +} + +static void iommu_disable_dev_iotlb(struct device_domain_info *info) +{ + if (info-dev pci_ats_enabled(info-dev)) + pci_disable_ats(info-dev); +} + +static void iommu_flush_dev_iotlb(struct dmar_domain *domain, + u64 addr, unsigned int mask) +{ + int rc; + u16 sid; + unsigned long flags; + struct device_domain_info *info; + + spin_lock_irqsave(device_domain_lock, flags); + list_for_each_entry(info, domain-devices, link) { + if (!info-dev || !pci_ats_enabled(info-dev)) + continue; + + sid = info-bus 8 | info-devfn; + rc = qi_flush_dev_iotlb(info-iommu, sid, + info-qdep, addr, mask); + if (rc) + printk(KERN_ERR IOMMU: flush device IOTLB failed\n); + } + spin_unlock_irqrestore(device_domain_lock, flags); +} + static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, u64 addr, unsigned int pages, int non_present_entry_flush) { @@ -937,6 +1009,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, rc = iommu-flush.flush_iotlb(iommu, did, addr, mask, DMA_TLB_PSI_FLUSH, non_present_entry_flush); + if (!rc !non_present_entry_flush) + iommu_flush_dev_iotlb(iommu-domains[did], addr, mask); + return rc; } @@ -1461,6 +1536,7 @@ static int domain_context_mapping_one(struct dmar_domain *domain, unsigned long ndomains; int id; int agaw; + struct device_domain_info *info; pr_debug(Set context mapping for %02x:%02x.%d\n, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); @@ -1526,7 +1602,11 @@ static int domain_context_mapping_one(struct dmar_domain *domain, context_set_domain_id(context, id); context_set_address_width(context, iommu-agaw); context_set_address_root(context, virt_to_phys(pgd)); - context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL); + info = iommu_support_dev_iotlb(domain, bus, devfn); + if (info) + context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB); + else + context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL); context_set_fault_enable(context); context_set_present(context
[PATCH v2 4/6] VT-d: add device IOTLB invalidation support
Support device IOTLB invalidation to flush the translation cached in the Endpoint. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 63 -- include/linux/intel-iommu.h | 13 - 2 files changed, 72 insertions(+), 4 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index 0c87ebd..4fea360 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -664,7 +664,8 @@ void free_iommu(struct intel_iommu *iommu) */ static inline void reclaim_free_desc(struct q_inval *qi) { - while (qi-desc_status[qi-free_tail] == QI_DONE) { + while (qi-desc_status[qi-free_tail] == QI_DONE || + qi-desc_status[qi-free_tail] == QI_ABORT) { qi-desc_status[qi-free_tail] = QI_FREE; qi-free_tail = (qi-free_tail + 1) % QI_LENGTH; qi-free_cnt++; @@ -674,10 +675,13 @@ static inline void reclaim_free_desc(struct q_inval *qi) static int qi_check_fault(struct intel_iommu *iommu, int index) { u32 fault; - int head; + int head, tail; struct q_inval *qi = iommu-qi; int wait_index = (index + 1) % QI_LENGTH; + if (qi-desc_status[wait_index] == QI_ABORT) + return -EAGAIN; + fault = readl(iommu-reg + DMAR_FSTS_REG); /* @@ -697,6 +701,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int index) } } + /* +* If ITE happens, all pending wait_desc commands are aborted. +* No new descriptors are fetched until the ITE is cleared. 
+*/ + if (fault DMA_FSTS_ITE) { + head = readl(iommu-reg + DMAR_IQH_REG); + head = ((head DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH; + head |= 1; + tail = readl(iommu-reg + DMAR_IQT_REG); + tail = ((tail DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH; + + writel(DMA_FSTS_ITE, iommu-reg + DMAR_FSTS_REG); + + do { + if (qi-desc_status[head] == QI_IN_USE) + qi-desc_status[head] = QI_ABORT; + head = (head - 2 + QI_LENGTH) % QI_LENGTH; + } while (head != tail); + + if (qi-desc_status[wait_index] == QI_ABORT) + return -EAGAIN; + } + + if (fault DMA_FSTS_ICE) + writel(DMA_FSTS_ICE, iommu-reg + DMAR_FSTS_REG); + return 0; } @@ -706,7 +736,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index) */ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) { - int rc = 0; + int rc; struct q_inval *qi = iommu-qi; struct qi_desc *hw, wait_desc; int wait_index, index; @@ -717,6 +747,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) hw = qi-desc; +restart: + rc = 0; + spin_lock_irqsave(qi-q_lock, flags); while (qi-free_cnt 3) { spin_unlock_irqrestore(qi-q_lock, flags); @@ -771,6 +804,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) reclaim_free_desc(qi); spin_unlock_irqrestore(qi-q_lock, flags); + if (rc == -EAGAIN) + goto restart; + return rc; } @@ -837,6 +873,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, return qi_submit_sync(desc, iommu); } +int qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, int qdep, + u64 addr, unsigned int mask) +{ + struct qi_desc desc; + + if (mask) { + BUG_ON(addr ((1 (VTD_PAGE_SHIFT + mask)) - 1)); + addr |= (1 (VTD_PAGE_SHIFT + mask - 1)) - 1; + desc.high = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE; + } else + desc.high = QI_DEV_IOTLB_ADDR(addr); + + if (qdep = QI_DEV_IOTLB_MAX_INVS) + qdep = 0; + + desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) | + QI_DIOTLB_TYPE; + + return qi_submit_sync(desc, iommu); +} + /* * Enable Queued 
Invalidation interface. This is a must to support * interrupt-remapping. Also used by DMA-remapping, which replaces diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 0a220c9..d82bdac 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -196,6 +196,8 @@ static inline void dmar_writeq(void __iomem *addr, u64 val) #define DMA_FSTS_PPF ((u32)2) #define DMA_FSTS_PFO ((u32)1) #define DMA_FSTS_IQE (1 4) +#define DMA_FSTS_ICE (1 5) +#define DMA_FSTS_ITE (1 6) #define dma_fsts_fault_record_index(s) (((s) 8) 0xff) /* FRCD_REG, 32 bits access */ @@ -224,7 +226,8 @@ do { \ enum { QI_FREE, QI_IN_USE, - QI_DONE + QI_DONE, + QI_ABORT }; #define
[PATCH v2 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
Make iommu_flush_iotlb_psi() and flush_unmaps() easier to read. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/intel-iommu.c | 46 +--- 1 files changed, 22 insertions(+), 24 deletions(-) diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index 3dfecb2..df92764 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -917,30 +917,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, u64 addr, unsigned int pages, int non_present_entry_flush) { - unsigned int mask; + int rc; + unsigned int mask = ilog2(__roundup_pow_of_two(pages)); BUG_ON(addr (~VTD_PAGE_MASK)); BUG_ON(pages == 0); - /* Fallback to domain selective flush if no PSI support */ - if (!cap_pgsel_inv(iommu-cap)) - return iommu-flush.flush_iotlb(iommu, did, 0, 0, - DMA_TLB_DSI_FLUSH, - non_present_entry_flush); - /* +* Fallback to domain selective flush if no PSI support or the size is +* too big. * PSI requires page size to be 2 ^ x, and the base address is naturally * aligned to the size */ - mask = ilog2(__roundup_pow_of_two(pages)); - /* Fallback to domain selective flush if size is too big */ - if (mask cap_max_amask_val(iommu-cap)) - return iommu-flush.flush_iotlb(iommu, did, 0, 0, - DMA_TLB_DSI_FLUSH, non_present_entry_flush); - - return iommu-flush.flush_iotlb(iommu, did, addr, mask, - DMA_TLB_PSI_FLUSH, - non_present_entry_flush); + if (!cap_pgsel_inv(iommu-cap) || mask cap_max_amask_val(iommu-cap)) + rc = iommu-flush.flush_iotlb(iommu, did, 0, 0, + DMA_TLB_DSI_FLUSH, + non_present_entry_flush); + else + rc = iommu-flush.flush_iotlb(iommu, did, addr, mask, + DMA_TLB_PSI_FLUSH, + non_present_entry_flush); + return rc; } static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu) @@ -2293,15 +2290,16 @@ static void flush_unmaps(void) if (!iommu) continue; - if (deferred_flush[i].next) { - iommu-flush.flush_iotlb(iommu, 0, 0, 0, -DMA_TLB_GLOBAL_FLUSH, 0); - for 
(j = 0; j < deferred_flush[i].next; j++) { - __free_iova(&deferred_flush[i].domain[j]->iovad, - deferred_flush[i].iova[j]); - } - deferred_flush[i].next = 0; + if (!deferred_flush[i].next) + continue; + + iommu->flush.flush_iotlb(iommu, 0, 0, 0, +DMA_TLB_GLOBAL_FLUSH, 0); + for (j = 0; j < deferred_flush[i].next; j++) { + __free_iova(&deferred_flush[i].domain[j]->iovad, + deferred_flush[i].iova[j]); } + deferred_flush[i].next = 0; } list_size = 0; -- 1.5.6.4
[PATCH 0/6] ATS capability support for Intel IOMMU
This patch series implements Address Translation Service (ATS) support for the Intel IOMMU. ATS lets a PCIe Endpoint request DMA address translations from the IOMMU and cache them locally, which relieves pressure on the IOMMU TLB and improves hardware performance in I/O virtualization environments. [PATCH 1/6] PCI: support the ATS capability [PATCH 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure [PATCH 3/6] VT-d: add queue invalidation fault status support [PATCH 4/6] VT-d: add device IOTLB invalidation support [PATCH 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps [PATCH 6/6] VT-d: support the device IOTLB
[PATCH 1/6] PCI: support the ATS capability
The ATS spec can be found at http://www.pcisig.com/specifications/iov/ats/ (it requires membership). Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/pci.c| 68 ++ include/linux/pci.h | 15 ++ include/linux/pci_regs.h | 10 +++ 3 files changed, 93 insertions(+), 0 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 061d1ee..5abab14 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1337,6 +1337,74 @@ void pci_enable_ari(struct pci_dev *dev) bridge-ari_enabled = 1; } +/** + * pci_enable_ats - enable the ATS capability + * @dev: the PCI device + * @ps: the IOMMU page shift + * + * Returns 0 on success, or a negative value on error. + */ +int pci_enable_ats(struct pci_dev *dev, int ps) +{ + int pos; + u16 ctrl; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS); + if (!pos) + return -ENODEV; + + if (ps PCI_ATS_MIN_STU) + return -EINVAL; + + ctrl = PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU) | PCI_ATS_CTRL_ENABLE; + pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl); + + dev-ats_enabled = 1; + + return 0; +} + +/** + * pci_disable_ats - disable the ATS capability + * @dev: the PCI device + */ +void pci_disable_ats(struct pci_dev *dev) +{ + int pos; + u16 ctrl; + + if (!dev-ats_enabled) + return; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS); + if (!pos) + return; + + pci_read_config_word(dev, pos + PCI_ATS_CTRL, ctrl); + ctrl = ~PCI_ATS_CTRL_ENABLE; + pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl); +} + +/** + * pci_ats_qdep - query ATS Invalidate Queue Depth + * @dev: the PCI device + * + * Returns the queue depth on success, or 0 on error. + */ +int pci_ats_qdep(struct pci_dev *dev) +{ + int pos; + u16 cap; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS); + if (!pos) + return 0; + + pci_read_config_word(dev, pos + PCI_ATS_CAP, cap); + + return PCI_ATS_CAP_QDEP(cap) ? 
: PCI_ATS_MAX_QDEP; +} + int pci_get_interrupt_pin(struct pci_dev *dev, struct pci_dev **bridge) { diff --git a/include/linux/pci.h b/include/linux/pci.h index 4bb156b..e6a1b5a 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -227,6 +227,7 @@ struct pci_dev { unsigned intmsi_enabled:1; unsigned intmsix_enabled:1; unsigned intari_enabled:1; /* ARI forwarding */ + unsigned intats_enabled:1; /* Address Translation Service */ unsigned intis_managed:1; unsigned intis_pcie:1; pci_dev_flags_t dev_flags; @@ -1155,5 +1156,19 @@ static inline void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar) } #endif +extern int pci_enable_ats(struct pci_dev *dev, int ps); +extern void pci_disable_ats(struct pci_dev *dev); +extern int pci_ats_qdep(struct pci_dev *dev); +/** + * pci_ats_enabled - query the ATS status + * @dev: the PCI device + * + * Returns 1 if ATS capability is enabled, or 0 if not. + */ +static inline int pci_ats_enabled(struct pci_dev *dev) +{ + return dev-ats_enabled; +} + #endif /* __KERNEL__ */ #endif /* LINUX_PCI_H */ diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h index e5effd4..00c9db5 100644 --- a/include/linux/pci_regs.h +++ b/include/linux/pci_regs.h @@ -436,6 +436,7 @@ #define PCI_EXT_CAP_ID_DSN 3 #define PCI_EXT_CAP_ID_PWR 4 #define PCI_EXT_CAP_ID_ARI 14 +#define PCI_EXT_CAP_ID_ATS 15 /* Advanced Error Reporting */ #define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */ @@ -553,4 +554,13 @@ #define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */ #define PCI_ARI_CTRL_FG(x)(((x) 4) 7) /* Function Group */ +/* Address Translation Service */ +#define PCI_ATS_CAP0x04/* ATS Capability Register */ +#define PCI_ATS_CAP_QDEP(x) ((x) 0x1f)/* Invalidate Queue Depth */ +#define PCI_ATS_MAX_QDEP 32 /* Max Invalidate Queue Depth */ +#define PCI_ATS_CTRL 0x06/* ATS Control Register */ +#define PCI_ATS_CTRL_ENABLE 0x8000 /* ATS Enable */ +#define PCI_ATS_CTRL_STU(x) ((x) 0x1f)/* Smallest Translation Unit */ +#define 
PCI_ATS_MIN_STU 12 /* shift of minimum STU block */ + #endif /* LINUX_PCI_REGS_H */ -- 1.5.6.4
[PATCH 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure
Parse the Root Port ATS Capability Reporting Structure in DMA Remapping Reporting Structure ACPI table. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 114 -- include/linux/dmar.h|9 +++ include/linux/intel-iommu.h |1 + 3 files changed, 118 insertions(+), 6 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index f5a662a..f2859d1 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -254,6 +254,86 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru) } return ret; } + +LIST_HEAD(dmar_atsr_units); + +static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr) +{ + struct acpi_dmar_atsr *atsr; + struct dmar_atsr_unit *atsru; + + atsr = container_of(hdr, struct acpi_dmar_atsr, header); + atsru = kzalloc(sizeof(*atsru), GFP_KERNEL); + if (!atsru) + return -ENOMEM; + + atsru-hdr = hdr; + atsru-include_all = atsr-flags 0x1; + + if (atsru-include_all) + list_add_tail(atsru-list, dmar_atsr_units); + else + list_add(atsru-list, dmar_atsr_units); + + return 0; +} + +static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru) +{ + int ret = 0; + struct acpi_dmar_atsr *atsr; + + atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header); + if (!atsru-include_all) + ret = dmar_parse_dev_scope((void *)(atsr + 1), + (void *)atsr + atsr-header.length, + atsru-devices_cnt, atsru-devices, + atsr-segment); + + if (ret || !(atsru-include_all || atsru-devices_cnt)) { + list_del(atsru-list); + kfree(atsru); + } + + return ret; +} + +int dmar_find_matched_atsr_unit(struct pci_dev *dev) +{ + int i; + struct pci_bus *bus; + struct acpi_dmar_atsr *atsr; + struct dmar_atsr_unit *atsru; + + list_for_each_entry(atsru, dmar_atsr_units, list) { + atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header); + if (atsr-segment == pci_domain_nr(dev-bus)) + goto found; + } + + return 0; + +found: + for (bus = dev-bus; bus; bus = bus-parent) { + struct pci_dev *bridge = bus-self; + + if (!bridge || !bridge-is_pcie || + bridge-pcie_type == 
PCI_EXP_TYPE_PCI_BRIDGE) + return 0; + + if (bridge-pcie_type == PCI_EXP_TYPE_ROOT_PORT) { + for (i = 0; i atsru-devices_cnt; i++) + if (atsru-devices[i] == bridge) + return 1; + break; + } + } + + if (atsru-include_all) + return 1; + + return 0; +} #endif static void __init @@ -261,22 +341,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header) { struct acpi_dmar_hardware_unit *drhd; struct acpi_dmar_reserved_memory *rmrr; + struct acpi_dmar_atsr *atsr; switch (header-type) { case ACPI_DMAR_TYPE_HARDWARE_UNIT: - drhd = (struct acpi_dmar_hardware_unit *)header; + drhd = container_of(header, struct acpi_dmar_hardware_unit, + header); printk (KERN_INFO PREFIX - DRHD (flags: 0x%08x)base: 0x%016Lx\n, - drhd-flags, (unsigned long long)drhd-address); + DRHD base: %#016Lx flags: %#x\n, + (unsigned long long)drhd-address, drhd-flags); break; case ACPI_DMAR_TYPE_RESERVED_MEMORY: - rmrr = (struct acpi_dmar_reserved_memory *)header; - + rmrr = container_of(header, struct acpi_dmar_reserved_memory, + header); printk (KERN_INFO PREFIX - RMRR base: 0x%016Lx end: 0x%016Lx\n, + RMRR base: %#016Lx end: %#016Lx\n, (unsigned long long)rmrr-base_address, (unsigned long long)rmrr-end_address); break; + case ACPI_DMAR_TYPE_ATSR: + atsr = container_of(header, struct acpi_dmar_atsr, header); + printk(KERN_INFO PREFIX ATSR flags: %#x\n, atsr-flags); + break; } } @@ -341,6 +427,11 @@ parse_dmar_table(void) ret = dmar_parse_one_rmrr(entry_header); #endif break; + case ACPI_DMAR_TYPE_ATSR: +#ifdef CONFIG_DMAR + ret = dmar_parse_one_atsr(entry_header); +#endif + break; default: printk(KERN_WARNING PREFIX Unknown DMAR structure type\n
[PATCH 3/6] VT-d: add queue invalidation fault status support
Check the fault register after submitting a queued invalidation request. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 59 +++-- drivers/pci/intr_remapping.c | 21 -- include/linux/intel-iommu.h |4 ++- 3 files changed, 59 insertions(+), 25 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index f2859d1..eb77258 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -673,19 +673,49 @@ static inline void reclaim_free_desc(struct q_inval *qi) } } +static int qi_check_fault(struct intel_iommu *iommu, int index) +{ + u32 fault; + int head; + struct q_inval *qi = iommu->qi; + int wait_index = (index + 1) % QI_LENGTH; + + fault = readl(iommu->reg + DMAR_FSTS_REG); + + /* +* If IQE happens, the head points to the descriptor associated +* with the error. No new descriptors are fetched until the IQE +* is cleared. +*/ + if (fault & DMA_FSTS_IQE) { + head = readl(iommu->reg + DMAR_IQH_REG); + if ((head >> DMAR_IQ_OFFSET) == index) { + memcpy(&qi->desc[index], &qi->desc[wait_index], + sizeof(struct qi_desc)); + __iommu_flush_cache(iommu, &qi->desc[index], + sizeof(struct qi_desc)); + writel(DMA_FSTS_IQE, iommu->reg + DMAR_FSTS_REG); + return -EINVAL; + } + } + + return 0; +} + /* * Submit the queued invalidation descriptor to the remapping * hardware unit and wait for its completion. 
*/ -void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) +int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) { + int rc = 0; struct q_inval *qi = iommu-qi; struct qi_desc *hw, wait_desc; int wait_index, index; unsigned long flags; if (!qi) - return; + return 0; hw = qi-desc; @@ -703,7 +733,8 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) hw[index] = *desc; - wait_desc.low = QI_IWD_STATUS_DATA(2) | QI_IWD_STATUS_WRITE | QI_IWD_TYPE; + wait_desc.low = QI_IWD_STATUS_DATA(QI_DONE) | + QI_IWD_STATUS_WRITE | QI_IWD_TYPE; wait_desc.high = virt_to_phys(qi-desc_status[wait_index]); hw[wait_index] = wait_desc; @@ -714,13 +745,11 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) qi-free_head = (qi-free_head + 2) % QI_LENGTH; qi-free_cnt -= 2; - spin_lock(iommu-register_lock); /* * update the HW tail register indicating the presence of * new descriptors. */ - writel(qi-free_head 4, iommu-reg + DMAR_IQT_REG); - spin_unlock(iommu-register_lock); + writel(qi-free_head DMAR_IQ_OFFSET, iommu-reg + DMAR_IQT_REG); while (qi-desc_status[wait_index] != QI_DONE) { /* @@ -730,6 +759,10 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) * a deadlock where the interrupt context can wait indefinitely * for free slots in the queue. 
*/ + rc = qi_check_fault(iommu, index); + if (rc) + break; + spin_unlock(qi-q_lock); cpu_relax(); spin_lock(qi-q_lock); @@ -739,6 +772,8 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) reclaim_free_desc(qi); spin_unlock_irqrestore(qi-q_lock, flags); + + return rc; } /* @@ -751,13 +786,13 @@ void qi_global_iec(struct intel_iommu *iommu) desc.low = QI_IEC_TYPE; desc.high = 0; + /* should never fail */ qi_submit_sync(desc, iommu); } int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, u64 type, int non_present_entry_flush) { - struct qi_desc desc; if (non_present_entry_flush) { @@ -771,10 +806,7 @@ int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, | QI_CC_GRAN(type) | QI_CC_TYPE; desc.high = 0; - qi_submit_sync(desc, iommu); - - return 0; - + return qi_submit_sync(desc, iommu); } int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, @@ -804,10 +836,7 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, desc.high = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih) | QI_IOTLB_AM(size_order); - qi_submit_sync(desc, iommu); - - return 0; - + return qi_submit_sync(desc, iommu); } /* diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c index f78371b..45effc5 100644 --- a/drivers/pci/intr_remapping.c +++ b/drivers/pci/intr_remapping.c
[PATCH 4/6] VT-d: add device IOTLB invalidation support
Support device IOTLB invalidation to flush the translation cached in the Endpoint. Signed-off-by: Yu Zhao yu.z...@intel.com --- drivers/pci/dmar.c | 63 -- include/linux/intel-iommu.h | 13 - 2 files changed, 72 insertions(+), 4 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index eb77258..88f6b1f 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -666,7 +666,8 @@ void free_iommu(struct intel_iommu *iommu) */ static inline void reclaim_free_desc(struct q_inval *qi) { - while (qi-desc_status[qi-free_tail] == QI_DONE) { + while (qi-desc_status[qi-free_tail] == QI_DONE || + qi-desc_status[qi-free_tail] == QI_ABORT) { qi-desc_status[qi-free_tail] = QI_FREE; qi-free_tail = (qi-free_tail + 1) % QI_LENGTH; qi-free_cnt++; @@ -676,10 +677,13 @@ static inline void reclaim_free_desc(struct q_inval *qi) static int qi_check_fault(struct intel_iommu *iommu, int index) { u32 fault; - int head; + int head, tail; struct q_inval *qi = iommu-qi; int wait_index = (index + 1) % QI_LENGTH; + if (qi-desc_status[wait_index] == QI_ABORT) + return -EAGAIN; + fault = readl(iommu-reg + DMAR_FSTS_REG); /* @@ -699,6 +703,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int index) } } + /* +* If ITE happens, all pending wait_desc commands are aborted. +* No new descriptors are fetched until the ITE is cleared. 
+*/ + if (fault & DMA_FSTS_ITE) { + head = readl(iommu->reg + DMAR_IQH_REG); + head = ((head >> DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH; + head |= 1; + tail = readl(iommu->reg + DMAR_IQT_REG); + tail = ((tail >> DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH; + + writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); + + do { + if (qi->desc_status[head] == QI_IN_USE) + qi->desc_status[head] = QI_ABORT; + head = (head - 2 + QI_LENGTH) % QI_LENGTH; + } while (head != tail); + + if (qi->desc_status[wait_index] == QI_ABORT) + return -EAGAIN; + } + + if (fault & DMA_FSTS_ICE) + writel(DMA_FSTS_ICE, iommu->reg + DMAR_FSTS_REG); + return 0; } @@ -708,7 +738,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index) */ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) { - int rc = 0; + int rc; struct q_inval *qi = iommu->qi; struct qi_desc *hw, wait_desc; int wait_index, index; @@ -719,6 +749,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) hw = qi->desc; +restart: + rc = 0; + spin_lock_irqsave(&qi->q_lock, flags); while (qi->free_cnt < 3) { spin_unlock_irqrestore(&qi->q_lock, flags); @@ -773,6 +806,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu) reclaim_free_desc(qi); spin_unlock_irqrestore(&qi->q_lock, flags); + if (rc == -EAGAIN) + goto restart; + return rc; } @@ -839,6 +875,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, return qi_submit_sync(&desc, iommu); } +int qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, int qdep, + u64 addr, unsigned int mask) +{ + struct qi_desc desc; + + if (mask) { + BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1)); + addr |= (1 << (VTD_PAGE_SHIFT + mask - 1)) - 1; + desc.high = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE; + } else + desc.high = QI_DEV_IOTLB_ADDR(addr); + + if (qdep >= QI_DEV_IOTLB_MAX_INVS) + qdep = 0; + + desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) | + QI_DIOTLB_TYPE; + + return qi_submit_sync(&desc, iommu); +} + /* * Enable Queued 
Invalidation interface. This is a must to support * interrupt-remapping. Also used by DMA-remapping, which replaces diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 0a220c9..d82bdac 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -196,6 +196,8 @@ static inline void dmar_writeq(void __iomem *addr, u64 val) #define DMA_FSTS_PPF ((u32)2) #define DMA_FSTS_PFO ((u32)1) #define DMA_FSTS_IQE (1 4) +#define DMA_FSTS_ICE (1 5) +#define DMA_FSTS_ITE (1 6) #define dma_fsts_fault_record_index(s) (((s) 8) 0xff) /* FRCD_REG, 32 bits access */ @@ -224,7 +226,8 @@ do { \ enum { QI_FREE, QI_IN_USE, - QI_DONE + QI_DONE, + QI_ABORT }; #define
[PATCH 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
Make iommu_flush_iotlb_psi() and flush_unmaps() easier to read.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/intel-iommu.c |   46 ++++++++++++++++++++-----------------------
 1 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 235fb7a..261b6bd 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -916,30 +916,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 	u64 addr, unsigned int pages, int non_present_entry_flush)
 {
-	unsigned int mask;
+	int rc;
+	unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 
 	BUG_ON(addr & (~VTD_PAGE_MASK));
 	BUG_ON(pages == 0);
 
-	/* Fallback to domain selective flush if no PSI support */
-	if (!cap_pgsel_inv(iommu->cap))
-		return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-						DMA_TLB_DSI_FLUSH,
-						non_present_entry_flush);
-
 	/*
+	 * Fallback to domain selective flush if no PSI support or the size is
+	 * too big.
 	 * PSI requires page size to be 2 ^ x, and the base address is naturally
 	 * aligned to the size
 	 */
-	mask = ilog2(__roundup_pow_of_two(pages));
-	/* Fallback to domain selective flush if size is too big */
-	if (mask > cap_max_amask_val(iommu->cap))
-		return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-			DMA_TLB_DSI_FLUSH, non_present_entry_flush);
-
-	return iommu->flush.flush_iotlb(iommu, did, addr, mask,
-					DMA_TLB_PSI_FLUSH,
-					non_present_entry_flush);
+	if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
+		rc = iommu->flush.flush_iotlb(iommu, did, 0, 0,
+					      DMA_TLB_DSI_FLUSH,
+					      non_present_entry_flush);
+	else
+		rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
+					      DMA_TLB_PSI_FLUSH,
+					      non_present_entry_flush);
+	return rc;
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2292,15 +2289,16 @@ static void flush_unmaps(void)
 		if (!iommu)
 			continue;
 
-		if (deferred_flush[i].next) {
-			iommu->flush.flush_iotlb(iommu, 0, 0, 0,
-						 DMA_TLB_GLOBAL_FLUSH, 0);
-			for (j = 0; j < deferred_flush[i].next; j++) {
-				__free_iova(&deferred_flush[i].domain[j]->iovad,
-					    deferred_flush[i].iova[j]);
-			}
-			deferred_flush[i].next = 0;
+		if (!deferred_flush[i].next)
+			continue;
+
+		iommu->flush.flush_iotlb(iommu, 0, 0, 0,
+					 DMA_TLB_GLOBAL_FLUSH, 0);
+		for (j = 0; j < deferred_flush[i].next; j++) {
+			__free_iova(&deferred_flush[i].domain[j]->iovad,
+				    deferred_flush[i].iova[j]);
 		}
+		deferred_flush[i].next = 0;
 	}
 
 	list_size = 0;
-- 
1.5.6.4
[PATCH 6/6] VT-d: support the device IOTLB
Support device IOTLB (i.e. ATS) for both native and KVM environments.

Signed-off-by: Yu Zhao <yu.z...@intel.com>
---
 drivers/pci/intel-iommu.c   |   97 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/intel-iommu.h |    1 +
 2 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 261b6bd..a7ff7cb 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -125,6 +125,7 @@ static inline void context_set_fault_enable(struct context_entry *context)
 }
 
 #define CONTEXT_TT_MULTI_LEVEL	0
+#define CONTEXT_TT_DEV_IOTLB	1
 
 static inline void context_set_translation_type(struct context_entry *context,
 						unsigned long value)
@@ -240,6 +241,8 @@ struct device_domain_info {
 	struct list_head global; /* link to global list */
 	u8 bus;			/* PCI bus number */
 	u8 devfn;		/* PCI devfn number */
+	int qdep;		/* invalidate queue depth */
+	struct intel_iommu *iommu; /* IOMMU used by this device */
 	struct pci_dev *dev;	/* it's NULL for PCIE-to-PCI bridge */
 	struct dmar_domain *domain; /* pointer to domain */
 };
@@ -913,6 +916,75 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 	return 0;
 }
 
+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+	int found = 0;
+	unsigned long flags;
+	struct device_domain_info *info;
+	struct intel_iommu *iommu = device_to_iommu(bus, devfn);
+
+	if (!ecap_dev_iotlb_support(iommu->ecap))
+		return NULL;
+
+	if (!iommu->qi)
+		return NULL;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link)
+		if (info->dev && info->bus == bus && info->devfn == devfn) {
+			found = 1;
+			break;
+		}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	if (!found)
+		return NULL;
+
+	if (!dmar_find_matched_atsr_unit(info->dev))
+		return NULL;
+
+	info->iommu = iommu;
+	info->qdep = pci_ats_qdep(info->dev);
+	if (!info->qdep)
+		return NULL;
+
+	return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+	pci_enable_ats(info->dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+	if (info->dev && pci_ats_enabled(info->dev))
+		pci_disable_ats(info->dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+				  u64 addr, unsigned int mask)
+{
+	int rc;
+	u16 sid;
+	unsigned long flags;
+	struct device_domain_info *info;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link) {
+		if (!info->dev || !pci_ats_enabled(info->dev))
+			continue;
+
+		sid = info->bus << 8 | info->devfn;
+		rc = qi_flush_dev_iotlb(info->iommu, sid,
+					info->qdep, addr, mask);
+		if (rc)
+			printk(KERN_ERR "IOMMU: flush device IOTLB failed\n");
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 	u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -936,6 +1008,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 	rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
 				      DMA_TLB_PSI_FLUSH,
 				      non_present_entry_flush);
+	if (!rc && !non_present_entry_flush)
+		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
+
 	return rc;
 }
 
@@ -1460,6 +1535,7 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 	unsigned long ndomains;
 	int id;
 	int agaw;
+	struct device_domain_info *info;
 
 	pr_debug("Set context mapping for %02x:%02x.%d\n",
 		 bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1525,7 +1601,11 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 	context_set_domain_id(context, id);
 	context_set_address_width(context, iommu->agaw);
 	context_set_address_root(context, virt_to_phys(pgd));
-	context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+	info = iommu_support_dev_iotlb(domain, bus, devfn);
+	if (info)
+		context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB);
+	else
+		context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
 	context_set_fault_enable(context);
 	context_set_present(context
Re: [SR-IOV driver example 0/3] introduction
On Thu, Nov 27, 2008 at 04:14:48AM +0800, Jeff Garzik wrote:
> Yu Zhao wrote:
> > SR-IOV drivers for the Intel 82576 NIC are available. There are two
> > parts of the drivers: the Physical Function driver and the Virtual
> > Function driver. The PF driver is based on the IGB driver and is used
> > to control the PF to allocate hardware specific resources and to
> > interface with the SR-IOV core. The VF driver is a new NIC driver
> > that is the same as a traditional PCI device driver. It works in both
> > the host and the guest (Xen and KVM) environments.
> >
> > These two drivers are testing versions and they are *only* intended
> > to show how to use the SR-IOV API.
> >
> > Intel 82576 NIC specification can be found at:
> > http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
> >
> > [SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
> > [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
> > [SR-IOV driver example 3/3] VF driver tar ball
>
> Please copy [EMAIL PROTECTED] on all network-related patches. This is
> where the network developers live, and all patches on this list are
> automatically archived for review and handling at
> http://patchwork.ozlabs.org/project/netdev/list/

Will do.

Thanks,
Yu
Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
On Thu, Nov 27, 2008 at 12:58:59AM +0800, Greg KH wrote:
> On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> > +			my_mac_addr[5] = (unsigned char)i;
> > +			igb_set_vf_mac(netdev, i, my_mac_addr);
> > +			igb_set_vf_vmolr(adapter, i);
> > +		}
> > +	} else
> > +		printk(KERN_INFO "SR-IOV is disabled\n");
>
> Is that really true? (oh, use dev_info as well.)
>
> What happens if you had called this with 5 and then later with 0? You
> never destroyed those existing virtual functions, yet the code does:
>
> > +	adapter->vfs_allocated_count = nr_virtfn;
>
> Which makes the driver think they are not present. What happens when
> the driver later goes to shut down? Are those resources freed up
> properly?

For now we hard-code the tx/rx queue allocation, so this doesn't matter.
Eventually this will become dynamic allocation: when the number of VFs
changes, the corresponding resources need to be freed. I'll put more
comments here.

Thanks,
Yu
Re: [SR-IOV driver example 0/3] introduction
On Thu, Nov 27, 2008 at 12:59:33AM +0800, Greg KH wrote:
> On Wed, Nov 26, 2008 at 10:03:03PM +0800, Yu Zhao wrote:
> > SR-IOV drivers for the Intel 82576 NIC are available. There are two
> > parts of the drivers: the Physical Function driver and the Virtual
> > Function driver. The PF driver is based on the IGB driver and is used
> > to control the PF to allocate hardware specific resources and to
> > interface with the SR-IOV core. The VF driver is a new NIC driver
> > that is the same as a traditional PCI device driver. It works in both
> > the host and the guest (Xen and KVM) environments.
> >
> > These two drivers are testing versions and they are *only* intended
> > to show how to use the SR-IOV API.
>
> That's funny, as some distros are already shipping this driver. You
> might want to tell them that this is an example-only driver and not to
> be used for real... :(

Maybe they are shipping another version, not this one. This one is
really an experimental patch; it was created just a week ago...
[SR-IOV driver example 0/3 resend] introduction
SR-IOV drivers for the Intel 82576 NIC are available. There are two parts
of the drivers: the Physical Function driver and the Virtual Function
driver. The PF driver is based on the IGB driver and is used to control
the PF to allocate hardware specific resources and to interface with the
SR-IOV core. The VF driver is a new NIC driver that is the same as a
traditional PCI device driver. It works in both the host and the guest
(Xen and KVM) environments.

These two drivers are testing versions and they are *only* intended to
show how to use the SR-IOV API.

Intel 82576 NIC specification can be found at:
http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

[SR-IOV driver example 0/3 resend] introduction
[SR-IOV driver example 1/3 resend] PF driver: hardware specific operations
[SR-IOV driver example 2/3 resend] PF driver: integrate with SR-IOV core
[SR-IOV driver example 3/3 resend] VF driver: an independent PCI NIC driver