Re: [PATCH v5 6/6] VT-d: support the device IOTLB

2009-06-28 Thread Yu Zhao

David Woodhouse wrote:

On Mon, 2009-05-18 at 13:51 +0800, Yu Zhao wrote:

@@ -965,6 +1037,8 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
	else
		iommu->flush.flush_iotlb(iommu, did, addr, mask,
					DMA_TLB_PSI_FLUSH);
+	if (did)
+		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)


Hm, why 'if (did)' ? 


Domain ID zero is only special in caching mode. Should it be:
if (!cap_caching_mode(iommu->cap) || did)
?


Yes, you are right. Domain ID 0 is only reserved for caching mode. Will 
send a fix for this.
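
For reference, the corrected hunk would look roughly like this (a minimal
sketch assuming David's suggested condition is adopted verbatim; the actual
follow-up fix may differ):

	/* domain ID 0 is reserved only in caching mode; skip the
	 * device IOTLB flush for it in that case */
	if (!cap_caching_mode(iommu->cap) || did)
		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);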


Thanks!


Re: [PATCH v4 resend 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps

2009-05-17 Thread Yu Zhao

David Woodhouse wrote:

On Thu, 2009-05-14 at 10:32 +0800, Yu Zhao wrote:

Make iommu_flush_iotlb_psi() and flush_unmaps() more readable.


This doesn't apply any more.



Sorry, I'll rebase those patches and post them again.


[PATCH v5 0/6] ATS capability support for Intel IOMMU

2009-05-17 Thread Yu Zhao
This patch series implements Address Translation Service support for
the Intel IOMMU. A PCIe Endpoint that supports the ATS capability can
request DMA address translations from the IOMMU and cache the
translations itself. This can alleviate IOMMU TLB pressure and improve
hardware performance in an I/O virtualization environment.

ATS is one of the PCI-SIG I/O Virtualization (IOV) specifications. The
spec can be found at: http://www.pcisig.com/specifications/iov/ats/
(it requires membership).

Changelog:
v4 -> v5
  1, rebase to the latest IOMMU tree
v3 -> v4
  1, coding style fixes (Grant Grundler)
  2, support the Virtual Function ATS capability

v2 -> v3
  1, throw error message if VT-d hardware detects invalid descriptor
     on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS
     Invalidate Queue Depth (Matthew Wilcox)

v1 -> v2
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)

Yu Zhao (6):
  PCI: support the ATS capability
  PCI: handle Virtual Function ATS enabling
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c|  189 ++---
 drivers/pci/intel-iommu.c |  141 +--
 drivers/pci/iov.c |  155 --
 drivers/pci/pci.h |   39 +
 include/linux/dma_remapping.h |1 +
 include/linux/dmar.h  |9 ++
 include/linux/intel-iommu.h   |   16 +++-
 include/linux/pci.h   |2 +
 include/linux/pci_regs.h  |   10 ++
 9 files changed, 514 insertions(+), 48 deletions(-)



[PATCH v5 1/6] PCI: support the ATS capability

2009-05-17 Thread Yu Zhao
The PCIe ATS capability enables an Endpoint to request DMA address
translations from the IOMMU and cache them on the device side, thus
alleviating IOMMU pressure and improving hardware performance in an
I/O virtualization environment.

Signed-off-by: Yu Zhao yu.z...@intel.com
Acked-by: Jesse Barnes jbar...@virtuousgeek.org
---
 drivers/pci/iov.c|  105 ++
 drivers/pci/pci.h|   37 
 include/linux/pci.h  |2 +
 include/linux/pci_regs.h |   10 
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b497daa..0a7a1b4 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -5,6 +5,7 @@
  *
  * PCI Express I/O Virtualization (IOV) support.
  *   Single Root IOV 1.0
+ *   Address Translation Service 1.0
  */
 
 #include <linux/pci.h>
@@ -679,3 +680,107 @@ irqreturn_t pci_sriov_migration(struct pci_dev *dev)
return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_migration);
+
+static int ats_alloc_one(struct pci_dev *dev, int ps)
+{
+   int pos;
+   u16 cap;
+   struct pci_ats *ats;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   ats = kzalloc(sizeof(*ats), GFP_KERNEL);
+   if (!ats)
+   return -ENOMEM;
+
+	ats->pos = pos;
+	ats->stu = ps;
+	pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+	ats->qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+			    PCI_ATS_MAX_QDEP;
+	dev->ats = ats;
+
+   return 0;
+}
+
+static void ats_free_one(struct pci_dev *dev)
+{
+	kfree(dev->ats);
+	dev->ats = NULL;
+}
+
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @ps: the IOMMU page shift
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_ats(struct pci_dev *dev, int ps)
+{
+   int rc;
+   u16 ctrl;
+
+	BUG_ON(dev->ats);
+
+	if (ps < PCI_ATS_MIN_STU)
+   return -EINVAL;
+
+   rc = ats_alloc_one(dev, ps);
+   if (rc)
+   return rc;
+
+   ctrl = PCI_ATS_CTRL_ENABLE;
+   ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+   return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+   u16 ctrl;
+
+	BUG_ON(!dev->ats);
+
+	pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
+	ctrl &= ~PCI_ATS_CTRL_ENABLE;
+	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+   ats_free_one(dev);
+}
+
+/**
+ * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or negative on failure.
+ *
+ * The ATS spec uses 0 in the Invalidate Queue Depth field to
+ * indicate that the function can accept 32 Invalidate Requests.
+ * But here we use the `real' values (i.e. 1~32) for the Queue
+ * Depth.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+   int pos;
+   u16 cap;
+
+	if (dev->ats)
+		return dev->ats->qdep;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+	if (!pos)
+		return -ENODEV;
+
+	pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+
+   return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+  PCI_ATS_MAX_QDEP;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d03f6b9..3c2ec64 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -229,6 +229,13 @@ struct pci_sriov {
u8 __iomem *mstate; /* VF Migration State Array */
 };
 
+/* Address Translation Service */
+struct pci_ats {
+   int pos;/* capability position */
+   int stu;/* Smallest Translation Unit */
+   int qdep;   /* Invalidate Queue Depth */
+};
+
 #ifdef CONFIG_PCI_IOV
 extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
@@ -236,6 +243,20 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int 
resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
 extern int pci_iov_bus_range(struct pci_bus *bus);
+
+extern int pci_enable_ats(struct pci_dev *dev, int ps);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+	return !!dev->ats;
+}
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -257,6 +278,22 @@ static inline int pci_iov_bus_range(struct pci_bus *bus
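
For orientation, here is a hedged sketch of how an IOMMU driver might consume
the interface added above; pci_enable_ats(), pci_ats_queue_depth(),
pci_ats_enabled() and pci_disable_ats() come from this patch, while the
enable_device_ats()/disable_device_ats() wrappers and the page-shift value are
illustrative assumptions only:

	/* Illustrative only: turn on ATS for a device behind an IOMMU. */
	static int enable_device_ats(struct pci_dev *pdev, int page_shift)
	{
		int rc, qdep;

		rc = pci_enable_ats(pdev, page_shift);	/* e.g. 12 for 4KiB IOMMU pages */
		if (rc)
			return rc;	/* no ATS capability or page shift too small */

		/* the queue depth bounds the invalidate requests kept in flight */
		qdep = pci_ats_queue_depth(pdev);
		dev_info(&pdev->dev, "ATS enabled, invalidate queue depth %d\n", qdep);
		return 0;
	}

	/* and before detaching the device: */
	static void disable_device_ats(struct pci_dev *pdev)
	{
		if (pci_ats_enabled(pdev))
			pci_disable_ats(pdev);
	}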

[PATCH v5 2/6] PCI: handle Virtual Function ATS enabling

2009-05-17 Thread Yu Zhao
The SR-IOV spec requires that the Smallest Translation Unit and
the Invalidate Queue Depth fields in the Virtual Function ATS
capability are hardwired to 0. If a function is a Virtual Function,
then check and set its Physical Function's STU before enabling its
ATS capability.

Signed-off-by: Yu Zhao yu.z...@intel.com
Acked-by: Jesse Barnes jbar...@virtuousgeek.org
---
 drivers/pci/iov.c |   66 +---
 drivers/pci/pci.h |4 ++-
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0a7a1b4..4151404 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -491,10 +491,10 @@ found:
 
	if (pdev)
		iov->dev = pci_dev_get(pdev);
-	else {
+	else
		iov->dev = dev;
-		mutex_init(&iov->lock);
-	}
+
+	mutex_init(&iov->lock);
 
	dev->sriov = iov;
	dev->is_physfn = 1;
@@ -514,11 +514,11 @@ static void sriov_release(struct pci_dev *dev)
 {
	BUG_ON(dev->sriov->nr_virtfn);
 
-	if (dev == dev->sriov->dev)
-		mutex_destroy(&dev->sriov->lock);
-	else
+	if (dev != dev->sriov->dev)
		pci_dev_put(dev->sriov->dev);
 
+	mutex_destroy(&dev->sriov->lock);
+
	kfree(dev->sriov);
	dev->sriov = NULL;
 }
@@ -723,19 +723,40 @@ int pci_enable_ats(struct pci_dev *dev, int ps)
int rc;
u16 ctrl;
 
-	BUG_ON(dev->ats);
+	BUG_ON(dev->ats && dev->ats->is_enabled);
 
	if (ps < PCI_ATS_MIN_STU)
		return -EINVAL;
 
-	rc = ats_alloc_one(dev, ps);
-	if (rc)
-		return rc;
+	if (dev->is_physfn || dev->is_virtfn) {
+		struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+		mutex_lock(&pdev->sriov->lock);
+		if (pdev->ats)
+			rc = pdev->ats->stu == ps ? 0 : -EINVAL;
+		else
+			rc = ats_alloc_one(pdev, ps);
+
+		if (!rc)
+			pdev->ats->ref_cnt++;
+		mutex_unlock(&pdev->sriov->lock);
+		if (rc)
+			return rc;
+	}
+
+	if (!dev->is_physfn) {
+		rc = ats_alloc_one(dev, ps);
+		if (rc)
+			return rc;
+	}
 
	ctrl = PCI_ATS_CTRL_ENABLE;
-	ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+	if (!dev->is_virtfn)
+		ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
+	dev->ats->is_enabled = 1;
+
return 0;
 }
 
@@ -747,13 +768,26 @@ void pci_disable_ats(struct pci_dev *dev)
 {
u16 ctrl;
 
-	BUG_ON(!dev->ats);
+	BUG_ON(!dev->ats || !dev->ats->is_enabled);
 
	pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
	ctrl &= ~PCI_ATS_CTRL_ENABLE;
	pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
-	ats_free_one(dev);
+	dev->ats->is_enabled = 0;
+
+	if (dev->is_physfn || dev->is_virtfn) {
+		struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+		mutex_lock(&pdev->sriov->lock);
+		pdev->ats->ref_cnt--;
+		if (!pdev->ats->ref_cnt)
+			ats_free_one(pdev);
+		mutex_unlock(&pdev->sriov->lock);
+	}
+
+	if (!dev->is_physfn)
+		ats_free_one(dev);
 }
 
 /**
@@ -765,13 +799,17 @@ void pci_disable_ats(struct pci_dev *dev)
  * The ATS spec uses 0 in the Invalidate Queue Depth field to
  * indicate that the function can accept 32 Invalidate Requests.
  * But here we use the `real' values (i.e. 1~32) for the Queue
- * Depth.
+ * Depth; and 0 indicates the function shares the Queue with
+ * other functions (doesn't exclusively own a Queue).
  */
 int pci_ats_queue_depth(struct pci_dev *dev)
 {
int pos;
u16 cap;
 
+	if (dev->is_virtfn)
+		return 0;
+
	if (dev->ats)
		return dev->ats->qdep;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3c2ec64..f73bcbe 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -234,6 +234,8 @@ struct pci_ats {
int pos;/* capability position */
int stu;/* Smallest Translation Unit */
int qdep;   /* Invalidate Queue Depth */
+   int ref_cnt;/* Physical Function reference count */
+   int is_enabled:1;   /* Enable bit is set */
 };
 
 #ifdef CONFIG_PCI_IOV
@@ -255,7 +257,7 @@ extern int pci_ats_queue_depth(struct pci_dev *dev);
  */
 static inline int pci_ats_enabled(struct pci_dev *dev)
 {
-	return !!dev->ats;
+	return dev->ats && dev->ats->is_enabled;
 }
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
-- 
1.5.6.4



[PATCH v5 3/6] VT-d: parse ATSR in DMA Remapping Reporting Structure

2009-05-17 Thread Yu Zhao
Parse the Root Port ATS Capability Reporting Structure in the DMA
Remapping Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |  112 --
 include/linux/dmar.h|9 
 include/linux/intel-iommu.h |1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f23460a..6d7f961 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -267,6 +267,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
}
return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+   atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+   if (!atsru)
+   return -ENOMEM;
+
+	atsru->hdr = hdr;
+	atsru->include_all = atsr->flags & 0x1;
+
+	list_add(&atsru->list, &dmar_atsr_units);
+
+   return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+   int rc;
+   struct acpi_dmar_atsr *atsr;
+
+	if (atsru->include_all)
+		return 0;
+
+	atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+	rc = dmar_parse_dev_scope((void *)(atsr + 1),
+				(void *)atsr + atsr->header.length,
+				&atsru->devices_cnt, &atsru->devices,
+				atsr->segment);
+	if (rc || !atsru->devices_cnt) {
+		list_del(&atsru->list);
+		kfree(atsru);
+	}
+
+   return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+   int i;
+   struct pci_bus *bus;
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+	list_for_each_entry(atsru, &dmar_atsr_units, list) {
+		atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+		if (atsr->segment == pci_domain_nr(dev->bus))
+			goto found;
+	}
+
+   return 0;
+
+found:
+	for (bus = dev->bus; bus; bus = bus->parent) {
+		struct pci_dev *bridge = bus->self;
+
+		if (!bridge || !bridge->is_pcie ||
+		    bridge->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+			return 0;
+
+		if (bridge->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+			for (i = 0; i < atsru->devices_cnt; i++)
+				if (atsru->devices[i] == bridge)
+					return 1;
+			break;
+		}
+	}
+
+	if (atsru->include_all)
+   return 1;
+
+   return 0;
+}
 #endif
 
 static void __init
@@ -274,22 +352,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
 {
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_reserved_memory *rmrr;
+   struct acpi_dmar_atsr *atsr;
 
	switch (header->type) {
	case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-		drhd = (struct acpi_dmar_hardware_unit *)header;
+		drhd = container_of(header, struct acpi_dmar_hardware_unit,
+				    header);
		printk (KERN_INFO PREFIX
-			"DRHD (flags: 0x%08x)base: 0x%016Lx\n",
-			drhd->flags, (unsigned long long)drhd->address);
+			"DRHD base: %#016Lx flags: %#x\n",
+			(unsigned long long)drhd->address, drhd->flags);
		break;
	case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-		rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+		rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+				    header);
		printk (KERN_INFO PREFIX
-			"RMRR base: 0x%016Lx end: 0x%016Lx\n",
+			"RMRR base: %#016Lx end: %#016Lx\n",
			(unsigned long long)rmrr->base_address,
			(unsigned long long)rmrr->end_address);
		break;
+	case ACPI_DMAR_TYPE_ATSR:
+		atsr = container_of(header, struct acpi_dmar_atsr, header);
+		printk(KERN_INFO PREFIX "ATSR flags: %#x\n", atsr->flags);
+		break;
	}
 }
 
@@ -363,6 +447,11 @@ parse_dmar_table(void)
ret = dmar_parse_one_rmrr(entry_header);
 #endif
break;
+   case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+   ret = dmar_parse_one_atsr(entry_header);
+#endif
+   break;
default:
			printk(KERN_WARNING PREFIX
				"Unknown DMAR structure type\n");
@@ -431,11 +520,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
{
struct
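
To show how the new helper is meant to be consumed, here is a hedged sketch
mirroring the check done later in patch 6/6 (the wrapper name
device_may_use_ats() is illustrative only):

	/* Illustrative only: decide whether ATS may be enabled for a device. */
	static int device_may_use_ats(struct pci_dev *dev)
	{
		/* the Endpoint must expose the ATS extended capability ... */
		if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS))
			return 0;

		/* ... and its Root Port must be covered by an ATSR entry */
		return dmar_find_matched_atsr_unit(dev);
	}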

[PATCH v5 4/6] VT-d: add device IOTLB invalidation support

2009-05-17 Thread Yu Zhao
Support device IOTLB invalidation to flush the translation cached
in the Endpoint.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |   77 ++
 include/linux/intel-iommu.h |   14 +++-
 2 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index 6d7f961..7b287cb 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -699,7 +699,8 @@ void free_iommu(struct intel_iommu *iommu)
  */
 static inline void reclaim_free_desc(struct q_inval *qi)
 {
-	while (qi->desc_status[qi->free_tail] == QI_DONE) {
+	while (qi->desc_status[qi->free_tail] == QI_DONE ||
+	       qi->desc_status[qi->free_tail] == QI_ABORT) {
		qi->desc_status[qi->free_tail] = QI_FREE;
		qi->free_tail = (qi->free_tail + 1) % QI_LENGTH;
		qi->free_cnt++;
@@ -709,10 +710,13 @@ static inline void reclaim_free_desc(struct q_inval *qi)
 static int qi_check_fault(struct intel_iommu *iommu, int index)
 {
u32 fault;
-   int head;
+   int head, tail;
	struct q_inval *qi = iommu->qi;
	int wait_index = (index + 1) % QI_LENGTH;
 
+	if (qi->desc_status[wait_index] == QI_ABORT)
+		return -EAGAIN;
+
	fault = readl(iommu->reg + DMAR_FSTS_REG);
 
/*
@@ -722,7 +726,11 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
 */
	if (fault & DMA_FSTS_IQE) {
		head = readl(iommu->reg + DMAR_IQH_REG);
-		if ((head >> 4) == index) {
+		if ((head >> DMAR_IQ_SHIFT) == index) {
+			printk(KERN_ERR "VT-d detected invalid descriptor: "
+				"low=%llx, high=%llx\n",
+				(unsigned long long)qi->desc[index].low,
+				(unsigned long long)qi->desc[index].high);
			memcpy(&qi->desc[index], &qi->desc[wait_index],
					sizeof(struct qi_desc));
			__iommu_flush_cache(iommu, &qi->desc[index],
@@ -732,6 +740,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
}
}
 
+	/*
+	 * If ITE happens, all pending wait_desc commands are aborted.
+	 * No new descriptors are fetched until the ITE is cleared.
+	 */
+	if (fault & DMA_FSTS_ITE) {
+		head = readl(iommu->reg + DMAR_IQH_REG);
+		head = ((head >> DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+		head |= 1;
+		tail = readl(iommu->reg + DMAR_IQT_REG);
+		tail = ((tail >> DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+
+		writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
+
+		do {
+			if (qi->desc_status[head] == QI_IN_USE)
+				qi->desc_status[head] = QI_ABORT;
+			head = (head - 2 + QI_LENGTH) % QI_LENGTH;
+		} while (head != tail);
+
+		if (qi->desc_status[wait_index] == QI_ABORT)
+			return -EAGAIN;
+	}
+
+	if (fault & DMA_FSTS_ICE)
+		writel(DMA_FSTS_ICE, iommu->reg + DMAR_FSTS_REG);
+
return 0;
 }
 
@@ -741,7 +775,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
  */
 int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
-   int rc = 0;
+   int rc;
	struct q_inval *qi = iommu->qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
@@ -752,6 +786,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
 
	hw = qi->desc;
 
+restart:
+   rc = 0;
+
	spin_lock_irqsave(&qi->q_lock, flags);
	while (qi->free_cnt < 3) {
		spin_unlock_irqrestore(&qi->q_lock, flags);
@@ -782,7 +819,7 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
 * update the HW tail register indicating the presence of
 * new descriptors.
 */
-	writel(qi->free_head << 4, iommu->reg + DMAR_IQT_REG);
+	writel(qi->free_head << DMAR_IQ_SHIFT, iommu->reg + DMAR_IQT_REG);
 
	while (qi->desc_status[wait_index] != QI_DONE) {
/*
@@ -794,18 +831,21 @@ int qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 */
rc = qi_check_fault(iommu, index);
if (rc)
-   goto out;
+   break;
 
		spin_unlock(&qi->q_lock);
		cpu_relax();
		spin_lock(&qi->q_lock);
}
-out:
-	qi->desc_status[index] = qi->desc_status[wait_index] = QI_DONE;
+
+	qi->desc_status[index] = QI_DONE;
 
reclaim_free_desc(qi);
	spin_unlock_irqrestore(&qi->q_lock, flags);
 
+   if (rc == -EAGAIN)
+   goto restart;
+
return rc;
 }
 
@@ -857,6 +897,27 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, 
u64 addr

[PATCH v5 6/6] VT-d: support the device IOTLB

2009-05-17 Thread Yu Zhao
Enable the device IOTLB (i.e. ATS) for both the bare metal and KVM
environments.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c |  109 +---
 include/linux/dma_remapping.h |1 +
 include/linux/intel-iommu.h   |1 +
 3 files changed, 102 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 6d7cb84..c3cdfc9 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -252,6 +252,7 @@ struct device_domain_info {
u8 bus; /* PCI bus number */
u8 devfn;   /* PCI devfn number */
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
+   struct intel_iommu *iommu; /* IOMMU used by this device */
struct dmar_domain *domain; /* pointer to domain */
 };
 
@@ -945,6 +946,77 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
(unsigned long long)DMA_TLB_IAIG(val));
 }
 
+static struct device_domain_info *iommu_support_dev_iotlb(
+   struct dmar_domain *domain, int segment, u8 bus, u8 devfn)
+{
+   int found = 0;
+   unsigned long flags;
+   struct device_domain_info *info;
+   struct intel_iommu *iommu = device_to_iommu(segment, bus, devfn);
+
+	if (!ecap_dev_iotlb_support(iommu->ecap))
+		return NULL;
+
+	if (!iommu->qi)
+		return NULL;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link)
+		if (info->bus == bus && info->devfn == devfn) {
+			found = 1;
+			break;
+		}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	if (!found || !info->dev)
+		return NULL;
+
+	if (!pci_find_ext_capability(info->dev, PCI_EXT_CAP_ID_ATS))
+		return NULL;
+
+	if (!dmar_find_matched_atsr_unit(info->dev))
+		return NULL;
+
+	info->iommu = iommu;
+
+   return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+   if (!info)
+   return;
+
+	pci_enable_ats(info->dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+	if (!info->dev || !pci_ats_enabled(info->dev))
+		return;
+
+	pci_disable_ats(info->dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+ u64 addr, unsigned mask)
+{
+   u16 sid, qdep;
+   unsigned long flags;
+   struct device_domain_info *info;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link) {
+		if (!info->dev || !pci_ats_enabled(info->dev))
+			continue;
+
+		sid = info->bus << 8 | info->devfn;
+		qdep = pci_ats_queue_depth(info->dev);
+		qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
  u64 addr, unsigned int pages)
 {
@@ -965,6 +1037,8 @@ static void iommu_flush_iotlb_psi(struct intel_iommu 
*iommu, u16 did,
else
		iommu->flush.flush_iotlb(iommu, did, addr, mask,
					DMA_TLB_PSI_FLUSH);
+	if (did)
+		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -1305,6 +1379,7 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain, int segment,
unsigned long ndomains;
int id;
int agaw;
+   struct device_domain_info *info = NULL;
 
	pr_debug("Set context mapping for %02x:%02x.%d\n",
bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1372,15 +1447,21 @@ static int domain_context_mapping_one(struct 
dmar_domain *domain, int segment,
 
context_set_domain_id(context, id);
 
+   if (translation != CONTEXT_TT_PASS_THROUGH) {
+   info = iommu_support_dev_iotlb(domain, segment, bus, devfn);
+   translation = info ? CONTEXT_TT_DEV_IOTLB :
+CONTEXT_TT_MULTI_LEVEL;
+   }
/*
 * In pass through mode, AW must be programmed to indicate the largest
 * AGAW value supported by hardware. And ASR is ignored by hardware.
 */
-	if (likely(translation == CONTEXT_TT_MULTI_LEVEL)) {
-		context_set_address_width(context, iommu->agaw);
-		context_set_address_root(context, virt_to_phys(pgd));
-	} else
+	if (unlikely(translation == CONTEXT_TT_PASS_THROUGH))
		context_set_address_width(context, iommu->msagaw);
+	else {
+		context_set_address_root(context, virt_to_phys(pgd

[PATCH v4 resend 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps

2009-05-13 Thread Yu Zhao
Make iommu_flush_iotlb_psi() and flush_unmaps() more readable.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c |   46 +---
 1 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 001b328..a2cbc01 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -968,30 +968,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
-   unsigned int mask;
+   int rc;
+   unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 
	BUG_ON(addr & (~VTD_PAGE_MASK));
BUG_ON(pages == 0);
 
-	/* Fallback to domain selective flush if no PSI support */
-	if (!cap_pgsel_inv(iommu->cap))
-		return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-						DMA_TLB_DSI_FLUSH,
-						non_present_entry_flush);
-
/*
+* Fallback to domain selective flush if no PSI support or the size is
+* too big.
 * PSI requires page size to be 2 ^ x, and the base address is naturally
 * aligned to the size
 */
-	mask = ilog2(__roundup_pow_of_two(pages));
-	/* Fallback to domain selective flush if size is too big */
-	if (mask > cap_max_amask_val(iommu->cap))
-		return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-			DMA_TLB_DSI_FLUSH, non_present_entry_flush);
-
-	return iommu->flush.flush_iotlb(iommu, did, addr, mask,
-					DMA_TLB_PSI_FLUSH,
-					non_present_entry_flush);
+	if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
+		rc = iommu->flush.flush_iotlb(iommu, did, 0, 0,
+						DMA_TLB_DSI_FLUSH,
+						non_present_entry_flush);
+	else
+		rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
+						DMA_TLB_PSI_FLUSH,
+						non_present_entry_flush);
+	return rc;
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2214,15 +2211,16 @@ static void flush_unmaps(void)
if (!iommu)
continue;
 
-	if (deferred_flush[i].next) {
-		iommu->flush.flush_iotlb(iommu, 0, 0, 0,
-					 DMA_TLB_GLOBAL_FLUSH, 0);
-		for (j = 0; j < deferred_flush[i].next; j++) {
-			__free_iova(&deferred_flush[i].domain[j]->iovad,
-					deferred_flush[i].iova[j]);
-		}
-		deferred_flush[i].next = 0;
+	if (!deferred_flush[i].next)
+		continue;
+
+	iommu->flush.flush_iotlb(iommu, 0, 0, 0,
+				 DMA_TLB_GLOBAL_FLUSH, 0);
+	for (j = 0; j < deferred_flush[i].next; j++) {
+		__free_iova(&deferred_flush[i].domain[j]->iovad,
+				deferred_flush[i].iova[j]);
	}
+	deferred_flush[i].next = 0;
}
 
list_size = 0;
-- 
1.5.6.4



[PATCH v4 resend 0/6] ATS capability support for Intel IOMMU

2009-05-13 Thread Yu Zhao
This patch series implements Address Translation Service support for
the Intel IOMMU. A PCIe Endpoint that supports the ATS capability can
request DMA address translations from the IOMMU and cache the
translations itself. This can alleviate IOMMU TLB pressure and improve
hardware performance in an I/O virtualization environment.

ATS is one of the PCI-SIG I/O Virtualization (IOV) specifications. The
spec can be found at: http://www.pcisig.com/specifications/iov/ats/
(it requires membership).


Changelog:
v3 -> v4
  1, coding style fixes (Grant Grundler)
  2, support the Virtual Function ATS capability

v2 -> v3
  1, throw error message if VT-d hardware detects invalid descriptor
     on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS
     Invalidate Queue Depth (Matthew Wilcox)

v1 -> v2
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)

Yu Zhao (6):
  PCI: support the ATS capability
  PCI: handle Virtual Function ATS enabling
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c  |  189 +++---
 drivers/pci/intel-iommu.c   |  140 ++--
 drivers/pci/iov.c   |  155 ++--
 drivers/pci/pci.h   |   39 +
 include/linux/dmar.h|9 ++
 include/linux/intel-iommu.h |   16 -
 include/linux/pci.h |2 +
 include/linux/pci_regs.h|   10 +++
 8 files changed, 515 insertions(+), 45 deletions(-)



[PATCH v4 resend 1/6] PCI: support the ATS capability

2009-05-13 Thread Yu Zhao
The PCIe ATS capability enables an Endpoint to request DMA address
translations from the IOMMU and cache them on the device side, thus
alleviating IOMMU pressure and improving hardware performance in an
I/O virtualization environment.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c|  105 ++
 drivers/pci/pci.h|   37 
 include/linux/pci.h  |2 +
 include/linux/pci_regs.h |   10 
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b497daa..0a7a1b4 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -5,6 +5,7 @@
  *
  * PCI Express I/O Virtualization (IOV) support.
  *   Single Root IOV 1.0
+ *   Address Translation Service 1.0
  */
 
 #include linux/pci.h
@@ -679,3 +680,107 @@ irqreturn_t pci_sriov_migration(struct pci_dev *dev)
return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_migration);
+
+static int ats_alloc_one(struct pci_dev *dev, int ps)
+{
+   int pos;
+   u16 cap;
+   struct pci_ats *ats;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   ats = kzalloc(sizeof(*ats), GFP_KERNEL);
+   if (!ats)
+   return -ENOMEM;
+
+   ats-pos = pos;
+   ats-stu = ps;
+   pci_read_config_word(dev, pos + PCI_ATS_CAP, cap);
+   ats-qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+   PCI_ATS_MAX_QDEP;
+   dev-ats = ats;
+
+   return 0;
+}
+
+static void ats_free_one(struct pci_dev *dev)
+{
+   kfree(dev-ats);
+   dev-ats = NULL;
+}
+
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @ps: the IOMMU page shift
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_ats(struct pci_dev *dev, int ps)
+{
+   int rc;
+   u16 ctrl;
+
+   BUG_ON(dev-ats);
+
+   if (ps  PCI_ATS_MIN_STU)
+   return -EINVAL;
+
+   rc = ats_alloc_one(dev, ps);
+   if (rc)
+   return rc;
+
+   ctrl = PCI_ATS_CTRL_ENABLE;
+   ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+   pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
+
+   return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+   u16 ctrl;
+
+   BUG_ON(!dev-ats);
+
+   pci_read_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
+   ctrl = ~PCI_ATS_CTRL_ENABLE;
+   pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
+
+   ats_free_one(dev);
+}
+
+/**
+ * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or negative on failure.
+ *
+ * The ATS spec uses 0 in the Invalidate Queue Depth field to
+ * indicate that the function can accept 32 Invalidate Request.
+ * But here we use the `real' values (i.e. 1~32) for the Queue
+ * Depth.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+   int pos;
+   u16 cap;
+
+   if (dev-ats)
+   return dev-ats-qdep;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_ATS_CAP, cap);
+
+   return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+  PCI_ATS_MAX_QDEP;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d03f6b9..3c2ec64 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -229,6 +229,13 @@ struct pci_sriov {
u8 __iomem *mstate; /* VF Migration State Array */
 };
 
+/* Address Translation Service */
+struct pci_ats {
+   int pos;/* capability position */
+   int stu;/* Smallest Translation Unit */
+   int qdep;   /* Invalidate Queue Depth */
+};
+
 #ifdef CONFIG_PCI_IOV
 extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
@@ -236,6 +243,20 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int 
resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
 extern int pci_iov_bus_range(struct pci_bus *bus);
+
+extern int pci_enable_ats(struct pci_dev *dev, int ps);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+   return !!dev-ats;
+}
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -257,6 +278,22 @@ static inline int pci_iov_bus_range(struct pci_bus *bus)
 {
return 0;
 }
+
+static inline int

[PATCH v4 resend 3/6] VT-d: parse ATSR in DMA Remapping Reporting Structure

2009-05-13 Thread Yu Zhao
Parse the Root Port ATS Capability Reporting Structure in the DMA
Remapping Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |  112 --
 include/linux/dmar.h|9 
 include/linux/intel-iommu.h |1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index fa3a113..eaa405f 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -267,6 +267,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
}
return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+   atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+   if (!atsru)
+   return -ENOMEM;
+
+   atsru-hdr = hdr;
+   atsru-include_all = atsr-flags  0x1;
+
+   list_add(atsru-list, dmar_atsr_units);
+
+   return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+   int rc;
+   struct acpi_dmar_atsr *atsr;
+
+   if (atsru-include_all)
+   return 0;
+
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   rc = dmar_parse_dev_scope((void *)(atsr + 1),
+   (void *)atsr + atsr-header.length,
+   atsru-devices_cnt, atsru-devices,
+   atsr-segment);
+   if (rc || !atsru-devices_cnt) {
+   list_del(atsru-list);
+   kfree(atsru);
+   }
+
+   return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+   int i;
+   struct pci_bus *bus;
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   list_for_each_entry(atsru, dmar_atsr_units, list) {
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   if (atsr-segment == pci_domain_nr(dev-bus))
+   goto found;
+   }
+
+   return 0;
+
+found:
+   for (bus = dev-bus; bus; bus = bus-parent) {
+   struct pci_dev *bridge = bus-self;
+
+   if (!bridge || !bridge-is_pcie ||
+   bridge-pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+   return 0;
+
+   if (bridge-pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   for (i = 0; i  atsru-devices_cnt; i++)
+   if (atsru-devices[i] == bridge)
+   return 1;
+   break;
+   }
+   }
+
+   if (atsru-include_all)
+   return 1;
+
+   return 0;
+}
 #endif
 
 static void __init
@@ -274,22 +352,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
 {
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_reserved_memory *rmrr;
+   struct acpi_dmar_atsr *atsr;
 
switch (header-type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-   drhd = (struct acpi_dmar_hardware_unit *)header;
+   drhd = container_of(header, struct acpi_dmar_hardware_unit,
+   header);
printk (KERN_INFO PREFIX
-   DRHD (flags: 0x%08x)base: 0x%016Lx\n,
-   drhd-flags, (unsigned long long)drhd-address);
+   DRHD base: %#016Lx flags: %#x\n,
+   (unsigned long long)drhd-address, drhd-flags);
break;
case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-   rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+   rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+   header);
printk (KERN_INFO PREFIX
-   RMRR base: 0x%016Lx end: 0x%016Lx\n,
+   RMRR base: %#016Lx end: %#016Lx\n,
(unsigned long long)rmrr-base_address,
(unsigned long long)rmrr-end_address);
break;
+   case ACPI_DMAR_TYPE_ATSR:
+   atsr = container_of(header, struct acpi_dmar_atsr, header);
+   printk(KERN_INFO PREFIX ATSR flags: %#x\n, atsr-flags);
+   break;
}
 }
 
@@ -363,6 +447,11 @@ parse_dmar_table(void)
ret = dmar_parse_one_rmrr(entry_header);
 #endif
break;
+   case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+   ret = dmar_parse_one_atsr(entry_header);
+#endif
+   break;
default:
printk(KERN_WARNING PREFIX
Unknown DMAR structure type\n);
@@ -431,11 +520,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
{
struct

[PATCH v4 resend 2/6] PCI: handle Virtual Function ATS enabling

2009-05-13 Thread Yu Zhao
The SR-IOV spec requires that the Smallest Translation Unit and
the Invalidate Queue Depth fields in the Virtual Function ATS
capability are hardwired to 0. If a function is a Virtual Function,
then check and set its Physical Function's STU before enabling its
ATS capability.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   66 +---
 drivers/pci/pci.h |4 ++-
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0a7a1b4..4151404 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -491,10 +491,10 @@ found:
 
if (pdev)
iov-dev = pci_dev_get(pdev);
-   else {
+   else
iov-dev = dev;
-   mutex_init(iov-lock);
-   }
+
+   mutex_init(iov-lock);
 
dev-sriov = iov;
dev-is_physfn = 1;
@@ -514,11 +514,11 @@ static void sriov_release(struct pci_dev *dev)
 {
BUG_ON(dev-sriov-nr_virtfn);
 
-   if (dev == dev-sriov-dev)
-   mutex_destroy(dev-sriov-lock);
-   else
+   if (dev != dev-sriov-dev)
pci_dev_put(dev-sriov-dev);
 
+   mutex_destroy(dev-sriov-lock);
+
kfree(dev-sriov);
dev-sriov = NULL;
 }
@@ -723,19 +723,40 @@ int pci_enable_ats(struct pci_dev *dev, int ps)
int rc;
u16 ctrl;
 
-   BUG_ON(dev-ats);
+   BUG_ON(dev-ats  dev-ats-is_enabled);
 
if (ps  PCI_ATS_MIN_STU)
return -EINVAL;
 
-   rc = ats_alloc_one(dev, ps);
-   if (rc)
-   return rc;
+   if (dev-is_physfn || dev-is_virtfn) {
+   struct pci_dev *pdev = dev-is_physfn ? dev : dev-physfn;
+
+   mutex_lock(pdev-sriov-lock);
+   if (pdev-ats)
+   rc = pdev-ats-stu == ps ? 0 : -EINVAL;
+   else
+   rc = ats_alloc_one(pdev, ps);
+
+   if (!rc)
+   pdev-ats-ref_cnt++;
+   mutex_unlock(pdev-sriov-lock);
+   if (rc)
+   return rc;
+   }
+
+   if (!dev-is_physfn) {
+   rc = ats_alloc_one(dev, ps);
+   if (rc)
+   return rc;
+   }
 
ctrl = PCI_ATS_CTRL_ENABLE;
-   ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+   if (!dev-is_virtfn)
+   ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
 
+   dev-ats-is_enabled = 1;
+
return 0;
 }
 
@@ -747,13 +768,26 @@ void pci_disable_ats(struct pci_dev *dev)
 {
u16 ctrl;
 
-   BUG_ON(!dev-ats);
+   BUG_ON(!dev-ats || !dev-ats-is_enabled);
 
pci_read_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
ctrl = ~PCI_ATS_CTRL_ENABLE;
pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
 
-   ats_free_one(dev);
+   dev-ats-is_enabled = 0;
+
+   if (dev-is_physfn || dev-is_virtfn) {
+   struct pci_dev *pdev = dev-is_physfn ? dev : dev-physfn;
+
+   mutex_lock(pdev-sriov-lock);
+   pdev-ats-ref_cnt--;
+   if (!pdev-ats-ref_cnt)
+   ats_free_one(pdev);
+   mutex_unlock(pdev-sriov-lock);
+   }
+
+   if (!dev-is_physfn)
+   ats_free_one(dev);
 }
 
 /**
@@ -765,13 +799,17 @@ void pci_disable_ats(struct pci_dev *dev)
  * The ATS spec uses 0 in the Invalidate Queue Depth field to
  * indicate that the function can accept 32 Invalidate Request.
  * But here we use the `real' values (i.e. 1~32) for the Queue
- * Depth.
+ * Depth; and 0 indicates the function shares the Queue with
+ * other functions (doesn't exclusively own a Queue).
  */
 int pci_ats_queue_depth(struct pci_dev *dev)
 {
int pos;
u16 cap;
 
+   if (dev-is_virtfn)
+   return 0;
+
if (dev-ats)
return dev-ats-qdep;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3c2ec64..f73bcbe 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -234,6 +234,8 @@ struct pci_ats {
int pos;/* capability position */
int stu;/* Smallest Translation Unit */
int qdep;   /* Invalidate Queue Depth */
+   int ref_cnt;/* Physical Function reference count */
+   int is_enabled:1;   /* Enable bit is set */
 };
 
 #ifdef CONFIG_PCI_IOV
@@ -255,7 +257,7 @@ extern int pci_ats_queue_depth(struct pci_dev *dev);
  */
 static inline int pci_ats_enabled(struct pci_dev *dev)
 {
-   return !!dev-ats;
+   return dev-ats  dev-ats-is_enabled;
 }
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
-- 
1.5.6.4



[PATCH v4 resend 4/6] VT-d: add device IOTLB invalidation support

2009-05-13 Thread Yu Zhao
Support device IOTLB invalidation to flush the translation cached
in the Endpoint.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |   77 ++
 include/linux/intel-iommu.h |   14 +++-
 2 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index eaa405f..6afd804 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -690,7 +690,8 @@ void free_iommu(struct intel_iommu *iommu)
  */
 static inline void reclaim_free_desc(struct q_inval *qi)
 {
-   while (qi-desc_status[qi-free_tail] == QI_DONE) {
+   while (qi-desc_status[qi-free_tail] == QI_DONE ||
+  qi-desc_status[qi-free_tail] == QI_ABORT) {
qi-desc_status[qi-free_tail] = QI_FREE;
qi-free_tail = (qi-free_tail + 1) % QI_LENGTH;
qi-free_cnt++;
@@ -700,10 +701,13 @@ static inline void reclaim_free_desc(struct q_inval *qi)
 static int qi_check_fault(struct intel_iommu *iommu, int index)
 {
u32 fault;
-   int head;
+   int head, tail;
struct q_inval *qi = iommu-qi;
int wait_index = (index + 1) % QI_LENGTH;
 
+   if (qi-desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+
fault = readl(iommu-reg + DMAR_FSTS_REG);
 
/*
@@ -713,7 +717,11 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
 */
if (fault  DMA_FSTS_IQE) {
head = readl(iommu-reg + DMAR_IQH_REG);
-   if ((head  4) == index) {
+   if ((head  DMAR_IQ_SHIFT) == index) {
+   printk(KERN_ERR VT-d detected invalid descriptor: 
+   low=%llx, high=%llx\n,
+   (unsigned long long)qi-desc[index].low,
+   (unsigned long long)qi-desc[index].high);
memcpy(qi-desc[index], qi-desc[wait_index],
sizeof(struct qi_desc));
__iommu_flush_cache(iommu, qi-desc[index],
@@ -723,6 +731,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
}
}
 
+   /*
+* If ITE happens, all pending wait_desc commands are aborted.
+* No new descriptors are fetched until the ITE is cleared.
+*/
+   if (fault  DMA_FSTS_ITE) {
+   head = readl(iommu-reg + DMAR_IQH_REG);
+   head = ((head  DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+   head |= 1;
+   tail = readl(iommu-reg + DMAR_IQT_REG);
+   tail = ((tail  DMAR_IQ_SHIFT) - 1 + QI_LENGTH) % QI_LENGTH;
+
+   writel(DMA_FSTS_ITE, iommu-reg + DMAR_FSTS_REG);
+
+   do {
+   if (qi-desc_status[head] == QI_IN_USE)
+   qi-desc_status[head] = QI_ABORT;
+   head = (head - 2 + QI_LENGTH) % QI_LENGTH;
+   } while (head != tail);
+
+   if (qi-desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+   }
+
+   if (fault  DMA_FSTS_ICE)
+   writel(DMA_FSTS_ICE, iommu-reg + DMAR_FSTS_REG);
+
return 0;
 }
 
@@ -732,7 +766,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
  */
 int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
-   int rc = 0;
+   int rc;
struct q_inval *qi = iommu-qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
@@ -743,6 +777,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
 
hw = qi-desc;
 
+restart:
+   rc = 0;
+
spin_lock_irqsave(qi-q_lock, flags);
while (qi-free_cnt  3) {
spin_unlock_irqrestore(qi-q_lock, flags);
@@ -773,7 +810,7 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
 * update the HW tail register indicating the presence of
 * new descriptors.
 */
-   writel(qi-free_head  4, iommu-reg + DMAR_IQT_REG);
+   writel(qi-free_head  DMAR_IQ_SHIFT, iommu-reg + DMAR_IQT_REG);
 
while (qi-desc_status[wait_index] != QI_DONE) {
/*
@@ -785,18 +822,21 @@ int qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 */
rc = qi_check_fault(iommu, index);
if (rc)
-   goto out;
+   break;
 
spin_unlock(qi-q_lock);
cpu_relax();
spin_lock(qi-q_lock);
}
-out:
-   qi-desc_status[index] = qi-desc_status[wait_index] = QI_DONE;
+
+   qi-desc_status[index] = QI_DONE;
 
reclaim_free_desc(qi);
spin_unlock_irqrestore(qi-q_lock, flags);
 
+   if (rc == -EAGAIN)
+   goto restart;
+
return rc;
 }
 
@@ -863,6 +903,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 
addr

[PATCH v4 resend 6/6] VT-d: support the device IOTLB

2009-05-13 Thread Yu Zhao
Enable the device IOTLB (i.e. ATS) for both the bare metal and KVM
environments.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c   |  100 +-
 include/linux/intel-iommu.h |1 +
 2 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index a2cbc01..661a02b 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -128,6 +128,7 @@ static inline void context_set_fault_enable(struct 
context_entry *context)
 }
 
 #define CONTEXT_TT_MULTI_LEVEL 0
+#define CONTEXT_TT_DEV_IOTLB   1
 
 static inline void context_set_translation_type(struct context_entry *context,
unsigned long value)
@@ -251,6 +252,7 @@ struct device_domain_info {
int segment;/* PCI domain */
u8 bus; /* PCI bus number */
u8 devfn;   /* PCI devfn number */
+   struct intel_iommu *iommu; /* IOMMU used by this device */
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct dmar_domain *domain; /* pointer to domain */
 };
@@ -965,6 +967,81 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
return 0;
 }
 
+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain,
+   int segment, u8 bus, u8 devfn)
+{
+   int found = 0;
+   unsigned long flags;
+   struct device_domain_info *info;
+   struct intel_iommu *iommu = device_to_iommu(segment, bus, devfn);
+
+   if (!ecap_dev_iotlb_support(iommu-ecap))
+   return NULL;
+
+   if (!iommu-qi)
+   return NULL;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link)
+   if (info-bus == bus  info-devfn == devfn) {
+   found = 1;
+   break;
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+
+   if (!found || !info-dev)
+   return NULL;
+
+   if (!pci_find_ext_capability(info-dev, PCI_EXT_CAP_ID_ATS))
+   return NULL;
+
+   if (!dmar_find_matched_atsr_unit(info-dev))
+   return NULL;
+
+   info-iommu = iommu;
+
+   return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+   if (!info)
+   return;
+
+   pci_enable_ats(info-dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+   if (!info-dev || !pci_ats_enabled(info-dev))
+   return;
+
+   pci_disable_ats(info-dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+ u64 addr, unsigned mask)
+{
+   int rc;
+   u16 sid, qdep;
+   unsigned long flags;
+   struct device_domain_info *info;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link) {
+   if (!info-dev || !pci_ats_enabled(info-dev))
+   continue;
+
+   sid = info-bus  8 | info-devfn;
+   qdep = pci_ats_queue_depth(info-dev);
+   rc = qi_flush_dev_iotlb(info-iommu, sid, qdep, addr, mask);
+   if (rc)
+   dev_err(info-dev-dev, flush IOTLB failed\n);
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -988,6 +1065,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu 
*iommu, u16 did,
rc = iommu-flush.flush_iotlb(iommu, did, addr, mask,
DMA_TLB_PSI_FLUSH,
non_present_entry_flush);
+   if (!rc  !non_present_entry_flush)
+   iommu_flush_dev_iotlb(iommu-domains[did], addr, mask);
+
return rc;
 }
 
@@ -1329,6 +1409,7 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
unsigned long ndomains;
int id;
int agaw;
+   struct device_domain_info *info;
 
pr_debug(Set context mapping for %02x:%02x.%d\n,
bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1394,7 +1475,9 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
context_set_domain_id(context, id);
context_set_address_width(context, iommu-agaw);
context_set_address_root(context, virt_to_phys(pgd));
-   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+   info = iommu_support_dev_iotlb(domain, segment, bus, devfn);
+   context_set_translation_type(context,
+   info ? CONTEXT_TT_DEV_IOTLB : CONTEXT_TT_MULTI_LEVEL);
context_set_fault_enable(context);
context_set_present(context

Re: KVM x86_64 with SR-IOV..?

2009-05-04 Thread Yu Zhao
Hi,

The VF also works in the host if the VF driver is programmed properly,
so it is easier to develop the VF driver in the host first and then
verify it in the guest.

BTW, I didn't see SR-IOV being enabled in your dmesg. Did you select
CONFIG_PCI_IOV in the kernel .config?
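
(A quick way to check, assuming a distro kernel that installs its config
under /boot; the paths below are illustrative:)

  # confirm SR-IOV support is built into the running kernel
  grep CONFIG_PCI_IOV /boot/config-$(uname -r)
  # or, if the kernel exposes its own config:
  zgrep CONFIG_PCI_IOV /proc/config.gz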

Thanks,
Yu

On Mon, May 04, 2009 at 06:40:36PM +0800, Nicholas A. Bellinger wrote:
 On Mon, 2009-05-04 at 17:49 +0800, Sheng Yang wrote:
  On Monday 04 May 2009 17:11:59 Nicholas A. Bellinger wrote:
   On Mon, 2009-05-04 at 16:20 +0800, Sheng Yang wrote:
On Monday 04 May 2009 12:36:04 Nicholas A. Bellinger wrote:
 On Mon, 2009-05-04 at 10:09 +0800, Sheng Yang wrote:
  On Monday 04 May 2009 08:53:07 Nicholas A. Bellinger wrote:
   On Sat, 2009-05-02 at 18:22 +0800, Sheng Yang wrote:
On Thu, Apr 30, 2009 at 01:22:54PM -0700, Nicholas A. Bellinger
  wrote:
 Greetings KVM folks,

 I wondering if any information exists for doing SR-IOV on the
 new VT-d capable chipsets with KVM..?  From what I understand
 the patches for doing this with KVM are floating around, but I
 have been unable to find any user-level docs for actually
 making it all go against a upstream v2.6.30-rc3 code..

 So far I have been doing IOV testing with Xen 3.3 and
 3.4.0-pre, and I am really hoping to be able to jump to KVM 
 for
 single-function and and then multi-function SR-IOV.  I know
 that the VM migration stuff for IOV in Xen is up and running,
 and I assume it is being worked in for KVM instance migration
 as well..? This part is less important (at least for me :-)
 than getting a stable SR-IOV setup running under the KVM
 hypervisor..  Does anyone have any pointers for this..?

 Any comments or suggestions are appreciated!
   
Hi Nicholas
   
The patches are not floating around now. As you know, SR-IOV for
Linux have been in 2.6.30, so then you can use upstream KVM and
qemu-kvm(or recent released kvm-85) with 2.6.30-rc3 as host
kernel. And some time ago, there are several SRIOV related
patches for qemu-kvm, and now they all have been checked in.
   
And for KVM, the extra document is not necessary, for you can
simple assign a VF to guest like any other devices. And how to
create VF is specific for each device driver. So just create a 
VF
then assign it to KVM guest is fine.
  
   Greetings Sheng,
  
   So, I have been trying the latest kvm-85 release on a v2.6.30-rc3
   checkout from linux-2.6.git on a CentOS 5u3 x86_64 install on 
   Intel
   IOH-5520 based dual socket Nehalem board.  I have enabled DMAR and
   Interrupt Remapping my KVM host using v2.6.30-rc3 and from what I
   can tell, the KVM_CAP_* defines from libkvm are enabled with
   building kvm-85 after './configure
   --kerneldir=/usr/src/linux-2.6.git' and the PCI passthrough code 
   is
   being enabled in
   kvm-85/qemu/hw/device-assignment.c AFAICT..
  
   From there, I use the freshly installed qemu-x86_64-system binary
to
  
   start a Debian 5 x86_64 HVM (that previously had been moving
   network packets under Xen for PCIe passthrough). I see the MSI-X
   interrupt remapping working on the KVM host for the passed
   -pcidevice, and the MMIO mappings from the qemu build that I also
   saw while using Xen/qemu-dm built with PCI passthrough are there 
   as
   well..
 
  Hi Nicholas
 
   But while the KVM guest is booting, I see the following
   exception(s) from qemu-x86_64-system for one of the VFs for a
   multi-function PCIe device:
  
   BUG: kvm_destroy_phys_mem: invalid parameters (slot=-1)
 
  This one is mostly harmless.

 Ok, good to know..  :-)

   I try with one of the on-board e1000e ports (02:00.0) and I see 
   the
   same exception along with some MSI-X exceptions from
   qemu-x86_64-system in KVM guest.. However, I am still able to see
   the e1000e and the other vxge multi-function device with lspci, 
   but
   I am unable to dhcp or ping with the e1000e and VF from
   multi-function device fails to register the MSI-X interrupt in the
   guest..
 
  Did you see the interrupt in the guest and host side?

 Ok, I am restarting the e1000e test with a fresh Fedora 11 install and
 KVM host kernel 2.6.29.1-111.fc11.x86_64.   After unbinding and
 attaching the e1000e single-function device at 02:00.0 to pci-stub
 with:

echo 8086 10d3 > /sys/bus/pci/drivers/pci-stub/new_id
echo 0000:02:00.0 > /sys/bus/pci/devices/0000:02:00.0/driver/unbind
echo 0000:02:00.0 > /sys/bus/pci/drivers/pci-stub/bind

 I see the following the KVM host kernel ring buffer:
  

[RFC PATCH 1/3] PCI: rewrite Function Level Reset

2009-04-06 Thread Yu Zhao
Changes:
  1) remove disable_irq() so the shared IRQ won't be disabled.
  2) replace the 1s wait with 100, 200 and 400ms wait intervals
 for the Pending Transaction.
  3) replace mdelay() with msleep().
  4) add might_sleep().
  5) lock the device to prevent PM suspend from accessing the CSRs
 during the reset.
  6) coding style fixes.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/pci.c   |  166 ++-
 include/linux/pci.h |2 +-
 2 files changed, 85 insertions(+), 83 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index af4db4e..46ae997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2008,111 +2008,112 @@ int pci_set_dma_seg_boundary(struct pci_dev *dev, 
unsigned long mask)
 EXPORT_SYMBOL(pci_set_dma_seg_boundary);
 #endif
 
-static int __pcie_flr(struct pci_dev *dev, int probe)
+static int pcie_flr(struct pci_dev *dev, int probe)
 {
-   u16 status;
+   int i;
+   int pos;
u32 cap;
-   int exppos = pci_find_capability(dev, PCI_CAP_ID_EXP);
+   u16 status;
 
-   if (!exppos)
+   pos = pci_find_capability(dev, PCI_CAP_ID_EXP);
+   if (!pos)
return -ENOTTY;
-	pci_read_config_dword(dev, exppos + PCI_EXP_DEVCAP, &cap);
+
+	pci_read_config_dword(dev, pos + PCI_EXP_DEVCAP, &cap);
	if (!(cap & PCI_EXP_DEVCAP_FLR))
return -ENOTTY;
 
if (probe)
return 0;
 
-	pci_block_user_cfg_access(dev);
-
	/* Wait for Transaction Pending bit clean */
-	pci_read_config_word(dev, exppos + PCI_EXP_DEVSTA, &status);
-	if (!(status & PCI_EXP_DEVSTA_TRPND))
-		goto transaction_done;
+	for (i = 0; i < 4; i++) {
+		if (i)
+			msleep((1 << (i - 1)) * 100);
 
-	msleep(100);
-	pci_read_config_word(dev, exppos + PCI_EXP_DEVSTA, &status);
-	if (!(status & PCI_EXP_DEVSTA_TRPND))
-		goto transaction_done;
-
-	dev_info(&dev->dev, "Busy after 100ms while trying to reset; "
-			"sleeping for 1 second\n");
-	ssleep(1);
-	pci_read_config_word(dev, exppos + PCI_EXP_DEVSTA, &status);
-	if (status & PCI_EXP_DEVSTA_TRPND)
-		dev_info(&dev->dev, "Still busy after 1s; "
-			"proceeding with reset anyway\n");
-
-transaction_done:
-	pci_write_config_word(dev, exppos + PCI_EXP_DEVCTL,
+		pci_read_config_word(dev, pos + PCI_EXP_DEVSTA, &status);
+		if (!(status & PCI_EXP_DEVSTA_TRPND))
+			goto clear;
+	}
+
+	dev_err(&dev->dev, "transaction is not cleared; "
+			"proceeding with reset anyway\n");
+
+clear:
+	pci_write_config_word(dev, pos + PCI_EXP_DEVCTL,
				PCI_EXP_DEVCTL_BCR_FLR);
-	mdelay(100);
+	msleep(100);
 
-	pci_unblock_user_cfg_access(dev);
	return 0;
 }
 
-static int __pci_af_flr(struct pci_dev *dev, int probe)
+static int pci_af_flr(struct pci_dev *dev, int probe)
 {
-   int cappos = pci_find_capability(dev, PCI_CAP_ID_AF);
-   u8 status;
+   int i;
+   int pos;
u8 cap;
+   u8 status;
 
-   if (!cappos)
+   pos = pci_find_capability(dev, PCI_CAP_ID_AF);
+   if (!pos)
return -ENOTTY;
-   pci_read_config_byte(dev, cappos + PCI_AF_CAP, cap);
+
+   pci_read_config_byte(dev, pos + PCI_AF_CAP, cap);
if (!(cap  PCI_AF_CAP_TP) || !(cap  PCI_AF_CAP_FLR))
return -ENOTTY;
 
if (probe)
return 0;
 
-   pci_block_user_cfg_access(dev);
-
/* Wait for Transaction Pending bit clean */
-   pci_read_config_byte(dev, cappos + PCI_AF_STATUS, status);
-   if (!(status  PCI_AF_STATUS_TP))
-   goto transaction_done;
+   for (i = 0; i  4; i++) {
+   if (i)
+   msleep((1  (i - 1)) * 100);
+
+   pci_read_config_byte(dev, pos + PCI_AF_STATUS, status);
+   if (!(status  PCI_AF_STATUS_TP))
+   goto clear;
+   }
+
+   dev_err(dev-dev, transaction is not cleared; 
+   proceeding with reset anyway\n);
 
+clear:
+   pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
msleep(100);
-   pci_read_config_byte(dev, cappos + PCI_AF_STATUS, status);
-   if (!(status  PCI_AF_STATUS_TP))
-   goto transaction_done;
-
-   dev_info(dev-dev, Busy after 100ms while trying to
-reset; sleeping for 1 second\n);
-   ssleep(1);
-   pci_read_config_byte(dev, cappos + PCI_AF_STATUS, status);
-   if (status  PCI_AF_STATUS_TP)
-   dev_info(dev-dev, Still busy after 1s; 
-   proceeding with reset anyway\n);
-
-transaction_done:
-   pci_write_config_byte(dev, cappos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
-   mdelay(100

[RFC PATCH 3/3] PCI: support Secondary Bus Reset

2009-04-06 Thread Yu Zhao
The PCI-to-PCI Bridge Architecture Specification 1.2 specifies that
the Secondary Bus Reset bit can
force the assertion of RST# on the secondary interface, which can be
used to reset all devices including subordinates under this bus. This
can be used to reset a function if this function is the only device
under this bus.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/pci.c |   31 +++
 1 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e459a0b..a77c33a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2115,6 +2115,33 @@ static int pci_pm_flr(struct pci_dev *dev, int probe)
return 0;
 }
 
+static int pci_secondary_bus_reset(struct pci_dev *dev, int probe)
+{
+   u16 ctrl;
+   struct pci_dev *pdev;
+
+   if (dev-subordinate)
+   return -ENOTTY;
+
+   list_for_each_entry(pdev, dev-bus-devices, bus_list)
+   if (pdev != dev)
+   return -ENOTTY;
+
+   if (probe)
+   return 0;
+
+   pci_read_config_word(dev-bus-self, PCI_BRIDGE_CONTROL, ctrl);
+   ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+   pci_write_config_word(dev-bus-self, PCI_BRIDGE_CONTROL, ctrl);
+   msleep(100);
+
+   ctrl = ~PCI_BRIDGE_CTL_BUS_RESET;
+   pci_write_config_word(dev-bus-self, PCI_BRIDGE_CONTROL, ctrl);
+   msleep(100);
+
+   return 0;
+}
+
 static int pci_dev_reset(struct pci_dev *dev, int probe)
 {
int rc;
@@ -2136,6 +2163,10 @@ static int pci_dev_reset(struct pci_dev *dev, int probe)
goto done;
 
rc = pci_pm_flr(dev, probe);
+   if (rc != -ENOTTY)
+   goto done;
+
+   rc = pci_secondary_bus_reset(dev, probe);
 done:
up(dev-dev.sem);
 
-- 
1.5.6.4



Re: [PATCH v4 0/6] PCI: support the ATS capability

2009-03-29 Thread Yu Zhao
On Sun, Mar 29, 2009 at 09:51:31PM +0800, Matthew Wilcox wrote:
 On Thu, Mar 26, 2009 at 04:15:56PM -0700, Jesse Barnes wrote:
 2, avoid using pci_find_ext_capability every time when reading ATS
Invalidate Queue Depth (Matthew Wilcox)
 
 I asked a question about how that was used, and got back a version which
 changed how it was done.  I still don't have an answer to my question.

VT-d hardware is designed so that the Invalidate Queue Depth is used
every time the software prepares an Invalidate Request descriptor.
This happens when the device's IOMMU mapping changes (i.e. the device
driver calls DMA map/unmap if the device is used by the host; or when
a guest is started/destroyed if the device is assigned to that guest).

Given that DMA map/unmap are used very frequently, I suppose the queue
depth should be cached somewhere. It used to be cached in the VT-d
private data structure (before v3) because I wasn't sure how the IOMMU
hardware from other vendors uses the queue depth.

After you commented on the code, I checked the AMD/IBM/Sun IOMMUs: the
AMD IOMMU also uses the queue depth for every Invalidate Request
descriptor; the IBM/Sun IOMMUs don't appear to support the ATS. So it's
reasonable to cache the queue depth in the PCI subsystem, since all
IOMMUs that support the ATS use it in the same way (very frequently),
right?
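
To illustrate (this snippet is not part of the patches, and it assumes
the pci_ats structure from patch 1/6 has already been allocated by
pci_enable_ats()), the difference on the DMA unmap hot path is roughly:

	/* Cached: one field read per Invalidate Request descriptor. */
	static int ats_qdep_cached(struct pci_dev *dev)
	{
		return dev->ats->qdep;	/* filled in at ATS enable time */
	}

	/* Uncached: re-walks extended config space on every DMA unmap. */
	static int ats_qdep_uncached(struct pci_dev *dev)
	{
		int pos;
		u16 cap;

		pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
		if (!pos)
			return -ENODEV;

		pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
		return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
		       PCI_ATS_MAX_QDEP;
	}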


[PATCH v4 0/6] PCI: support the ATS capability

2009-03-23 Thread Yu Zhao
This patch series implements Address Translation Service support for
the Intel IOMMU. The PCIe Endpoint that supports ATS capability can
request the DMA address translation from the IOMMU and cache the
translation itself. This can alleviate IOMMU TLB pressure and improve
the hardware performance in the I/O virtualization environment.

The ATS is one of PCI-SIG I/O Virtualization (IOV) Specifications. The
spec can be found at: http://www.pcisig.com/specifications/iov/ats/
(it requires membership).


Changelog:
v3 - v4
  1, coding style fixes (Grant Grundler)
  2, support the Virtual Function ATS capability

v2 - v3
  1, throw error message if VT-d hardware detects invalid descriptor
 on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS
 Invalidate Queue Depth (Matthew Wilcox)

v1 - v2
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)


Yu Zhao (6):
  PCI: support the ATS capability
  PCI: handle Virtual Function ATS enabling
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c  |  189 +++---
 drivers/pci/intel-iommu.c   |  139 ++--
 drivers/pci/iov.c   |  155 ++--
 drivers/pci/pci.h   |   39 +
 include/linux/dmar.h|9 ++
 include/linux/intel-iommu.h |   16 -
 include/linux/pci.h |2 +
 include/linux/pci_regs.h|   10 +++
 8 files changed, 514 insertions(+), 45 deletions(-)



[PATCH v4 1/6] PCI: support the ATS capability

2009-03-23 Thread Yu Zhao
The PCIe ATS capability allows the Endpoint to request the DMA address
translation from the IOMMU and cache the translation on the device
side, thus alleviating IOMMU TLB pressure and improving the hardware
performance in the I/O virtualization environment.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c|  105 ++
 drivers/pci/pci.h|   37 
 include/linux/pci.h  |2 +
 include/linux/pci_regs.h |   10 
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 7227efc..8a9817c 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -5,6 +5,7 @@
  *
  * PCI Express I/O Virtualization (IOV) support.
  *   Single Root IOV 1.0
+ *   Address Translation Service 1.0
  */
 
 #include linux/pci.h
@@ -678,3 +679,107 @@ irqreturn_t pci_sriov_migration(struct pci_dev *dev)
return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_migration);
+
+static int ats_alloc_one(struct pci_dev *dev, int pgshift)
+{
+   int pos;
+   u16 cap;
+   struct pci_ats *ats;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   ats = kzalloc(sizeof(*ats), GFP_KERNEL);
+   if (!ats)
+   return -ENOMEM;
+
+   ats-pos = pos;
+   ats-stu = pgshift;
+   pci_read_config_word(dev, pos + PCI_ATS_CAP, cap);
+   ats-qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+   PCI_ATS_MAX_QDEP;
+   dev-ats = ats;
+
+   return 0;
+}
+
+static void ats_free_one(struct pci_dev *dev)
+{
+   kfree(dev-ats);
+   dev-ats = NULL;
+}
+
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @pgshift: the IOMMU page shift
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_ats(struct pci_dev *dev, int pgshift)
+{
+   int rc;
+   u16 ctrl;
+
+   BUG_ON(dev-ats);
+
+   if (pgshift  PCI_ATS_MIN_STU)
+   return -EINVAL;
+
+   rc = ats_alloc_one(dev, pgshift);
+   if (rc)
+   return rc;
+
+   ctrl = PCI_ATS_CTRL_ENABLE;
+   ctrl |= PCI_ATS_CTRL_STU(pgshift - PCI_ATS_MIN_STU);
+   pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
+
+   return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+   u16 ctrl;
+
+   BUG_ON(!dev-ats);
+
+   pci_read_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
+   ctrl = ~PCI_ATS_CTRL_ENABLE;
+   pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
+
+   ats_free_one(dev);
+}
+
+/**
+ * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or negative on failure.
+ *
+ * The ATS spec uses 0 in the Invalidate Queue Depth field to
+ * indicate that the function can accept 32 Invalidate Request.
+ * But here we use the `real' values (i.e. 1~32) for the Queue
+ * Depth.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+   int pos;
+   u16 cap;
+
+   if (dev-ats)
+   return dev-ats-qdep;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_ATS_CAP, cap);
+
+   return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+  PCI_ATS_MAX_QDEP;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index dd7c63f..9f0db6a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -218,6 +218,13 @@ struct pci_sriov {
u8 __iomem *mstate; /* VF Migration State Array */
 };
 
+/* Address Translation Service */
+struct pci_ats {
+   int pos;/* capability position */
+   int stu;/* Smallest Translation Unit */
+   int qdep;   /* Invalidate Queue Depth */
+};
+
 #ifdef CONFIG_PCI_IOV
 extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
@@ -225,6 +232,20 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int 
resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
 extern int pci_iov_bus_range(struct pci_bus *bus);
+
+extern int pci_enable_ats(struct pci_dev *dev, int pgshift);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+   return !!dev-ats;
+}
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -246,6 +267,22 @@ static inline int pci_iov_bus_range(struct pci_bus *bus

[PATCH v4 3/6] VT-d: parse ATSR in DMA Remapping Reporting Structure

2009-03-23 Thread Yu Zhao
Parse the Root Port ATS Capability Reporting Structure in the DMA
Remapping Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |  112 --
 include/linux/dmar.h|9 
 include/linux/intel-iommu.h |1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index 26c536b..106bc45 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -254,6 +254,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
}
return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+   atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+   if (!atsru)
+   return -ENOMEM;
+
+   atsru-hdr = hdr;
+   atsru-include_all = atsr-flags  0x1;
+
+   list_add(atsru-list, dmar_atsr_units);
+
+   return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+   int rc;
+   struct acpi_dmar_atsr *atsr;
+
+   if (atsru-include_all)
+   return 0;
+
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   rc = dmar_parse_dev_scope((void *)(atsr + 1),
+   (void *)atsr + atsr-header.length,
+   atsru-devices_cnt, atsru-devices,
+   atsr-segment);
+   if (rc || !atsru-devices_cnt) {
+   list_del(atsru-list);
+   kfree(atsru);
+   }
+
+   return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+   int i;
+   struct pci_bus *bus;
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   list_for_each_entry(atsru, dmar_atsr_units, list) {
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   if (atsr-segment == pci_domain_nr(dev-bus))
+   goto found;
+   }
+
+   return 0;
+
+found:
+   for (bus = dev-bus; bus; bus = bus-parent) {
+   struct pci_dev *bridge = bus-self;
+
+   if (!bridge || !bridge-is_pcie ||
+   bridge-pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+   return 0;
+
+   if (bridge-pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   for (i = 0; i  atsru-devices_cnt; i++)
+   if (atsru-devices[i] == bridge)
+   return 1;
+   break;
+   }
+   }
+
+   if (atsru-include_all)
+   return 1;
+
+   return 0;
+}
 #endif
 
 static void __init
@@ -261,22 +339,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
 {
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_reserved_memory *rmrr;
+   struct acpi_dmar_atsr *atsr;
 
switch (header-type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-   drhd = (struct acpi_dmar_hardware_unit *)header;
+   drhd = container_of(header, struct acpi_dmar_hardware_unit,
+   header);
printk (KERN_INFO PREFIX
-   DRHD (flags: 0x%08x)base: 0x%016Lx\n,
-   drhd-flags, (unsigned long long)drhd-address);
+   DRHD base: %#016Lx flags: %#x\n,
+   (unsigned long long)drhd-address, drhd-flags);
break;
case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-   rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+   rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+   header);
printk (KERN_INFO PREFIX
-   RMRR base: 0x%016Lx end: 0x%016Lx\n,
+   RMRR base: %#016Lx end: %#016Lx\n,
(unsigned long long)rmrr-base_address,
(unsigned long long)rmrr-end_address);
break;
+   case ACPI_DMAR_TYPE_ATSR:
+   atsr = container_of(header, struct acpi_dmar_atsr, header);
+   printk(KERN_INFO PREFIX ATSR flags: %#x\n, atsr-flags);
+   break;
}
 }
 
@@ -349,6 +433,11 @@ parse_dmar_table(void)
ret = dmar_parse_one_rmrr(entry_header);
 #endif
break;
+   case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+   ret = dmar_parse_one_atsr(entry_header);
+#endif
+   break;
default:
printk(KERN_WARNING PREFIX
Unknown DMAR structure type\n);
@@ -417,11 +506,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
{
struct

[PATCH v4 2/6] PCI: handle Virtual Function ATS enabling

2009-03-23 Thread Yu Zhao
The SR-IOV spec requires the Smallest Translation Unit and the
Invalidate Queue Depth fields in the Virtual Function ATS capability
to be hardwired to 0. If a function is a Virtual Function, check and
set its Physical Function's STU before enabling the ATS.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   66 +---
 drivers/pci/pci.h |4 ++-
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 8a9817c..0bf23fc 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -491,10 +491,10 @@ found:
 
if (pdev)
iov-dev = pci_dev_get(pdev);
-   else {
+   else
iov-dev = dev;
-   mutex_init(iov-lock);
-   }
+
+   mutex_init(iov-lock);
 
dev-sriov = iov;
dev-is_physfn = 1;
@@ -514,11 +514,11 @@ static void sriov_release(struct pci_dev *dev)
 {
BUG_ON(dev-sriov-nr_virtfn);
 
-   if (dev == dev-sriov-dev)
-   mutex_destroy(dev-sriov-lock);
-   else
+   if (dev != dev-sriov-dev)
pci_dev_put(dev-sriov-dev);
 
+   mutex_destroy(dev-sriov-lock);
+
kfree(dev-sriov);
dev-sriov = NULL;
 }
@@ -722,19 +722,40 @@ int pci_enable_ats(struct pci_dev *dev, int pgshift)
int rc;
u16 ctrl;
 
-   BUG_ON(dev-ats);
+   BUG_ON(dev-ats  dev-ats-is_enabled);
 
if (pgshift  PCI_ATS_MIN_STU)
return -EINVAL;
 
-   rc = ats_alloc_one(dev, pgshift);
-   if (rc)
-   return rc;
+   if (dev-is_physfn || dev-is_virtfn) {
+   struct pci_dev *pdev = dev-is_physfn ? dev : dev-physfn;
+
+   mutex_lock(pdev-sriov-lock);
+   if (pdev-ats)
+   rc = pdev-ats-stu == pgshift ? 0 : -EINVAL;
+   else
+   rc = ats_alloc_one(pdev, pgshift);
+
+   if (!rc)
+   pdev-ats-ref_cnt++;
+   mutex_unlock(pdev-sriov-lock);
+   if (rc)
+   return rc;
+   }
+
+   if (!dev-is_physfn) {
+   rc = ats_alloc_one(dev, pgshift);
+   if (rc)
+   return rc;
+   }
 
ctrl = PCI_ATS_CTRL_ENABLE;
-   ctrl |= PCI_ATS_CTRL_STU(pgshift - PCI_ATS_MIN_STU);
+   if (!dev-is_virtfn)
+   ctrl |= PCI_ATS_CTRL_STU(pgshift - PCI_ATS_MIN_STU);
pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
 
+   dev-ats-is_enabled = 1;
+
return 0;
 }
 
@@ -746,13 +767,26 @@ void pci_disable_ats(struct pci_dev *dev)
 {
u16 ctrl;
 
-   BUG_ON(!dev-ats);
+   BUG_ON(!dev-ats || !dev-ats-is_enabled);
 
pci_read_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
ctrl = ~PCI_ATS_CTRL_ENABLE;
pci_write_config_word(dev, dev-ats-pos + PCI_ATS_CTRL, ctrl);
 
-   ats_free_one(dev);
+   dev-ats-is_enabled = 0;
+
+   if (dev-is_physfn || dev-is_virtfn) {
+   struct pci_dev *pdev = dev-is_physfn ? dev : dev-physfn;
+
+   mutex_lock(pdev-sriov-lock);
+   pdev-ats-ref_cnt--;
+   if (!pdev-ats-ref_cnt)
+   ats_free_one(pdev);
+   mutex_unlock(pdev-sriov-lock);
+   }
+
+   if (!dev-is_physfn)
+   ats_free_one(dev);
 }
 
 /**
@@ -764,13 +798,17 @@ void pci_disable_ats(struct pci_dev *dev)
  * The ATS spec uses 0 in the Invalidate Queue Depth field to
  * indicate that the function can accept 32 Invalidate Request.
  * But here we use the `real' values (i.e. 1~32) for the Queue
- * Depth.
+ * Depth; and 0 indicates the function shares the Queue with
+ * other functions (doesn't exclusively own a Queue).
  */
 int pci_ats_queue_depth(struct pci_dev *dev)
 {
int pos;
u16 cap;
 
+   if (dev-is_virtfn)
+   return 0;
+
if (dev-ats)
return dev-ats-qdep;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9f0db6a..8ecd185 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -223,6 +223,8 @@ struct pci_ats {
int pos;/* capability position */
int stu;/* Smallest Translation Unit */
int qdep;   /* Invalidate Queue Depth */
+   int ref_cnt;/* Physical Function reference count */
+   int is_enabled:1;   /* Enable bit is set */
 };
 
 #ifdef CONFIG_PCI_IOV
@@ -244,7 +246,7 @@ extern int pci_ats_queue_depth(struct pci_dev *dev);
  */
 static inline int pci_ats_enabled(struct pci_dev *dev)
 {
-   return !!dev-ats;
+   return dev-ats  dev-ats-is_enabled;
 }
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
-- 
1.5.6.4



[PATCH v4 6/6] VT-d: support the device IOTLB

2009-03-23 Thread Yu Zhao
Enable the device IOTLB (i.e. ATS) for both the bare metal and KVM
environments.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c   |   99 +-
 include/linux/intel-iommu.h |1 +
 2 files changed, 97 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 3145368..799bbe5 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -127,6 +127,7 @@ static inline void context_set_fault_enable(struct 
context_entry *context)
 }
 
 #define CONTEXT_TT_MULTI_LEVEL 0
+#define CONTEXT_TT_DEV_IOTLB   1
 
 static inline void context_set_translation_type(struct context_entry *context,
unsigned long value)
@@ -242,6 +243,7 @@ struct device_domain_info {
struct list_head global; /* link to global list */
u8 bus; /* PCI bus numer */
u8 devfn;   /* PCI devfn number */
+   struct intel_iommu *iommu; /* IOMMU used by this device */
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct dmar_domain *domain; /* pointer to domain */
 };
@@ -924,6 +926,80 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
return 0;
 }
 
+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+   int found = 0;
+   unsigned long flags;
+   struct device_domain_info *info;
+   struct intel_iommu *iommu = device_to_iommu(bus, devfn);
+
+   if (!ecap_dev_iotlb_support(iommu-ecap))
+   return NULL;
+
+   if (!iommu-qi)
+   return NULL;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link)
+   if (info-bus == bus  info-devfn == devfn) {
+   found = 1;
+   break;
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+
+   if (!found || !info-dev)
+   return NULL;
+
+   if (!pci_find_ext_capability(info-dev, PCI_EXT_CAP_ID_ATS))
+   return NULL;
+
+   if (!dmar_find_matched_atsr_unit(info-dev))
+   return NULL;
+
+   info-iommu = iommu;
+
+   return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+   if (!info)
+   return;
+
+   pci_enable_ats(info-dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+   if (!info-dev || !pci_ats_enabled(info-dev))
+   return;
+
+   pci_disable_ats(info-dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+ u64 addr, unsigned mask)
+{
+   int rc;
+   u16 sid, qdep;
+   unsigned long flags;
+   struct device_domain_info *info;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link) {
+   if (!info-dev || !pci_ats_enabled(info-dev))
+   continue;
+
+   sid = info-bus  8 | info-devfn;
+   qdep = pci_ats_queue_depth(info-dev);
+   rc = qi_flush_dev_iotlb(info-iommu, sid, qdep, addr, mask);
+   if (rc)
+   dev_err(info-dev-dev, flush IOTLB failed\n);
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -947,6 +1023,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu 
*iommu, u16 did,
rc = iommu-flush.flush_iotlb(iommu, did, addr, mask,
DMA_TLB_PSI_FLUSH,
non_present_entry_flush);
+   if (!rc  !non_present_entry_flush)
+   iommu_flush_dev_iotlb(iommu-domains[did], addr, mask);
+
return rc;
 }
 
@@ -1471,6 +1550,7 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
unsigned long ndomains;
int id;
int agaw;
+   struct device_domain_info *info;
 
pr_debug(Set context mapping for %02x:%02x.%d\n,
bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1536,7 +1616,9 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
context_set_domain_id(context, id);
context_set_address_width(context, iommu-agaw);
context_set_address_root(context, virt_to_phys(pgd));
-   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+   info = iommu_support_dev_iotlb(domain, bus, devfn);
+   context_set_translation_type(context,
+   info ? CONTEXT_TT_DEV_IOTLB : CONTEXT_TT_MULTI_LEVEL);
context_set_fault_enable(context);
context_set_present(context);
domain_flush_cache(domain, context

Re: [PATCH v3 0/6] ATS capability support for Intel IOMMU

2009-03-22 Thread Yu Zhao
On Fri, Mar 20, 2009 at 07:15:51PM +0800, David Woodhouse wrote:
 On Fri, 2009-03-20 at 10:47 +0800, Zhao, Yu wrote:
  If it's possible, I'd like it go through the PCI tree because the ATS 
  depends on the SR-IOV. This dependency is not reflected in this v3 
  series since the SR-IOV is not in-tree and I don't want to break the 
  build after people apply the ATS on their tree.
 
 In what way will it depend on SR-IOV?

The SR-IOV spec section 3.7.4 says that the Smallest Translation Unit and
the Invalidate Queue Depth fields in the Virtual Function's ATS capability
are hard-wired to 0. So we need some special handling when enabling the ATS
capability for the Virtual Function.

  Table 3-26: ATS Capability Register
  -------------+------------------------------------------+---------------+--------------
  Bit Location | PF and VF Register Differences From ATS  | PF Attributes | VF Attributes
  -------------+------------------------------------------+---------------+--------------
               | Smallest Translation Unit (STU)          |               |
         20:16 | Hardwired to 0 for VFs.                  |      ATS      |      RO
               | PF value applies to all VFs.             |               |
  -------------+------------------------------------------+---------------+--------------
               | Invalidate Queue Depth                   |               |
         28:24 | Hardwired to 0 for VFs.                  |      ATS      |      RO
               | Depth of shared PF input queue.          |               |
  -------------+------------------------------------------+---------------+--------------
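
To make the implication concrete, here is a condensed sketch (not the
actual patch, and it assumes the pci_ats structures are already
allocated) of the rule this table imposes on the enable path: a PF
programs the STU itself, while a VF can only verify that its PF's STU
matches and then set its own Enable bit:

	/* Sketch of the Table 3-26 rule; names follow the ATS/SR-IOV patches. */
	static int ats_program_stu(struct pci_dev *dev, int pgshift)
	{
		u16 ctrl = PCI_ATS_CTRL_ENABLE;

		if (dev->is_virtfn) {
			/* STU and Queue Depth are read-only (hardwired to 0)
			 * on a VF: the PF's values apply, so the PF's STU
			 * must already match the requested page shift. */
			if (dev->physfn->ats->stu != pgshift)
				return -EINVAL;
		} else {
			/* A PF (or a regular Endpoint) programs its own STU. */
			ctrl |= PCI_ATS_CTRL_STU(pgshift - PCI_ATS_MIN_STU);
		}

		pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
		return 0;
	}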

  So Dave, can I get an ack from you and let Jesse pull the IOMMU change
  to his tree? Or let this ATS go to 2.6.31?
 
 Want to show the latest version of the patches which depend on SR-IOV,
 and I can ack them?

Sure, thanks!


Re: [PATCH v11 1/8] PCI: initialize and release SR-IOV capability

2009-03-19 Thread Yu Zhao
On Fri, Mar 20, 2009 at 03:53:12AM +0800, Matthew Wilcox wrote:
 On Wed, Mar 11, 2009 at 03:25:42PM +0800, Yu Zhao wrote:
  +config PCI_IOV
  +   bool PCI IOV support
  +   depends on PCI
  +   help
  + PCI-SIG I/O Virtualization (IOV) Specifications support.
  + Single Root IOV: allows the creation of virtual PCI devices
  + that share the physical resources from a real device.
  +
  + When in doubt, say N.
 
 It's certainly shorter than my text, which is nice.  But I think it
 still has too much spec-ese and not enough explanation.  How about:
 
   help
 I/O Virtualization is a PCI feature supported by some devices
 which allows them to create virtual devices which share their
 physical resources.
 
 If unsure, say N.

Yes, it's more user-friendly.

  +   list_for_each_entry(pdev, dev-bus-devices, bus_list)
  +   if (pdev-is_physfn)
  +   break;
  +   if (list_empty(dev-bus-devices) || !pdev-is_physfn)
  +   pdev = NULL;
 
 This is still wrong.  If the 'break' condition is not hit, pdev is
 pointing to garbage, not to the last pci_dev in the list.

Yes, you are right. I should have thought it over after you commented
on it last time.

So it looks like we need to change it to:

	ctrl = 0;
	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
		if (pdev->is_physfn)
			goto found;

	pdev = NULL;
	if (pci_ari_enabled(dev->bus))
		ctrl |= PCI_SRIOV_CTRL_ARI;

found:
	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
	...

  @@ -270,6 +278,7 @@ struct pci_dev {
  struct list_head msi_list;
   #endif
  struct pci_vpd *vpd;
  +   struct pci_sriov *sriov;/* SR-IOV capability related */
 
 Should be ifdeffed?

Yes, will do.


Thank you for reviewing it. The patch series was applied to the Xen
Domain0 tree 2 days ago, and I'll carry your comments back to the Xen
tree too.


[PATCH v12 0/8] PCI: Linux kernel SR-IOV support

2009-03-19 Thread Yu Zhao
Greetings,

The following patches are intended to support the SR-IOV capability in
the Linux kernel. With these patches, people can turn a PCI device with
the capability into multiple ones from a software perspective, which
will benefit KVM and serve other purposes such as QoS, security, etc.

SR-IOV specification can be found at:
  
http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
(it requires membership.)

Devices that support SR-IOV are available from following vendors:
  http://download.intel.com/design/network/ProdBrf/320025.pdf
  http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
  http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf

The patches to enable the SR-IOV capability of Intel 82576 NIC are
available at (a.k.a Physical Function driver):
  http://patchwork.kernel.org/patch/8063/
  http://patchwork.kernel.org/patch/8064/
  http://patchwork.kernel.org/patch/8065/
  http://patchwork.kernel.org/patch/8066/
And the driver for Intel 82576 Virtual Function are available at:
  http://patchwork.kernel.org/patch/11029/
  http://patchwork.kernel.org/patch/11028/


Major changes from v11 to v12:
  1, fix using garbage entry pointer after the list_for_each (Matthew Wilcox)
  2, use #ifdef around SR-IOV structure in the pci_dev (Matthew Wilcox)
  3, enhance the Kconfig help text for the SR-IOV (Matthew Wilcox)

  v10 to v11:
  1, use pci_setup_device() to setup Virtual Function (Matthew Wilcox)
  2, various coding style fixes (Matthew Wilcox)
  3, wording and grammar fixes (Randy Dunlap)

  v9 - v10:
  1, minor fix in pci_restore_iov_state().
  2, respin against the latest tree.

  v8 - v9:
  1, put a might_sleep() into SR-IOV API which sleeps (Andi Kleen)
  2, block user config accesses before clearing VF Enable bit (Matthew Wilcox)

Yu Zhao (8):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: centralize device setup code
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |   27 ++
 Documentation/DocBook/kernel-api.tmpl   |1 +
 Documentation/PCI/pci-iov-howto.txt |   99 +
 drivers/pci/Kconfig |   10 +
 drivers/pci/Makefile|2 +
 drivers/pci/iov.c   |  680 +++
 drivers/pci/pci.c   |8 +
 drivers/pci/pci.h   |   53 +++
 drivers/pci/probe.c |   86 +++--
 include/linux/pci.h |   34 ++
 include/linux/pci_regs.h|   33 ++
 11 files changed, 994 insertions(+), 39 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c



[PATCH v12 1/8] PCI: initialize and release SR-IOV capability

2009-03-19 Thread Yu Zhao
If a device has the SR-IOV capability, initialize it (set the ARI
Capable Hierarchy in the lowest numbered PF if necessary; calculate
the System Page Size for the VF MMIO, probe the VF Offset, Stride
and BARs). A lock for the VF bus allocation is also initialized if
a PF is the lowest numbered PF.

Reviewed-by: Matthew Wilcox wi...@linux.intel.com
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/Kconfig  |   10 +++
 drivers/pci/Makefile |2 +
 drivers/pci/iov.c|  182 ++
 drivers/pci/pci.c|7 ++
 drivers/pci/pci.h|   37 +
 drivers/pci/probe.c  |4 +
 include/linux/pci.h  |   11 +++
 include/linux/pci_regs.h |   33 
 8 files changed, 286 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..fdc864f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,13 @@ config HT_IRQ
   This allows native hypertransport devices to use interrupts.
 
   If unsure say Y.
+
+config PCI_IOV
+   bool PCI IOV support
+   depends on PCI
+   help
+ I/O Virtualization is a PCI feature supported by some devices
+ which allows them to create virtual devices which share their
+ physical resources.
+
+ If unsure, say N.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba6af16 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,8 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
 
 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
 
+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 000..66cc414
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,182 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include linux/pci.h
+#include linux/mutex.h
+#include linux/string.h
+#include linux/delay.h
+#include pci.h
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+   int i;
+   int rc;
+   int nres;
+   u32 pgsz;
+   u16 ctrl, total, offset, stride;
+   struct pci_sriov *iov;
+   struct resource *res;
+   struct pci_dev *pdev;
+
+   if (dev-pcie_type != PCI_EXP_TYPE_RC_END 
+   dev-pcie_type != PCI_EXP_TYPE_ENDPOINT)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   if (ctrl  PCI_SRIOV_CTRL_VFE) {
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+   ssleep(1);
+   }
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, total);
+   if (!total)
+   return 0;
+
+   ctrl = 0;
+   list_for_each_entry(pdev, dev-bus-devices, bus_list)
+   if (pdev-is_physfn)
+   goto found;
+
+   pdev = NULL;
+   if (pci_ari_enabled(dev-bus))
+   ctrl |= PCI_SRIOV_CTRL_ARI;
+
+found:
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, offset);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, stride);
+   if (!offset || (total  1  !stride))
+   return -EIO;
+
+   pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, pgsz);
+   i = PAGE_SHIFT  12 ? PAGE_SHIFT - 12 : 0;
+   pgsz = ~((1  i) - 1);
+   if (!pgsz)
+   return -EIO;
+
+   pgsz = ~(pgsz - 1);
+   pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+   nres = 0;
+   for (i = 0; i  PCI_SRIOV_NUM_BARS; i++) {
+   res = dev-resource + PCI_IOV_RESOURCES + i;
+   i += __pci_read_base(dev, pci_bar_unknown, res,
+pos + PCI_SRIOV_BAR + i * 4);
+   if (!res-flags)
+   continue;
+   if (resource_size(res)  (PAGE_SIZE - 1)) {
+   rc = -EIO;
+   goto failed;
+   }
+   res-end = res-start + resource_size(res) * total - 1;
+   nres++;
+   }
+
+   iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+   if (!iov) {
+   rc = -ENOMEM;
+   goto failed;
+   }
+
+   iov-pos = pos;
+   iov-nres = nres;
+   iov-ctrl = ctrl;
+   iov-total = total;
+   iov-offset = offset;
+   iov-stride = stride;
+   iov-pgsz = pgsz;
+   iov-self = dev;
+   pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, iov-cap);
+   pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, iov-link);
+
+   if (pdev)
+   iov-dev = pci_dev_get(pdev);
+   else

[PATCH v12 8/8] PCI: manual for SR-IOV user and driver developer

2009-03-19 Thread Yu Zhao
Reviewed-by: Randy Dunlap rdun...@xenotime.net
Reviewed-by: Matthew Wilcox wi...@linux.intel.com
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/DocBook/kernel-api.tmpl |1 +
 Documentation/PCI/pci-iov-howto.txt   |   99 +
 2 files changed, 100 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl 
b/Documentation/DocBook/kernel-api.tmpl
index bc962cd..58c1945 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -199,6 +199,7 @@ X!Edrivers/pci/hotplug.c
 --
 !Edrivers/pci/probe.c
 !Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
  /sect1
  sect1titlePCI Hotplug Support Library/title
 !Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt 
b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 000..fc73ef5
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,99 @@
+   PCI Express I/O Virtualization Howto
+   Copyright (C) 2009 Intel Corporation
+   Yu Zhao yu.z...@intel.com
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of the VF can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as a traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). Each VF also has PCI
+Memory Space, which is used to map its register set. The VF device
+driver operates on the register set so the VF can be functional and
+appear as a real PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) will control the enabling and disabling
+of the capability via the API provided by the SR-IOV core. If the
+hardware has the SR-IOV capability, loading its PF driver will enable
+it and all VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+VFs are treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. A VF
+requires a device driver, just like a normal PCI device.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+   int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+   'nr_virtfn' is number of VFs to be enabled.
+
+To disable SR-IOV capability:
+   void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+   irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+Following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id 
*id)
+{
+   pci_enable_sriov(dev, NR_VIRTFN);
+
+   ...
+
+   return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+   pci_disable_sriov(dev);
+
+   ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+   ...
+
+   return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+   ...
+
+   return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+   ...
+}
+
+static struct pci_driver dev_driver = {
+   .name = SR-IOV Physical Function driver,
+   .id_table = dev_id_table,
+   .probe =dev_probe,
+   .remove =   __devexit_p(dev_remove),
+   .suspend =  dev_suspend,
+   .resume =   dev_resume,
+   .shutdown = dev_shutdown,
+};
-- 
1.5.6.4



Re: [PATCH v11 0/8] PCI: Linux kernel SR-IOV support

2009-03-16 Thread Yu Zhao
Hi Matthew,

Can you please take a look at this new version? I'd like to make sure
that all concerns are addressed and I didn't miss something :-)

Thanks,
Yu

On Wed, Mar 11, 2009 at 03:25:41PM +0800, Yu Zhao wrote:
 Greetings,
 
 The following patches are intended to support the SR-IOV capability in
 the Linux kernel. With these patches, people can turn a PCI device with
 the capability into multiple ones from a software perspective, which
 will benefit KVM and serve other purposes such as QoS, security, etc.
 
 SR-IOV specification can be found at:
   
 http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
 (it requires membership.)
 
 Devices that support SR-IOV are available from following vendors:
   http://download.intel.com/design/network/ProdBrf/320025.pdf
   http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
   http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf
 
 The patches to enable the SR-IOV capability of Intel 82576 NIC are
 available at (a.k.a Physical Function driver):
   http://patchwork.kernel.org/patch/8063/
   http://patchwork.kernel.org/patch/8064/
   http://patchwork.kernel.org/patch/8065/
   http://patchwork.kernel.org/patch/8066/
 And the driver for Intel 82576 Virtual Function are available at:
   http://patchwork.kernel.org/patch/11029/
   http://patchwork.kernel.org/patch/11028/
 
 
 Major changes from v10 to v11:
   1, use pci_setup_device() to setup Virtual Function (Matthew Wilcox)
   2, various coding style fixes (Matthew Wilcox)
   3, wording and grammar fixes (Randy Dunlap)
 
   v9 - v10:
   1, minor fix in pci_restore_iov_state().
   2, respin against the latest tree.
 
   v8 - v9:
   1, put a might_sleep() into SR-IOV API which sleeps (Andi Kleen)
   2, block user config accesses before clearing VF Enable bit (Matthew Wilcox)
 
 
 Yu Zhao (8):
   PCI: initialize and release SR-IOV capability
   PCI: restore saved SR-IOV state
   PCI: reserve bus range for SR-IOV device
   PCI: centralize device setup code into pci_setup_device()
   PCI: add SR-IOV API for Physical Function driver
   PCI: handle SR-IOV Virtual Function Migration
   PCI: document SR-IOV sysfs entries
   PCI: manual for SR-IOV user and driver developer
 
  Documentation/ABI/testing/sysfs-bus-pci |   27 ++
  Documentation/DocBook/kernel-api.tmpl   |1 +
  Documentation/PCI/pci-iov-howto.txt |   99 +
  drivers/pci/Kconfig |   10 +
  drivers/pci/Makefile|2 +
  drivers/pci/iov.c   |  677 
 +++
  drivers/pci/pci.c   |8 +
  drivers/pci/pci.h   |   53 +++
  drivers/pci/probe.c |   86 +++--
  include/linux/pci.h |   32 ++
  include/linux/pci_regs.h|   33 ++
  11 files changed, 989 insertions(+), 39 deletions(-)
  create mode 100644 Documentation/PCI/pci-iov-howto.txt
  create mode 100644 drivers/pci/iov.c


[PATCH v11 1/8] PCI: initialize and release SR-IOV capability

2009-03-11 Thread Yu Zhao
If a device has the SR-IOV capability, initialize it (set the ARI
Capable Hierarchy in the lowest numbered PF if necessary; calculate
the System Page Size for the VF MMIO, probe the VF Offset, Stride
and BARs). A lock for the VF bus allocation is also initialized if
a PF is the lowest numbered PF.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/Kconfig  |   10 +++
 drivers/pci/Makefile |2 +
 drivers/pci/iov.c|  182 ++
 drivers/pci/pci.c|7 ++
 drivers/pci/pci.h|   37 +
 drivers/pci/probe.c  |4 +
 include/linux/pci.h  |9 ++
 include/linux/pci_regs.h |   33 
 8 files changed, 284 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..25cf360 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,13 @@ config HT_IRQ
   This allows native hypertransport devices to use interrupts.
 
   If unsure say Y.
+
+config PCI_IOV
+   bool PCI IOV support
+   depends on PCI
+   help
+ PCI-SIG I/O Virtualization (IOV) Specifications support.
+ Single Root IOV: allows the creation of virtual PCI devices
+ that share the physical resources from a real device.
+
+ When in doubt, say N.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba6af16 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,8 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
 
 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
 
+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 000..656216c
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,182 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include linux/pci.h
+#include linux/mutex.h
+#include linux/string.h
+#include linux/delay.h
+#include pci.h
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+   int i;
+   int rc;
+   int nres;
+   u32 pgsz;
+   u16 ctrl, total, offset, stride;
+   struct pci_sriov *iov;
+   struct resource *res;
+   struct pci_dev *pdev;
+
+   if (dev-pcie_type != PCI_EXP_TYPE_RC_END 
+   dev-pcie_type != PCI_EXP_TYPE_ENDPOINT)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   if (ctrl  PCI_SRIOV_CTRL_VFE) {
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+   ssleep(1);
+   }
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, total);
+   if (!total)
+   return 0;
+
+   list_for_each_entry(pdev, dev-bus-devices, bus_list)
+   if (pdev-is_physfn)
+   break;
+   if (list_empty(dev-bus-devices) || !pdev-is_physfn)
+   pdev = NULL;
+
+   ctrl = 0;
+   if (!pdev  pci_ari_enabled(dev-bus))
+   ctrl |= PCI_SRIOV_CTRL_ARI;
+
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, offset);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, stride);
+   if (!offset || (total  1  !stride))
+   return -EIO;
+
+   pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, pgsz);
+   i = PAGE_SHIFT  12 ? PAGE_SHIFT - 12 : 0;
+   pgsz = ~((1  i) - 1);
+   if (!pgsz)
+   return -EIO;
+
+   pgsz = ~(pgsz - 1);
+   pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+   nres = 0;
+   for (i = 0; i  PCI_SRIOV_NUM_BARS; i++) {
+   res = dev-resource + PCI_IOV_RESOURCES + i;
+   i += __pci_read_base(dev, pci_bar_unknown, res,
+pos + PCI_SRIOV_BAR + i * 4);
+   if (!res-flags)
+   continue;
+   if (resource_size(res)  (PAGE_SIZE - 1)) {
+   rc = -EIO;
+   goto failed;
+   }
+   res-end = res-start + resource_size(res) * total - 1;
+   nres++;
+   }
+
+   iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+   if (!iov) {
+   rc = -ENOMEM;
+   goto failed;
+   }
+
+   iov-pos = pos;
+   iov-nres = nres;
+   iov-ctrl = ctrl;
+   iov-total = total;
+   iov-offset = offset;
+   iov-stride = stride;
+   iov-pgsz = pgsz;
+   iov-self = dev;
+   pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, iov-cap);
+   pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, iov-link);
+
+   if (pdev)
+   iov-dev

[PATCH v11 2/8] PCI: restore saved SR-IOV state

2009-03-11 Thread Yu Zhao
Restore the volatile registers in the SR-IOV capability after the
D3-D0 transition.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   29 +
 drivers/pci/pci.c |1 +
 drivers/pci/pci.h |4 
 3 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 656216c..8df2246 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -129,6 +129,25 @@ static void sriov_release(struct pci_dev *dev)
dev-sriov = NULL;
 }
 
+static void sriov_restore_state(struct pci_dev *dev)
+{
+   int i;
+   u16 ctrl;
+   struct pci_sriov *iov = dev-sriov;
+
+   pci_read_config_word(dev, iov-pos + PCI_SRIOV_CTRL, ctrl);
+   if (ctrl  PCI_SRIOV_CTRL_VFE)
+   return;
+
+   for (i = PCI_IOV_RESOURCES; i = PCI_IOV_RESOURCE_END; i++)
+   pci_update_resource(dev, i);
+
+   pci_write_config_dword(dev, iov-pos + PCI_SRIOV_SYS_PGSIZE, iov-pgsz);
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+   if (iov-ctrl  PCI_SRIOV_CTRL_VFE)
+   msleep(100);
+}
+
 /**
  * pci_iov_init - initialize the IOV capability
  * @dev: the PCI device
@@ -180,3 +199,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
return dev-sriov-pos + PCI_SRIOV_BAR +
4 * (resno - PCI_IOV_RESOURCES);
 }
+
+/**
+ * pci_restore_iov_state - restore the state of the IOV capability
+ * @dev: the PCI device
+ */
+void pci_restore_iov_state(struct pci_dev *dev)
+{
+   if (dev-is_physfn)
+   sriov_restore_state(dev);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2eba2a5..8e21912 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev)
}
pci_restore_pcix_state(dev);
pci_restore_msi_state(dev);
+   pci_restore_iov_state(dev);
 
return 0;
 }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 196be5e..efd79a2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
+extern void pci_restore_iov_state(struct pci_dev *dev);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, 
int resno,
 {
return 0;
 }
+static inline void pci_restore_iov_state(struct pci_dev *dev)
+{
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
-- 
1.6.1



[PATCH v11 4/8] PCI: centralize device setup code

2009-03-11 Thread Yu Zhao
Move the device setup stuff into pci_setup_device() which will be used
to setup the Virtual Function later.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/pci.h   |1 +
 drivers/pci/probe.c |   79 ++-
 2 files changed, 41 insertions(+), 39 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 7abdef6..80ad848 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -178,6 +178,7 @@ enum pci_bar_type {
pci_bar_mem64,  /* A 64-bit memory BAR */
 };
 
+extern int pci_setup_device(struct pci_dev *dev);
 extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int reg);
 extern int pci_resource_bar(struct pci_dev *dev, int resno,
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4c8abd0..f4ca550 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -674,6 +674,19 @@ static void pci_read_irq(struct pci_dev *dev)
dev-irq = irq;
 }
 
+static void set_pcie_port_type(struct pci_dev *pdev)
+{
+   int pos;
+   u16 reg16;
+
+   pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
+   if (!pos)
+   return;
+   pdev-is_pcie = 1;
+   pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, reg16);
+   pdev-pcie_type = (reg16  PCI_EXP_FLAGS_TYPE)  4;
+}
+
 #define LEGACY_IO_RESOURCE (IORESOURCE_IO | IORESOURCE_PCI_FIXED)
 
 /**
@@ -683,12 +696,34 @@ static void pci_read_irq(struct pci_dev *dev)
  * Initialize the device structure with information about the device's 
  * vendor,class,memory and IO-space addresses,IRQ lines etc.
  * Called at initialisation of the PCI subsystem and by CardBus services.
- * Returns 0 on success and -1 if unknown type of device (not normal, bridge
- * or CardBus).
+ * Returns 0 on success and negative if unknown type of device (not normal,
+ * bridge or CardBus).
  */
-static int pci_setup_device(struct pci_dev * dev)
+int pci_setup_device(struct pci_dev *dev)
 {
u32 class;
+   u8 hdr_type;
+   struct pci_slot *slot;
+
+   if (pci_read_config_byte(dev, PCI_HEADER_TYPE, hdr_type))
+   return -EIO;
+
+   dev-sysdata = dev-bus-sysdata;
+   dev-dev.parent = dev-bus-bridge;
+   dev-dev.bus = pci_bus_type;
+   dev-hdr_type = hdr_type  0x7f;
+   dev-multifunction = !!(hdr_type  0x80);
+   dev-cfg_size = pci_cfg_space_size(dev);
+   dev-error_state = pci_channel_io_normal;
+   set_pcie_port_type(dev);
+
+   list_for_each_entry(slot, dev-bus-slots, list)
+   if (PCI_SLOT(dev-devfn) == slot-number)
+   dev-slot = slot;
+
+   /* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
+  set this higher, assuming the system even supports it.  */
+   dev-dma_mask = 0x;
 
dev_set_name(dev-dev, %04x:%02x:%02x.%d, pci_domain_nr(dev-bus),
 dev-bus-number, PCI_SLOT(dev-devfn),
@@ -708,7 +743,6 @@ static int pci_setup_device(struct pci_dev * dev)
 
/* Early fixups, before probing the BARs */
pci_fixup_device(pci_fixup_early, dev);
-   class = dev-class  8;
 
switch (dev-hdr_type) {/* header type */
case PCI_HEADER_TYPE_NORMAL:/* standard header */
@@ -770,7 +804,7 @@ static int pci_setup_device(struct pci_dev * dev)
default:/* unknown header */
dev_err(dev-dev, unknown header type %02x, 
ignoring device\n, dev-hdr_type);
-   return -1;
+   return -EIO;
 
bad:
dev_err(dev-dev, ignoring class %02x (doesn't match header 
@@ -804,19 +838,6 @@ static void pci_release_dev(struct device *dev)
kfree(pci_dev);
 }
 
-static void set_pcie_port_type(struct pci_dev *pdev)
-{
-   int pos;
-   u16 reg16;
-
-   pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-   if (!pos)
-   return;
-   pdev-is_pcie = 1;
-   pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, reg16);
-   pdev-pcie_type = (reg16  PCI_EXP_FLAGS_TYPE)  4;
-}
-
 /**
  * pci_cfg_space_size - get the configuration space size of the PCI device.
  * @dev: PCI device
@@ -892,9 +913,7 @@ EXPORT_SYMBOL(alloc_pci_dev);
 static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
 {
struct pci_dev *dev;
-   struct pci_slot *slot;
u32 l;
-   u8 hdr_type;
int delay = 1;
 
if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
@@ -921,34 +940,16 @@ static struct pci_dev *pci_scan_device(struct pci_bus 
*bus, int devfn)
}
}
 
-   if (pci_bus_read_config_byte(bus, devfn, PCI_HEADER_TYPE, hdr_type))
-   return NULL;
-
dev = alloc_pci_dev();
if (!dev)
return NULL;
 
dev-bus = bus;
-   dev-sysdata = bus-sysdata

[PATCH v11 3/8] PCI: reserve bus range for SR-IOV device

2009-03-11 Thread Yu Zhao
Reserve the bus number range used by the Virtual Function when
pcibios_assign_all_busses() returns true.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |   36 
 drivers/pci/pci.h   |5 +
 drivers/pci/probe.c |3 +++
 3 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 8df2246..fb8fab1 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -14,6 +14,18 @@
 #include pci.h
 
 
+static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+{
+   return dev-bus-number + ((dev-devfn + dev-sriov-offset +
+   dev-sriov-stride * id)  8);
+}
+
+static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+{
+   return (dev-devfn + dev-sriov-offset +
+   dev-sriov-stride * id)  0xff;
+}
+
 static int sriov_init(struct pci_dev *dev, int pos)
 {
int i;
@@ -209,3 +221,27 @@ void pci_restore_iov_state(struct pci_dev *dev)
if (dev-is_physfn)
sriov_restore_state(dev);
 }
+
+/**
+ * pci_iov_bus_range - find bus range used by Virtual Function
+ * @bus: the PCI bus
+ *
+ * Returns max number of buses (exclude current one) used by Virtual
+ * Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+   int max = 0;
+   u8 busnr;
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, bus-devices, bus_list) {
+   if (!dev-is_physfn)
+   continue;
+   busnr = virtfn_bus(dev, dev-sriov-total - 1);
+   if (busnr  max)
+   max = busnr;
+   }
+
+   return max ? max - bus-number : 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index efd79a2..7abdef6 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
+extern int pci_iov_bus_range(struct pci_bus *bus);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev 
*dev, int resno,
 static inline void pci_restore_iov_state(struct pci_dev *dev)
 {
 }
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+   return 0;
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 03b6f29..4c8abd0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus 
*bus)
for (devfn = 0; devfn  0x100; devfn += 8)
pci_scan_slot(bus, devfn);
 
+   /* Reserve buses for SR-IOV capability. */
+   max += pci_iov_bus_range(bus);
+
/*
 * After performing arch-dependent fixup of the bus, look behind
 * all PCI-to-PCI bridges on this bus.
-- 
1.6.1



[PATCH v11 6/8] PCI: handle SR-IOV Virtual Function Migration

2009-03-11 Thread Yu Zhao
Add or remove a Virtual Function after receiving a Migrate In or Out
Request.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  119 +++
 drivers/pci/pci.h   |4 ++
 include/linux/pci.h |6 +++
 3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0a3af12..213fb61 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -179,6 +179,97 @@ static void virtfn_remove(struct pci_dev *dev, int id, int 
reset)
pci_dev_put(dev);
 }
 
+static int sriov_migration(struct pci_dev *dev)
+{
+   u16 status;
+   struct pci_sriov *iov = dev->sriov;
+
+   if (!iov->nr_virtfn)
+   return 0;
+
+   if (!(iov->cap & PCI_SRIOV_CAP_VFM))
+   return 0;
+
+   pci_read_config_word(dev, iov->pos + PCI_SRIOV_STATUS, &status);
+   if (!(status & PCI_SRIOV_STATUS_VFM))
+   return 0;
+
+   schedule_work(&iov->mtask);
+
+   return 1;
+}
+
+static void sriov_migration_task(struct work_struct *work)
+{
+   int i;
+   u8 state;
+   u16 status;
+   struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask);
+
+   for (i = iov->initial; i < iov->nr_virtfn; i++) {
+   state = readb(iov->mstate + i);
+   if (state == PCI_SRIOV_VFM_MI) {
+   writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
+   state = readb(iov->mstate + i);
+   if (state == PCI_SRIOV_VFM_AV)
+   virtfn_add(iov->self, i, 1);
+   } else if (state == PCI_SRIOV_VFM_MO) {
+   virtfn_remove(iov->self, i, 1);
+   writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
+   state = readb(iov->mstate + i);
+   if (state == PCI_SRIOV_VFM_AV)
+   virtfn_add(iov->self, i, 0);
+   }
+   }
+
+   pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);
+   status &= ~PCI_SRIOV_STATUS_VFM;
+   pci_write_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, status);
+}
+
+static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn)
+{
+   int bir;
+   u32 table;
+   resource_size_t pa;
+   struct pci_sriov *iov = dev->sriov;
+
+   if (nr_virtfn <= iov->initial)
+   return 0;
+
+   pci_read_config_dword(dev, iov->pos + PCI_SRIOV_VFM, &table);
+   bir = PCI_SRIOV_VFM_BIR(table);
+   if (bir > PCI_STD_RESOURCE_END)
+   return -EIO;
+
+   table = PCI_SRIOV_VFM_OFFSET(table);
+   if (table + nr_virtfn > pci_resource_len(dev, bir))
+   return -EIO;
+
+   pa = pci_resource_start(dev, bir) + table;
+   iov->mstate = ioremap(pa, nr_virtfn);
+   if (!iov->mstate)
+   return -ENOMEM;
+
+   INIT_WORK(&iov->mtask, sriov_migration_task);
+
+   iov->ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR;
+   pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+
+   return 0;
+}
+
+static void sriov_disable_migration(struct pci_dev *dev)
+{
+   struct pci_sriov *iov = dev->sriov;
+
+   iov->ctrl &= ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR);
+   pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+
+   cancel_work_sync(&iov->mtask);
+   iounmap(iov->mstate);
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
int rc;
@@ -261,6 +352,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
goto failed;
}
 
+   if (iov->cap & PCI_SRIOV_CAP_VFM) {
+   rc = sriov_enable_migration(dev, nr_virtfn);
+   if (rc)
+   goto failed;
+   }
+
kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
iov->nr_virtfn = nr_virtfn;
 
@@ -290,6 +387,9 @@ static void sriov_disable(struct pci_dev *dev)
if (!iov->nr_virtfn)
return;
 
+   if (iov->cap & PCI_SRIOV_CAP_VFM)
+   sriov_disable_migration(dev);
+
for (i = 0; i < iov->nr_virtfn; i++)
virtfn_remove(dev, i, 0);
 
@@ -559,3 +659,22 @@ void pci_disable_sriov(struct pci_dev *dev)
sriov_disable(dev);
 }
 EXPORT_SYMBOL_GPL(pci_disable_sriov);
+
+/**
+ * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
+ * @dev: the PCI device
+ *
+ * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
+ *
+ * Physical Function driver is responsible to register IRQ handler using
+ * VF Migration Interrupt Message Number, and call this function when the
+ * interrupt is generated by the hardware.
+ */
+irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+   if (!dev->is_physfn)
+   return IRQ_NONE;
+
+   return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_migration);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1bdace3

[PATCH v11 7/8] PCI: document SR-IOV sysfs entries

2009-03-11 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/ABI/testing/sysfs-bus-pci |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index e638e15..36edf03 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -52,3 +52,30 @@ Description:
that some devices may have malformatted data.  If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.
+
+What:  /sys/bus/pci/devices/.../virtfnN
+Date:  March 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbolic link appears when hardware supports the SR-IOV
+   capability and the Physical Function driver has enabled it.
+   The symbolic link points to the PCI device sysfs entry of the
+   Virtual Function whose index is N (0...MaxVFs-1).
+
+What:  /sys/bus/pci/devices/.../dep_link
+Date:  March 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbolic link appears when hardware supports the SR-IOV
+   capability and the Physical Function driver has enabled it,
+   and this device has vendor specific dependencies with others.
+   The symbolic link points to the PCI device sysfs entry of
+   Physical Function this device depends on.
+
+What:  /sys/bus/pci/devices/.../physfn
+Date:  March 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbolic link appears when a device is a Virtual Function.
+   The symbolic link points to the PCI device sysfs entry of the
+   Physical Function this device associates with.
-- 
1.6.1



[PATCH v11 8/8] PCI: manual for SR-IOV user and driver developer

2009-03-11 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/DocBook/kernel-api.tmpl |1 +
 Documentation/PCI/pci-iov-howto.txt   |   99 +
 2 files changed, 100 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl 
b/Documentation/DocBook/kernel-api.tmpl
index bc962cd..58c1945 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -199,6 +199,7 @@ X!Edrivers/pci/hotplug.c
 -->
!Edrivers/pci/probe.c
!Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
 </sect1>
 <sect1><title>PCI Hotplug Support Library</title>
 !Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt 
b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 000..fc73ef5
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,99 @@
+   PCI Express I/O Virtualization Howto
+   Copyright (C) 2009 Intel Corporation
+   Yu Zhao yu.z...@intel.com
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of the VF can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as a traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). And each VF also has PCI
+Memory Space, which is used to map its register set. The VF device driver
+operates on the register set so the VF can be functional and appear as a
+real PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) will control the enabling and disabling
+of the capability via the API provided by the SR-IOV core. If the hardware
+has SR-IOV capability, loading its PF driver would enable it and all
+VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+The VFs are treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. A VF
+requires a device driver, just as a normal PCI device does.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+   int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+   'nr_virtfn' is number of VFs to be enabled.
+
+To disable SR-IOV capability:
+   void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+   irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id 
*id)
+{
+   pci_enable_sriov(dev, NR_VIRTFN);
+
+   ...
+
+   return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+   pci_disable_sriov(dev);
+
+   ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+   ...
+
+   return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+   ...
+
+   return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+   ...
+}
+
+static struct pci_driver dev_driver = {
+   .name = "SR-IOV Physical Function driver",
+   .id_table = dev_id_table,
+   .probe =dev_probe,
+   .remove =   __devexit_p(dev_remove),
+   .suspend =  dev_suspend,
+   .resume =   dev_resume,
+   .shutdown = dev_shutdown,
+};
-- 
1.6.1



[PATCH v11 5/8] PCI: add SR-IOV API for Physical Function driver

2009-03-11 Thread Yu Zhao
Add or remove the Virtual Function when the SR-IOV is enabled or
disabled by the device driver. This can happen anytime rather than
only at the device probe stage.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  314 +++
 drivers/pci/pci.h   |2 +
 include/linux/pci.h |   19 +++-
 3 files changed, 334 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index fb8fab1..0a3af12 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -13,6 +13,7 @@
 #include <linux/delay.h>
 #include "pci.h"
 
+#define VIRTFN_ID_LEN  16
 
 static inline u8 virtfn_bus(struct pci_dev *dev, int id)
 {
@@ -26,6 +27,284 @@ static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
dev->sriov->stride * id) & 0xff;
 }
 
+static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
+{
+   int rc;
+   struct pci_bus *child;
+
+   if (bus->number == busnr)
+   return bus;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   if (child)
+   return child;
+
+   child = pci_add_new_bus(bus, NULL, busnr);
+   if (!child)
+   return NULL;
+
+   child->subordinate = busnr;
+   child->dev.parent = bus->bridge;
+   rc = pci_bus_add_child(child);
+   if (rc) {
+   pci_remove_bus(child);
+   return NULL;
+   }
+
+   return child;
+}
+
+static void virtfn_remove_bus(struct pci_bus *bus, int busnr)
+{
+   struct pci_bus *child;
+
+   if (bus->number == busnr)
+   return;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   BUG_ON(!child);
+
+   if (list_empty(&child->devices))
+   pci_remove_bus(child);
+}
+
+static int virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+   int i;
+   int rc;
+   u64 size;
+   char buf[VIRTFN_ID_LEN];
+   struct pci_dev *virtfn;
+   struct resource *res;
+   struct pci_sriov *iov = dev->sriov;
+
+   virtfn = alloc_pci_dev();
+   if (!virtfn)
+   return -ENOMEM;
+
+   mutex_lock(&iov->dev->sriov->lock);
+   virtfn->bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+   if (!virtfn->bus) {
+   kfree(virtfn);
+   mutex_unlock(&iov->dev->sriov->lock);
+   return -ENOMEM;
+   }
+   virtfn->devfn = virtfn_devfn(dev, id);
+   virtfn->vendor = dev->vendor;
+   pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
+   pci_setup_device(virtfn);
+   virtfn->dev.parent = dev->dev.parent;
+
+   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+   res = dev->resource + PCI_IOV_RESOURCES + i;
+   if (!res->parent)
+   continue;
+   virtfn->resource[i].name = pci_name(virtfn);
+   virtfn->resource[i].flags = res->flags;
+   size = resource_size(res);
+   do_div(size, iov->total);
+   virtfn->resource[i].start = res->start + size * id;
+   virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
+   rc = request_resource(res, &virtfn->resource[i]);
+   BUG_ON(rc);
+   }
+
+   if (reset)
+   pci_execute_reset_function(virtfn);
+
+   pci_device_add(virtfn, virtfn->bus);
+   mutex_unlock(&iov->dev->sriov->lock);
+
+   virtfn->physfn = pci_dev_get(dev);
+   virtfn->is_virtfn = 1;
+
+   rc = pci_bus_add_device(virtfn);
+   if (rc)
+   goto failed1;
+   sprintf(buf, "virtfn%u", id);
+   rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
+   if (rc)
+   goto failed1;
+   rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
+   if (rc)
+   goto failed2;
+
+   kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
+
+   return 0;
+
+failed2:
+   sysfs_remove_link(&dev->dev.kobj, buf);
+failed1:
+   pci_dev_put(dev);
+   mutex_lock(&iov->dev->sriov->lock);
+   pci_remove_bus_device(virtfn);
+   virtfn_remove_bus(dev->bus, virtfn_bus(dev, id));
+   mutex_unlock(&iov->dev->sriov->lock);
+
+   return rc;
+}
+
+static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+{
+   char buf[VIRTFN_ID_LEN];
+   struct pci_bus *bus;
+   struct pci_dev *virtfn;
+   struct pci_sriov *iov = dev->sriov;
+
+   bus = pci_find_bus(pci_domain_nr(dev->bus), virtfn_bus(dev, id));
+   if (!bus)
+   return;
+
+   virtfn = pci_get_slot(bus, virtfn_devfn(dev, id));
+   if (!virtfn)
+   return;
+
+   pci_dev_put(virtfn);
+
+   if (reset) {
+   device_release_driver(&virtfn->dev);
+   pci_execute_reset_function(virtfn);
+   }
+
+   sprintf(buf, "virtfn%u", id);
+   sysfs_remove_link(&dev->dev.kobj, buf);
+   sysfs_remove_link(&virtfn->dev.kobj, "physfn");
+
+   mutex_lock(&iov->dev->sriov->lock

Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability

2009-03-09 Thread Yu Zhao
On Sat, Mar 07, 2009 at 04:08:10AM +0800, Matthew Wilcox wrote:
 On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
  +config PCI_IOV
  +   bool "PCI IOV support"
  +   depends on PCI
  +   select PCI_MSI
 
 My understanding is that having 'select' of a config symbol that the
 user can choose is bad.  I think we should probably make this 'depends
 on PCI_MSI'.
 
 PCI MSI can also be disabled at runtime (and Fedora do by default).
 Since SR-IOV really does require MSI, we need to put in a runtime check
 to see if pci_msi_enabled() is false.

Actually SR-IOV doesn't really depend on MSI (e.g. the hardware may not
implement interrupts at all), but in most cases SR-IOV needs MSI. The
selection is intended to make life easier. Anyway I'll remove it if people
want more flexibility (and the possibility of breaking the PF driver).

 We don't depend on PCIEPORTBUS (a horribly named symbol).  Should we?
 SR-IOV is only supported for PCI Express machines.  I'm not sure of the
 right answer here, but I thought I should raise the question.

I think we don't need PCIe port bus framework. My understanding is it's for
those capabilities that want to share resources of the PCIe capability.

  +   default n
 
 You don't need this -- the default default is n ;-)
 
  +   help
  + PCI-SIG I/O Virtualization (IOV) Specifications support.
  + Single Root IOV: allows the Physical Function driver to enable
  + the hardware capability, so the Virtual Function is accessible
  + via the PCI Configuration Space using its own Bus, Device and
  + Function Numbers. Each Virtual Function also has the PCI Memory
  + Space to map the device specific register set.
 
 I'm not convinced this is the most helpful we could be to the user who's
 configuring their own kernel.  How about something like this?  (Randy, I
 particularly look to you to make my prose less turgid).
 
   help
 IO Virtualisation is a PCI feature supported by some devices
 which allows you to create virtual PCI devices and assign them
 to guest OSes.  This option needs to be selected in the host
 or Dom0 kernel, but does not need to be selected in the guest
 or DomU kernel.  If you don't know whether your hardware supports
 it, you can check by using lspci to look for the SR-IOV capability.
 
 If you have no idea what any of that means, it is safe to
 answer 'N' here.
 
  diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
  index 3d07ce2..ba99282 100644
  --- a/drivers/pci/Makefile
  +++ b/drivers/pci/Makefile
  @@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
   
   obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
   
  +# PCI IOV support
  +obj-$(CONFIG_PCI_IOV) += iov.o
 
 I see you're following the gerneal style in this file, but the comments
 really add no value.  I should send a patch to take out the existing ones.
 
  +   list_for_each_entry(pdev, &dev->bus->devices, bus_list)
  +   if (pdev->sriov)
  +   break;
  +   if (list_empty(&dev->bus->devices) || !pdev->sriov)
  +   pdev = NULL;
  +   ctrl = 0;
  +   if (!pdev && pci_ari_enabled(dev->bus))
  +   ctrl |= PCI_SRIOV_CTRL_ARI;
  +
 
 I don't like this loop.  At the end of a list_for_each_entry() loop,
 pdev will not be pointing at a pci_device, it'll be pointing to some
 offset from dev->bus->devices.  So checking pdev->sriov at this point
 is really, really bad.  I would prefer to see something like this:
 
 ctrl = 0;
 list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
 if (pdev->sriov)
 goto ari_enabled;
 }
 
 if (pci_ari_enabled(dev->bus))
 ctrl = PCI_SRIOV_CTRL_ARI;
  ari_enabled:
 pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);

I guess I should put some comments here. What I want to do is to find the
lowest numbered PF (pdev) if it exists. It has ARI Capable Hierarchy bit,
as you have figured out, and it also keeps the VF bus lock. The lock is
for those VFs who belong to different PFs within a SR-IOV device and reside
on different bus (virtual) than PF's. When the PF driver enables/disables
the SR-IOV of a PF (this may happen anytime, not only at the driver probe
stage), the virtual VF bus will be allocated if it hasn't been allocated
yet. The lock guards the VF bus allocation between different PFs whose VFs
share the VF bus.

  +   if (pdev)
  +   iov->pdev = pci_dev_get(pdev);
  +   else {
  +   iov->pdev = dev;
  +   mutex_init(&iov->lock);
  +   }
 
 Now I'm confused.  Why don't we need to init the mutex if there's another
 device on the same bus which also has an iov capability?

Yes, that's what it means :-)

  +static void sriov_release(struct pci_dev *dev)
  +{
  +   if (dev == dev->sriov->pdev)
  +   mutex_destroy(&dev->sriov->lock);
  +   else
  +   pci_dev_put(dev->sriov->pdev);
  +
  +   kfree(dev->sriov

Re: [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device

2009-03-09 Thread Yu Zhao
On Sat, Mar 07, 2009 at 04:20:24AM +0800, Matthew Wilcox wrote:
 On Fri, Feb 20, 2009 at 02:54:44PM +0800, Yu Zhao wrote:
  +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 
  *devfn)
  +{
  +   u16 bdf;
  +
  +   bdf = (dev->bus->number << 8) + dev->devfn +
  + dev->sriov->offset + dev->sriov->stride * id;
  +   *busnr = bdf >> 8;
  +   *devfn = bdf & 0xff;
  +}
 
 I find the interface here a bit clunky -- a function returning void
 while having two OUT parameters.  How about this variation on the theme
 (viewers are encouraged to come up with their own preferred
 implementations and interfaces):
 
 static inline __pure u16 virtfn_bdf(struct pci_dev *dev, int id)
 {
   return (dev->bus->number << 8) + dev->devfn + dev->sriov->offset +
   dev->sriov->stride * id;
 }
 
 #define VIRT_BUS(dev, id) (virtfn_bdf(dev, id) >> 8)
 #define VIRT_DEVFN(dev, id)   (virtfn_bdf(dev, id) & 0xff)
 
 We rely on GCC to do CSE and not actually invoke virtfn_bdf more than
 once.

Yes, that's a good idea. Will replace that function with macros.


Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver

2009-03-09 Thread Yu Zhao
On Sat, Mar 07, 2009 at 04:37:18AM +0800, Matthew Wilcox wrote:
 On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote:
  +   virtfn->sysdata = dev->bus->sysdata;
  +   virtfn->dev.parent = dev->dev.parent;
  +   virtfn->dev.bus = dev->dev.bus;
  +   virtfn->devfn = devfn;
  +   virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL;
  +   virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
  +   virtfn->error_state = pci_channel_io_normal;
  +   virtfn->current_state = PCI_UNKNOWN;
  +   virtfn->is_pcie = 1;
  +   virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT;
  +   virtfn->dma_mask = 0xffffffff;
  +   virtfn->vendor = dev->vendor;
  +   virtfn->subsystem_vendor = dev->subsystem_vendor;
  +   virtfn->class = dev->class;
 
 There seems to be a certain amount of commonality between this and
 pci_scan_device().  Have you considered trying to make a common helper
 function, or does it not work out well?

It's doable. Will enhance the pci_setup_device and use it to setup the VF.

  +   pci_device_add(virtfn, virtfn-bus);
 
 Greg is probably going to ding you here for adding the device, then
 creating the symlinks.  I believe it's now best practice to create the
 symlinks first, so there's no window where userspace can get confused.

Yes, but unfortunately we can't create links before adding a device.
I double checked device_add(), there is no place for those links to be
created before it sends uevent. So for now, we have to trigger another
uevent for those links.

  +   mutex_unlock(iov-pdev-sriov-lock);
 
 I question the existance of this mutex now.  What's it protecting?
 
 Aren't we going to be implicitly protected by virtue of the Physical
 Function device driver being the only one calling this function, and the
 driver will be calling it from the -probe routine which is not called
 simultaneously for the same device.

The PF driver patches I listed before support dynamical enabling/disabling
of the SR-IOV through sysfs interface. So we have to protect the VF bus
allocation as I explained before.
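
(Purely as an illustration of that usage model, and not code from any of
these patches: a PF driver could expose such a run-time knob roughly like
the sketch below, where the attribute name and the error handling are
made up.)

/* Hypothetical PF-driver sysfs hook for run-time SR-IOV control. */
static ssize_t sriov_numvfs_store(struct device *dev,
				  struct device_attribute *attr,
				  const char *buf, size_t count)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	unsigned long nr_virtfn = simple_strtoul(buf, NULL, 0);
	int rc = 0;

	if (nr_virtfn)
		rc = pci_enable_sriov(pdev, nr_virtfn);	/* may run long after probe */
	else
		pci_disable_sriov(pdev);

	return rc ? rc : count;
}
static DEVICE_ATTR(sriov_numvfs, S_IWUSR, NULL, sriov_numvfs_store);

With something like that, enabling and disabling can indeed happen at any
time, which is why the VF bus allocation needs the lock.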

  +   virtfn->physfn = pci_dev_get(dev);
  +
  +   rc = pci_bus_add_device(virtfn);
  +   if (rc)
  +   goto failed1;
  +   sprintf(buf, "%d", id);
 
 %u, perhaps?  And maybe 'id' should always be unsigned?  Just a thought.

Yes, will replace %d to %u.

  +   rc = sysfs_create_link(&iov->dev.kobj, &virtfn->dev.kobj, buf);
  +   if (rc)
  +   goto failed1;
  +   rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
  +   if (rc)
  +   goto failed2;
 
 I'm glad to see these symlinks documented in later patches!
 
  +   nres = 0;
  +   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
  +   res = dev->resource + PCI_SRIOV_RESOURCES + i;
  +   if (!res->parent)
  +   continue;
  +   nres++;
  +   }
 
 Can't this be written more simply as:
 
   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
   res = dev->resource + PCI_SRIOV_RESOURCES + i;
   if (res->parent)
   nres++;
   }

Yes, will do

 ?
 
  +   if (nres != iov->nres) {
  +   dev_err(&dev->dev, "no enough MMIO for SR-IOV\n");
  +   return -ENOMEM;
  +   }
 
 Randy, can you help us out with better wording here?
 
  +   dev_err(&dev->dev, "no enough bus range for SR-IOV\n");
 
 and here.
 
  +   if (iov->link != dev->devfn) {
  +   rc = -ENODEV;
  +   list_for_each_entry(link, &dev->bus->devices, bus_list) {
  +   if (link->sriov && link->devfn == iov->link)
  +   rc = sysfs_create_link(&iov->dev.kobj,
  +   &link->dev.kobj, "dep_link");
 
 I skipped to the end and read patch 7/7 and I still don't understand
 what dep_link is for.  Can you explain please?  In particular, how is it
 different from physfn?

It's defined by spec as:

3.3.8. Function Dependency Link (12h)
The programming model for a Device may have vendor specific dependencies
between sets of Functions. The Function Dependency Link field is used to
describe these dependencies. This field describes dependencies between PFs.
VF dependencies are the same as the dependencies of their associated PFs.
If a PF is independent from other PFs of a Device, this field shall
contain its own Function Number. If a PF is dependent on other PFs of a
Device, this field shall contain the Function Number of the next PF in
the same Function Dependency List. The last PF in a Function Dependency
List shall contain the Function Number of the first PF in the Function
Dependency List. If PF p and PF q are in the same Function Dependency
List, then any SI that is assigned VF p,n shall also be assigned to VF q,n.
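
To make the circular-list semantics concrete, here is a small illustrative
sketch (not from any of these patches; the per-PF link[] array is a made-up
input) that collects the PFs sharing one Function Dependency List:

/*
 * Illustration only: link[fn] holds the Function Dependency Link value
 * of PF 'fn'.  Walk the circular list that starts at 'start' and record
 * its members; an independent PF links to itself, so it yields one entry.
 */
static int collect_dep_list(const u8 *link, u8 start, u8 *members, int max)
{
	int n = 0;
	u8 fn = start;

	do {
		if (n == max)
			break;
		members[n++] = fn;
		fn = link[fn];		/* next PF in the same list */
	} while (fn != start);

	return n;			/* number of PFs in the list */
}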

Thanks,
Yu


Re: [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration

2009-03-09 Thread Yu Zhao
On Sat, Mar 07, 2009 at 05:13:41AM +0800, Matthew Wilcox wrote:
 On Fri, Feb 20, 2009 at 02:54:46PM +0800, Yu Zhao wrote:
  +static int sriov_migration(struct pci_dev *dev)
  +{
  +   u16 status;
  +   struct pci_sriov *iov = dev->sriov;
  +
  +   if (!iov->nr_virtfn)
  +   return 0;
  +
  +   if (!(iov->cap & PCI_SRIOV_CAP_VFM))
  +   return 0;
  +
  +   pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);
 
 You passed in dev here, you don't need to use iov-self, right?

Will do.

  +   if (!(status & PCI_SRIOV_STATUS_VFM))
  +   return 0;
  +
  +   schedule_work(&iov->mtask);
  +
  +   return 1;
  +}
 
  +/**
  + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
  + * @dev: the PCI device
  + *
  + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
  + *
  + * Physical Function driver is responsible to register IRQ handler using
  + * VF Migration Interrupt Message Number, and call this function when the
  + * interrupt is generated by the hardware.
  + */
  +irqreturn_t pci_sriov_migration(struct pci_dev *dev)
  +{
  +   if (!dev->sriov)
  +   return IRQ_NONE;
  +
  +   return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
  +}
  +EXPORT_SYMBOL_GPL(pci_sriov_migration);
 
 OK, I think I get it -- you've basically written an interrupt handler
 for the driver to call from its interrupt handler.  Am I right in
 thinking that the reason the driver needs to do the interrupt handler
 here is because we don't currently have an interface that looks like:
 
 int pci_get_msix_interrupt(struct pci_dev *dev, unsigned vector);
 
 ?  If so, we should probably add it; I want it for my MSI-X rewrite
 anyway.

Right, we really need this function. But I guess we still have to keep the
handler in case the PF only has MSI, right?

Thanks,
Yu


Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver

2009-03-09 Thread Yu Zhao
Thanks a lot, Randy!

On Sat, Mar 07, 2009 at 05:48:33AM +0800, Randy Dunlap wrote:
 Matthew Wilcox wrote:
  On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote:
  
  +  if (nres != iov->nres) {
  +  dev_err(&dev->dev, "no enough MMIO for SR-IOV\n");
  +  return -ENOMEM;
  +  }
 
   not enough MMIO BARs for SR-IOV
   or
   not enough MMIO resources for SR-IOV
   or
   too few MMIO BARs for SR-IOV
 ?
 
  Randy, can you help us out with better wording here?
  
  +  dev_err(&dev->dev, "no enough bus range for SR-IOV\n");
  
  and here.
 
   SR-IOV: bus number too large
   or
   SR-IOV: bus number out of range
   or
   SR-IOV: cannot allocate valid bus number
 ?
 
  +  if (iov->link != dev->devfn) {
  +  rc = -ENODEV;
  +  list_for_each_entry(link, &dev->bus->devices, bus_list) {
  +  if (link->sriov && link->devfn == iov->link)
  +  rc = sysfs_create_link(&iov->dev.kobj,
  +  &link->dev.kobj, "dep_link");


Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support

2009-03-09 Thread Yu Zhao
On Sat, Mar 07, 2009 at 10:34:54AM +0800, Greg KH wrote:
 On Fri, Mar 06, 2009 at 12:44:11PM -0700, Matthew Wilcox wrote:
   Physical Function driver patches for Intel 82576 NIC are available:
 http://patchwork.kernel.org/patch/8063/
 http://patchwork.kernel.org/patch/8064/
 http://patchwork.kernel.org/patch/8065/
 http://patchwork.kernel.org/patch/8066/
  
  I need to review this driver; I haven't done that yet.  Has anyone else?
 
 The driver was rejected by the upstream developers, who said it would
 never be accepted.

Sorry I didn't make it clear. These Physical Function driver patches are
new ones that have been accepted by David Miller (net-next-2.6).
The old ones I sent last time are for demonstration purposes, and won't
be in any upstream trees.

Thanks,
Yu


Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability

2009-03-09 Thread Yu Zhao
On Sat, Mar 07, 2009 at 10:38:45AM +0800, Greg KH wrote:
 On Fri, Mar 06, 2009 at 01:08:10PM -0700, Matthew Wilcox wrote:
  On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
   + list_for_each_entry(pdev, &dev->bus->devices, bus_list)
   + if (pdev->sriov)
   + break;
   + if (list_empty(&dev->bus->devices) || !pdev->sriov)
   + pdev = NULL;
   + ctrl = 0;
   + if (!pdev && pci_ari_enabled(dev->bus))
   + ctrl |= PCI_SRIOV_CTRL_ARI;
   +
  
  I don't like this loop.  At the end of a list_for_each_entry() loop,
  pdev will not be pointing at a pci_device, it'll be pointing to some
  offset from dev->bus->devices.  So checking pdev->sriov at this point
  is really, really bad.  I would prefer to see something like this:
  
  ctrl = 0;
  list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
  if (pdev->sriov)
  goto ari_enabled;
  }
  
  if (pci_ari_enabled(dev->bus))
  ctrl = PCI_SRIOV_CTRL_ARI;
   ari_enabled:
  pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
 
 No, please use bus_for_each_dev() instead, or bus_find_device(), don't
 walk the bus list by hand.  I'm kind of surprised that even builds.  Hm,
 in looking at the 2.6.29-rc kernels, I notice it will not even build at
 all, you are now forced to use those functions, which is good.

The devices haven't been added at this time, so we can't use
bus_for_each_dev(). I guess that's why `bus->devices' exists, and
actually pci_bus_add_devices() walks the bus list the same way to retrieve
the devices and add them.

Thanks,
Yu


Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver

2009-03-09 Thread Yu Zhao
On Tue, Mar 10, 2009 at 03:39:01AM +0800, Greg KH wrote:
 On Mon, Mar 09, 2009 at 04:25:05PM +0800, Yu Zhao wrote:
+   pci_device_add(virtfn, virtfn->bus);
   
   Greg is probably going to ding you here for adding the device, then
   creating the symlinks.  I believe it's now best practice to create the
   symlinks first, so there's no window where userspace can get confused.
  
  Yes, but unfortunately we can't create links before adding a device.
  I double checked device_add(), there is no place for those links to be
  created before it sends uevent. So for now, we have to trigger another
  uevent for those links.
 
 What exactly are you trying to do with a symlink here that you need to
 do it this way?  I vaguely remember you mentioning this in the past, but
 I thought you had dropped the symlinks after our conversation about this
 very problem.

I'd like to create some symlinks to reflect the relationship between
Physical Function and its associated Virtual Functions. The Physical
Function is like a master device that controls the allocation of its
Virtual Functions and owns the device physical resource. The Virtual
Functions are like slave devices of the Physical Function. For example,
if 01:00.0 is a Physical Function and 02:00.0 is a Virtual Function
associated with 01:00.0, the symlinks (virtfnN and physfn) would
look like:

  $ ls -l /sys/bus/pci/devices/0000:01:00.0/
  ...
  ...  virtfn0 -> ../0000:02:00.0
  ...  virtfn1 -> ../0000:02:00.1
  ...  virtfn2 -> ../0000:02:00.2
  ...

  $ ls -l /sys/bus/pci/devices/0000:02:00.0/
  ...
  ... physfn -> ../0000:01:00.0
  ...

This is very useful for userspace applications, both KVM and Xen need
to know this kind of relationship so they can request the permission
from a Physical Function before using its associated Virtual Functions.
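
As a rough user-space illustration of how these links can be consumed (a
hypothetical example, not part of the patches; the BDF matches the listing
above), resolving a VF's parent PF is just a readlink() on its physfn entry:

/* Hypothetical example: find the PF behind a VF via the sysfs symlink. */
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>

int main(void)
{
	const char *vf = "/sys/bus/pci/devices/0000:02:00.0/physfn";
	char target[PATH_MAX];
	ssize_t len = readlink(vf, target, sizeof(target) - 1);

	if (len < 0) {
		perror("readlink");
		return 1;
	}
	target[len] = '\0';
	/* target looks like "../0000:01:00.0"; print the last component */
	printf("physfn -> %s\n", strrchr(target, '/') ?
	       strrchr(target, '/') + 1 : target);
	return 0;
}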

Thanks,
Yu


Re: [PATCH v3 0/6] ATS capability support for Intel IOMMU

2009-02-25 Thread Yu Zhao
On Sun, Feb 15, 2009 at 06:59:10AM +0800, Grant Grundler wrote:
 On Thu, Feb 12, 2009 at 08:50:32PM +0800, Yu Zhao wrote:
  This patch series implements Address Translation Service support for
  the Intel IOMMU. ATS makes the PCI Endpoint be able to request the
  DMA address translation from the IOMMU and cache the translation in
  the Endpoint, thus alleviate IOMMU pressure and improve the hardware
  performance in the I/O virtualization environment.
  
  
  Changelog: v2 - v3
1, throw error message if VT-d hardware detects invalid descriptor
   on Queued Invalidation interface (David Woodhouse)
2, avoid using pci_find_ext_capability every time when reading ATS
   Invalidate Queue Depth (Matthew Wilcox)
  Changelog: v1 - v2
added 'static' prefix to a local LIST_HEAD (Andrew Morton)
  
  
  Yu Zhao (6):
PCI: support the ATS capability
VT-d: parse ATSR in DMA Remapping Reporting Structure
VT-d: add queue invalidation fault status support
VT-d: add device IOTLB invalidation support
VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
VT-d: support the device IOTLB
  
   drivers/pci/dmar.c   |  230 
  ++
 
 Yu,
 Can you please add something to Documentation/PCI/pci.txt?
 New API I'm seeing are:
 +extern int pci_enable_ats(struct pci_dev *dev, int ps);
 +extern void pci_disable_ats(struct pci_dev *dev);
 +extern int pci_ats_queue_depth(struct pci_dev *dev);

Yes, I'll document these new API.

 Do these also need to be EXPORT_SYMBOL_GPL() as well?
 Or are drivers never expected to call the above?

PCI device drivers shouldn't use these APIs; only the IOMMU driver (which
can't be a module) would use them. Anyway it's a good idea to export them :-)

Thanks,
Yu


Re: [PATCH v3 6/6] VT-d: support the device IOTLB

2009-02-25 Thread Yu Zhao
On Sun, Feb 15, 2009 at 07:20:52AM +0800, Grant Grundler wrote:
 On Thu, Feb 12, 2009 at 08:50:38PM +0800, Yu Zhao wrote:
  Support device IOTLB (i.e. ATS) for both native and KVM environments.
  +
  +static void iommu_enable_dev_iotlb(struct device_domain_info *info)
  +{
  +   pci_enable_ats(info->dev, VTD_PAGE_SHIFT);
  +}
 
 Why is a static function defined that calls a global function?

There would be some extra steps to do before VT-d enables ATS in the
future, so this wrapper makes code expandable later.

 
  +
  +static void iommu_disable_dev_iotlb(struct device_domain_info *info)
  +{
  +   if (info->dev && pci_ats_enabled(info->dev))
  +   pci_disable_ats(info->dev);
  +}
 
 ditto. pci_disable_ats() should be able to handle the case when
 info->dev is NULL and will know if ATS is enabled.

The info->dev could be NULL only because the VT-d code makes it so. The AMD
and IBM IOMMUs may not have this requirement. If we make pci_disable_ats()
accept a NULL pci_dev, it would fail to catch some errors, like calling
pci_disable_ats() without having called pci_enable_ats() first.

 
 I think both of these functions can be dropped and just directly call 
 pci_*_ats().
 
  +
  +static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  + u64 addr, unsigned mask)
  +{
  +   int rc;
  +   u16 sid, qdep;
  +   unsigned long flags;
  +   struct device_domain_info *info;
  +
  +   spin_lock_irqsave(&device_domain_lock, flags);
  +   list_for_each_entry(info, &domain->devices, link) {
 
 Would it be possible to define a single domain for each PCI device?
 Or does domain represent an IOMMU?
 Sorry, I forgot...I'm sure someone has mentioned this the past.

A domain represents one translation mapping. For devices used by the host,
there is one domain per device. Devices assigned to a guest share one
domain.

 
 I want to point out list_for_each_entry() is effectively a nested loop.
 iommu_flush_dev_iotlb() will get called alot from flush_unmaps().
 Perhaps do the lookup once there and pass that as a parameter?
 I don't know if that is feasible. But if this is a very frequently
 used code path, every CPU cycle counts.

iommu_flush_dev_iotlb() is only used to flush the devices used in the
host, which means there is always one entry on the list.

 
 
  +   if (!info->dev || !pci_ats_enabled(info->dev))
  +   continue;
  +
  +   sid = info->bus << 8 | info->devfn;
  +   qdep = pci_ats_queue_depth(info->dev);
 
 re Matthew Wilcox's comment - looks like caching ats_queue_depth
 is appropriate.

Yes, it's cached as of v3.

  +   rc = qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
  +   if (rc)
  +   printk(KERN_ERR "IOMMU: flush device IOTLB failed\n");
 
 Can this be a dev_printk please?

Yes, will replace it with dev_err().

 Perhaps in general review the use of printk so when errors are reported,
 users will know which devices might be affected by the failure.
 If more than a few printk's should be converted to dev_printk(), I'd
 be happy if that were a seperate patch (submitted later).
 
 
  pr_debug("Set context mapping for %02x:%02x.%d\n",
  bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
  @@ -1534,7 +1608,11 @@ static int domain_context_mapping_one(struct 
  dmar_domain *domain,
  context_set_domain_id(context, id);
  context_set_address_width(context, iommu->agaw);
  context_set_address_root(context, virt_to_phys(pgd));
  -   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
  +   info = iommu_support_dev_iotlb(domain, bus, devfn);
  +   if (info)
  +   context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB);
  +   else
  +   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
 
 Would it be ok to rewrite this as:
 + context_set_translation_type(context,
 +  info ? CONTEXT_TT_DEV_IOTLB : CONTEXT_TT_MULTI_LEVEL);

Yes, this one looks better.

 
  context_set_fault_enable(context);
  context_set_present(context);
  domain_flush_cache(domain, context, sizeof(*context));
  @@ -1546,6 +1624,8 @@ static int domain_context_mapping_one(struct 
  dmar_domain *domain,
  iommu_flush_write_buffer(iommu);
  else
  iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_DSI_FLUSH, 0);
 
 Adding a blank line here would make this more readable.
 (AFAIK, not required by coding style, just my opinion.)

Yes, I prefer a blank line here too; somehow I missed it.

 
  +   if (info)
  +   iommu_enable_dev_iotlb(info);
 
 Could iommu_enable_dev_iotlb() (or pci_enable_ats()) check if info is NULL?
 Then this would just be a simple function call.
 
 And it would be consistent with usage of iommu_disable_dev_iotlb().

Yes, good idea.

Thanks a lot for reviewing it!
Yu

Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support

2009-02-24 Thread Yu Zhao
On Tue, Feb 24, 2009 at 06:47:38PM +0800, Avi Kivity wrote:
 Yu Zhao wrote:
  Greetings,
 
  Following patches are intended to support SR-IOV capability in the
  Linux kernel. With these patches, people can turn a PCI device with
  the capability into multiple ones from software perspective, which
  will benefit KVM and achieve other purposes such as QoS, security,
  and etc.

 
 Do those patches allow using a VF on the host (in other words, does the 
 kernel emulate config space accesses)?

Yes, if a VF's driver is loaded in the host, the VF works the same way
as a normal PCI device.


[PATCH v10 3/7] PCI: reserve bus range for SR-IOV device

2009-02-19 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |   34 ++
 drivers/pci/pci.h   |5 +
 drivers/pci/probe.c |3 +++
 3 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 3bca8f8..0b80437 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -14,6 +14,16 @@
 #include pci.h
 
 
+static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 
*devfn)
+{
+   u16 bdf;
+
+   bdf = (dev->bus->number << 8) + dev->devfn +
+ dev->sriov->offset + dev->sriov->stride * id;
+   *busnr = bdf >> 8;
+   *devfn = bdf & 0xff;
+}
+
 static int sriov_init(struct pci_dev *dev, int pos)
 {
int i;
@@ -208,3 +218,27 @@ void pci_restore_iov_state(struct pci_dev *dev)
if (dev->sriov)
sriov_restore_state(dev);
 }
+
+/**
+ * pci_iov_bus_range - find bus range used by Virtual Function
+ * @bus: the PCI bus
+ *
+ * Returns max number of buses (exclude current one) used by Virtual
+ * Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+   int max = 0;
+   u8 busnr, devfn;
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   if (!dev->sriov)
+   continue;
+   virtfn_bdf(dev, dev->sriov->total - 1, &busnr, &devfn);
+   if (busnr > max)
+   max = busnr;
+   }
+
+   return max ? max - bus->number : 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b24c9e2..2cf32f5 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
+extern int pci_iov_bus_range(struct pci_bus *bus);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev 
*dev, int resno,
 static inline void pci_restore_iov_state(struct pci_dev *dev)
 {
 }
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+   return 0;
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 03b6f29..4c8abd0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus 
*bus)
for (devfn = 0; devfn < 0x100; devfn += 8)
pci_scan_slot(bus, devfn);
 
+   /* Reserve buses for SR-IOV capability. */
+   max += pci_iov_bus_range(bus);
+
/*
 * After performing arch-dependent fixup of the bus, look behind
 * all PCI-to-PCI bridges on this bus.
-- 
1.6.1



[PATCH v10 0/7] PCI: Linux kernel SR-IOV support

2009-02-19 Thread Yu Zhao
Greetings,

Following patches are intended to support SR-IOV capability in the
Linux kernel. With these patches, people can turn a PCI device with
the capability into multiple ones from software perspective, which
will benefit KVM and achieve other purposes such as QoS, security,
and etc.

SR-IOV specification can be found at:
  
http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
(it requires membership.)

Devices that support SR-IOV are available from following vendors:
  http://download.intel.com/design/network/ProdBrf/320025.pdf
  http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
  http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf

Physical Function driver patches for Intel 82576 NIC are available:
  http://patchwork.kernel.org/patch/8063/
  http://patchwork.kernel.org/patch/8064/
  http://patchwork.kernel.org/patch/8065/
  http://patchwork.kernel.org/patch/8066/

Major changes from v9 to v10:
  1, minor fix in pci_restore_iov_state().
  2, respin against the latest tree.

Yu Zhao (7):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |   27 ++
 Documentation/DocBook/kernel-api.tmpl   |1 +
 Documentation/PCI/pci-iov-howto.txt |   99 +
 drivers/pci/Kconfig |   13 +
 drivers/pci/Makefile|3 +
 drivers/pci/iov.c   |  711 +++
 drivers/pci/pci.c   |8 +
 drivers/pci/pci.h   |   53 +++
 drivers/pci/probe.c |7 +
 include/linux/pci.h |   28 ++
 include/linux/pci_regs.h|   33 ++
 11 files changed, 983 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c



[PATCH v10 2/7] PCI: restore saved SR-IOV state

2009-02-19 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   29 +
 drivers/pci/pci.c |1 +
 drivers/pci/pci.h |4 
 3 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index e6736d4..3bca8f8 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -128,6 +128,25 @@ static void sriov_release(struct pci_dev *dev)
dev->sriov = NULL;
 }
 
+static void sriov_restore_state(struct pci_dev *dev)
+{
+   int i;
+   u16 ctrl;
+   struct pci_sriov *iov = dev->sriov;
+
+   pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &ctrl);
+   if (ctrl & PCI_SRIOV_CTRL_VFE)
+   return;
+
+   for (i = PCI_SRIOV_RESOURCES; i <= PCI_SRIOV_RESOURCE_END; i++)
+   pci_update_resource(dev, i);
+
+   pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
+   pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+   if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
+   msleep(100);
+}
+
 /**
  * pci_iov_init - initialize the IOV capability
  * @dev: the PCI device
@@ -179,3 +198,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
return dev->sriov->pos + PCI_SRIOV_BAR +
4 * (resno - PCI_SRIOV_RESOURCES);
 }
+
+/**
+ * pci_restore_iov_state - restore the state of the IOV capability
+ * @dev: the PCI device
+ */
+void pci_restore_iov_state(struct pci_dev *dev)
+{
if (dev->sriov)
+   sriov_restore_state(dev);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2eba2a5..8e21912 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev)
}
pci_restore_pcix_state(dev);
pci_restore_msi_state(dev);
+   pci_restore_iov_state(dev);
 
return 0;
 }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 451db74..b24c9e2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
+extern void pci_restore_iov_state(struct pci_dev *dev);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, 
int resno,
 {
return 0;
 }
+static inline void pci_restore_iov_state(struct pci_dev *dev)
+{
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
-- 
1.6.1



[PATCH v10 1/7] PCI: initialize and release SR-IOV capability

2009-02-19 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/Kconfig  |   13 
 drivers/pci/Makefile |3 +
 drivers/pci/iov.c|  181 ++
 drivers/pci/pci.c|7 ++
 drivers/pci/pci.h|   37 ++
 drivers/pci/probe.c  |4 +
 include/linux/pci.h  |8 ++
 include/linux/pci_regs.h |   33 +
 8 files changed, 286 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..e8ea3e8 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,16 @@ config HT_IRQ
   This allows native hypertransport devices to use interrupts.
 
   If unsure say Y.
+
+config PCI_IOV
+   bool "PCI IOV support"
+   depends on PCI
+   select PCI_MSI
+   default n
+   help
+ PCI-SIG I/O Virtualization (IOV) Specifications support.
+ Single Root IOV: allows the Physical Function driver to enable
+ the hardware capability, so the Virtual Function is accessible
+ via the PCI Configuration Space using its own Bus, Device and
+ Function Numbers. Each Virtual Function also has the PCI Memory
+ Space to map the device specific register set.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba99282 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
 
 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
 
+# PCI IOV support
+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 000..e6736d4
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,181 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include <linux/pci.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include "pci.h"
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+   int i;
+   int rc;
+   int nres;
+   u32 pgsz;
+   u16 ctrl, total, offset, stride;
+   struct pci_sriov *iov;
+   struct resource *res;
+   struct pci_dev *pdev;
+
+   if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+   dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
+   if (ctrl & PCI_SRIOV_CTRL_VFE) {
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+   ssleep(1);
+   }
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
+   if (!total)
+   return 0;
+
+   list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+   if (pdev->sriov)
+   break;
+   if (list_empty(&dev->bus->devices) || !pdev->sriov)
+   pdev = NULL;
+
+   ctrl = 0;
+   if (!pdev && pci_ari_enabled(dev->bus))
+   ctrl |= PCI_SRIOV_CTRL_ARI;
+
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
+   if (!offset || (total > 1 && !stride))
+   return -EIO;
+
+   pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
+   i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+   pgsz &= ~((1 << i) - 1);
+   if (!pgsz)
+   return -EIO;
+
+   pgsz &= ~(pgsz - 1);
+   pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+   nres = 0;
+   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+   res = dev->resource + PCI_SRIOV_RESOURCES + i;
+   i += __pci_read_base(dev, pci_bar_unknown, res,
+pos + PCI_SRIOV_BAR + i * 4);
+   if (!res->flags)
+   continue;
+   if (resource_size(res) & (PAGE_SIZE - 1)) {
+   rc = -EIO;
+   goto failed;
+   }
+   res->end = res->start + resource_size(res) * total - 1;
+   nres++;
+   }
+
+   iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+   if (!iov) {
+   rc = -ENOMEM;
+   goto failed;
+   }
+
+   iov->pos = pos;
+   iov->nres = nres;
+   iov->ctrl = ctrl;
+   iov->total = total;
+   iov->offset = offset;
+   iov->stride = stride;
+   iov->pgsz = pgsz;
+   iov->self = dev;
+   pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap);
+   pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link);
+
+   if (pdev)
+   iov->pdev = pci_dev_get(pdev);
+   else {
+   iov->pdev = dev

[PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver

2009-02-19 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  348 +++
 drivers/pci/pci.h   |3 +
 include/linux/pci.h |   14 ++
 3 files changed, 365 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0b80437..8096fc9 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -13,6 +13,8 @@
 #include linux/delay.h
 #include pci.h
 
+#define VIRTFN_ID_LEN  8
+
 
 static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 
*devfn)
 {
@@ -24,6 +26,319 @@ static inline void virtfn_bdf(struct pci_dev *dev, int id, 
u8 *busnr, u8 *devfn)
*devfn = bdf  0xff;
 }
 
+static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
+{
+   int rc;
+   struct pci_bus *child;
+
+   if (bus-number == busnr)
+   return bus;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   if (child)
+   return child;
+
+   child = pci_add_new_bus(bus, NULL, busnr);
+   if (!child)
+   return NULL;
+
+   child-subordinate = busnr;
+   child-dev.parent = bus-bridge;
+   rc = pci_bus_add_child(child);
+   if (rc) {
+   pci_remove_bus(child);
+   return NULL;
+   }
+
+   return child;
+}
+
+static void virtfn_remove_bus(struct pci_bus *bus, int busnr)
+{
+   struct pci_bus *child;
+
+   if (bus-number == busnr)
+   return;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   BUG_ON(!child);
+
+   if (list_empty(child-devices))
+   pci_remove_bus(child);
+}
+
+static int virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+   int i;
+   int rc;
+   u64 size;
+   u8 busnr, devfn;
+   char buf[VIRTFN_ID_LEN];
+   struct pci_dev *virtfn;
+   struct resource *res;
+   struct pci_sriov *iov = dev-sriov;
+
+   virtfn = alloc_pci_dev();
+   if (!virtfn)
+   return -ENOMEM;
+
+   virtfn_bdf(dev, id, busnr, devfn);
+   mutex_lock(iov-pdev-sriov-lock);
+   virtfn-bus = virtfn_add_bus(dev-bus, busnr);
+   if (!virtfn-bus) {
+   kfree(virtfn);
+   mutex_unlock(iov-pdev-sriov-lock);
+   return -ENOMEM;
+   }
+
+   virtfn->sysdata = dev->bus->sysdata;
+   virtfn->dev.parent = dev->dev.parent;
+   virtfn->dev.bus = dev->dev.bus;
+   virtfn->devfn = devfn;
+   virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL;
+   virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
+   virtfn->error_state = pci_channel_io_normal;
+   virtfn->current_state = PCI_UNKNOWN;
+   virtfn->is_pcie = 1;
+   virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT;
+   virtfn->dma_mask = 0xffffffff;
+   virtfn->vendor = dev->vendor;
+   virtfn->subsystem_vendor = dev->subsystem_vendor;
+   virtfn->class = dev->class;
+   pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
+   pci_read_config_byte(virtfn, PCI_REVISION_ID, &virtfn->revision);
+   pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
+&virtfn->subsystem_device);
+
+   dev_set_name(&virtfn->dev, "%04x:%02x:%02x.%d",
+pci_domain_nr(virtfn->bus), busnr,
+PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+   for (i = 0; i  PCI_SRIOV_NUM_BARS; i++) {
+   res = dev-resource + PCI_SRIOV_RESOURCES + i;
+   if (!res-parent)
+   continue;
+   virtfn-resource[i].name = pci_name(virtfn);
+   virtfn-resource[i].flags = res-flags;
+   size = resource_size(res);
+   do_div(size, iov-total);
+   virtfn-resource[i].start = res-start + size * id;
+   virtfn-resource[i].end = virtfn-resource[i].start + size - 1;
+   rc = request_resource(res, virtfn-resource[i]);
+   BUG_ON(rc);
+   }
+
+   if (reset)
+   pci_execute_reset_function(virtfn);
+
+   pci_device_add(virtfn, virtfn->bus);
+   mutex_unlock(&iov->pdev->sriov->lock);
+
+   virtfn->physfn = pci_dev_get(dev);
+
+   rc = pci_bus_add_device(virtfn);
+   if (rc)
+   goto failed1;
+   sprintf(buf, "%d", id);
+   rc = sysfs_create_link(&iov->dev.kobj, &virtfn->dev.kobj, buf);
+   if (rc)
+   goto failed1;
+   rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
+   if (rc)
+   goto failed2;
+
+   kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
+
+   return 0;
+
+failed2:
+   sysfs_remove_link(iov-dev.kobj, buf);
+failed1:
+   pci_dev_put(dev);
+   mutex_lock(iov-pdev-sriov-lock);
+   pci_remove_bus_device(virtfn);
+   virtfn_remove_bus(dev-bus, busnr);
+   mutex_unlock(iov-pdev-sriov-lock);
+
+   return rc;
+}
+
+static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+{
+   u8 busnr, devfn;
+   char buf[VIRTFN_ID_LEN];
+   struct

[PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration

2009-02-19 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  119 +++
 drivers/pci/pci.h   |4 ++
 include/linux/pci.h |6 +++
 3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 8096fc9..063fe74 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -206,6 +206,97 @@ static void sriov_release_dev(struct device *dev)
iov-nr_virtfn = 0;
 }
 
+static int sriov_migration(struct pci_dev *dev)
+{
+   u16 status;
+   struct pci_sriov *iov = dev-sriov;
+
+   if (!iov-nr_virtfn)
+   return 0;
+
+   if (!(iov-cap  PCI_SRIOV_CAP_VFM))
+   return 0;
+
+   pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status);
+   if (!(status  PCI_SRIOV_STATUS_VFM))
+   return 0;
+
+   schedule_work(iov-mtask);
+
+   return 1;
+}
+
+static void sriov_migration_task(struct work_struct *work)
+{
+   int i;
+   u8 state;
+   u16 status;
+   struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask);
+
+	for (i = iov->initial; i < iov->nr_virtfn; i++) {
+		state = readb(iov->mstate + i);
+		if (state == PCI_SRIOV_VFM_MI) {
+			writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
+			state = readb(iov->mstate + i);
+			if (state == PCI_SRIOV_VFM_AV)
+				virtfn_add(iov->self, i, 1);
+		} else if (state == PCI_SRIOV_VFM_MO) {
+			virtfn_remove(iov->self, i, 1);
+			writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
+			state = readb(iov->mstate + i);
+			if (state == PCI_SRIOV_VFM_AV)
+				virtfn_add(iov->self, i, 0);
+		}
+	}
+
+	pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);
+	status &= ~PCI_SRIOV_STATUS_VFM;
+	pci_write_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, status);
+}
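For readers new to VF Migration, the transitions handled above can be
summarized as follows (state names are from the SR-IOV spec's VF Migration
State Array; this summary is mine, not part of the patch):

/*
 * MI (Migrate In request):  acknowledge by writing AV; if the state
 *     stays at AV, create the VF and reset it (virtfn_add(dev, i, 1)).
 * MO (Migrate Out request): destroy the VF with reset and write UA; if
 *     the hardware then reports AV again, re-create the VF without a
 *     reset (virtfn_add(dev, i, 0)).
 */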
+
+static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn)
+{
+   int bir;
+   u32 table;
+   resource_size_t pa;
+   struct pci_sriov *iov = dev-sriov;
+
+   if (nr_virtfn = iov-initial)
+   return 0;
+
+   pci_read_config_dword(dev, iov-pos + PCI_SRIOV_VFM, table);
+   bir = PCI_SRIOV_VFM_BIR(table);
+   if (bir  PCI_STD_RESOURCE_END)
+   return -EIO;
+
+   table = PCI_SRIOV_VFM_OFFSET(table);
+   if (table + nr_virtfn  pci_resource_len(dev, bir))
+   return -EIO;
+
+   pa = pci_resource_start(dev, bir) + table;
+   iov-mstate = ioremap(pa, nr_virtfn);
+   if (!iov-mstate)
+   return -ENOMEM;
+
+   INIT_WORK(iov-mtask, sriov_migration_task);
+
+   iov-ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR;
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+
+   return 0;
+}
+
+static void sriov_disable_migration(struct pci_dev *dev)
+{
+   struct pci_sriov *iov = dev-sriov;
+
+   iov-ctrl = ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR);
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+
+   cancel_work_sync(iov-mtask);
+   iounmap(iov-mstate);
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
int rc;
@@ -294,6 +385,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
goto failed2;
}
 
+   if (iov-cap  PCI_SRIOV_CAP_VFM) {
+   rc = sriov_enable_migration(dev, nr_virtfn);
+   if (rc)
+   goto failed2;
+   }
+
kobject_uevent(dev-dev.kobj, KOBJ_CHANGE);
iov-nr_virtfn = nr_virtfn;
 
@@ -325,6 +422,9 @@ static void sriov_disable(struct pci_dev *dev)
if (!iov-nr_virtfn)
return;
 
+   if (iov-cap  PCI_SRIOV_CAP_VFM)
+   sriov_disable_migration(dev);
+
for (i = 0; i  iov-nr_virtfn; i++)
virtfn_remove(dev, i, 0);
 
@@ -590,3 +690,22 @@ void pci_disable_sriov(struct pci_dev *dev)
sriov_disable(dev);
 }
 EXPORT_SYMBOL_GPL(pci_disable_sriov);
+
+/**
+ * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
+ * @dev: the PCI device
+ *
+ * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
+ *
+ * The Physical Function driver is responsible for registering an IRQ
+ * handler using the VF Migration Interrupt Message Number, and for
+ * calling this function when the interrupt is generated by the hardware.
+ */
+irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+   if (!dev-sriov)
+   return IRQ_NONE;
+
+   return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_migration);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9bbf868..6764f02 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1,6 +1,8 @@
 #ifndef

[PATCH v10 6/7] PCI: document SR-IOV sysfs entries

2009-02-19 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/ABI/testing/sysfs-bus-pci |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index ceddcff..84dc100 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -9,3 +9,30 @@ Description:
that some devices may have malformatted data.  If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.
+
+What:  /sys/bus/pci/devices/.../virtfn/N
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+	This symbolic link appears when the hardware supports the SR-IOV
+	capability and the Physical Function driver has enabled it.
+	The link points to the PCI device sysfs entry of the
+	Virtual Function whose index is N (0...MaxVFs-1).
+
+What:  /sys/bus/pci/devices/.../virtfn/dep_link
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+	This symbolic link appears when the hardware supports the SR-IOV
+	capability, the Physical Function driver has enabled it,
+	and this device has vendor-specific dependencies on
+	other devices. The link points to the PCI device sysfs
+	entry of the Physical Function this device depends on.
+
+What:  /sys/bus/pci/devices/.../physfn
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+	This symbolic link appears when a device is a Virtual Function.
+	The link points to the PCI device sysfs entry of the
+	Physical Function this device is associated with.
-- 
1.6.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer

2009-02-19 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/DocBook/kernel-api.tmpl |1 +
 Documentation/PCI/pci-iov-howto.txt   |   99 +
 2 files changed, 100 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl 
b/Documentation/DocBook/kernel-api.tmpl
index 5818ff7..506e611 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c
 --
 !Edrivers/pci/probe.c
 !Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
  /sect1
  sect1titlePCI Hotplug Support Library/title
 !Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt 
b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 000..fc73ef5
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,99 @@
+   PCI Express I/O Virtualization Howto
+   Copyright (C) 2009 Intel Corporation
+   Yu Zhao yu.z...@intel.com
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of VFs can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as a traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). Each VF also has PCI
+Memory Space, which is used to map its register set. The VF device
+driver operates on this register set so the VF can be functional and
+appear as a real PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) controls the enabling and disabling
+of the capability via the API provided by the SR-IOV core. If the
+hardware has the SR-IOV capability, loading its PF driver enables it
+and all VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+VFs are treated as hot-plugged PCI devices in the kernel, so they
+should work in the same way as real PCI devices. A VF requires a
+device driver, just as a normal PCI device does.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+   int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+   'nr_virtfn' is number of VFs to be enabled.
+
+To disable SR-IOV capability:
+   void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+   irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+Following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id 
*id)
+{
+   pci_enable_sriov(dev, NR_VIRTFN);
+
+   ...
+
+   return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+   pci_disable_sriov(dev);
+
+   ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+   ...
+
+   return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+   ...
+
+   return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+   ...
+}
+
+static struct pci_driver dev_driver = {
+   .name = SR-IOV Physical Function driver,
+   .id_table = dev_id_table,
+   .probe =dev_probe,
+   .remove =   __devexit_p(dev_remove),
+   .suspend =  dev_suspend,
+   .resume =   dev_resume,
+   .shutdown = dev_shutdown,
+};
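Since section 3.2 does not show the Migration interrupt path, here is a
minimal, hypothetical sketch of how a PF driver could forward the VF
Migration interrupt to the SR-IOV core; the handler name and the way the
vector is obtained are assumptions for illustration, not part of this patch:

static irqreturn_t dev_vfm_handler(int irq, void *data)
{
	struct pci_dev *dev = data;	/* assumed to be the PF's pci_dev */

	/* pci_sriov_migration() returns IRQ_HANDLED or IRQ_NONE */
	return pci_sriov_migration(dev);
}

The IRQ itself would be requested with request_irq() against the MSI-X
vector identified by the VF Migration Interrupt Message Number in the
SR-IOV capability.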
-- 
1.6.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] VT-d: enable DMAR on 32-bit kernel

2009-02-17 Thread Yu Zhao
From: David Woodhouse dw...@infradead.org

If we fix a few highmem-related thinkos and a couple of printk format
warnings, the Intel IOMMU driver works fine in a 32-bit kernel.

--
Fixed the end-address round-up problem in dma_pte_clear_range().

Tested both 32-bit and 32-bit PAE modes on Intel X58 and Q35 platforms.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 arch/x86/Kconfig  |2 +-
 drivers/pci/intel-iommu.c |   24 +++-
 2 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9c39095..9e9ac5c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1794,7 +1794,7 @@ config PCI_MMCONFIG
 
 config DMAR
bool Support for DMA Remapping Devices (EXPERIMENTAL)
-   depends on X86_64  PCI_MSI  ACPI  EXPERIMENTAL
+   depends on PCI_MSI  ACPI  EXPERIMENTAL
help
  DMA remapping (DMAR) devices support enables independent address
  translations for Direct Memory Access (DMA) from devices.
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f4b7c79..03bc0e5 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -687,15 +687,17 @@ static void dma_pte_clear_one(struct dmar_domain *domain, 
u64 addr)
 static void dma_pte_clear_range(struct dmar_domain *domain, u64 start, u64 end)
 {
	int addr_width = agaw_to_width(domain->agaw);
+	int npages;
 
	start &= (((u64)1) << addr_width) - 1;
	end &= (((u64)1) << addr_width) - 1;
	/* in case it's partial page */
	start = PAGE_ALIGN(start);
	end &= PAGE_MASK;
+	npages = (end - start) / VTD_PAGE_SIZE;
 
	/* we don't need lock here, nobody else touches the iova range */
-	while (start < end) {
+	while (npages--) {
		dma_pte_clear_one(domain, start);
		start += VTD_PAGE_SIZE;
}
@@ -2277,7 +2279,7 @@ static dma_addr_t __intel_map_single(struct device 
*hwdev, phys_addr_t paddr,
 error:
if (iova)
__free_iova(domain-iovad, iova);
-   printk(KERN_ERRDevice %s request: %...@%llx dir %d --- failed\n,
+   printk(KERN_ERRDevice %s request: %...@%llx dir %d --- failed\n,
pci_name(pdev), size, (unsigned long long)paddr, dir);
return 0;
 }
@@ -2373,7 +2375,7 @@ void intel_unmap_single(struct device *dev, dma_addr_t 
dev_addr, size_t size,
start_addr = iova-pfn_lo  PAGE_SHIFT;
size = aligned_size((u64)dev_addr, size);
 
-   pr_debug(Device %s unmapping: %...@%llx\n,
+   pr_debug(Device %s unmapping: %...@%llx\n,
pci_name(pdev), size, (unsigned long long)start_addr);
 
/*  clear the whole page */
@@ -2431,8 +2433,6 @@ void intel_free_coherent(struct device *hwdev, size_t 
size, void *vaddr,
free_pages((unsigned long)vaddr, order);
 }
 
-#define SG_ENT_VIRT_ADDRESS(sg)(sg_virt((sg)))
-
 void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist,
int nelems, int dir)
 {
@@ -2442,7 +2442,7 @@ void intel_unmap_sg(struct device *hwdev, struct 
scatterlist *sglist,
unsigned long start_addr;
struct iova *iova;
size_t size = 0;
-   void *addr;
+   phys_addr_t addr;
struct scatterlist *sg;
struct intel_iommu *iommu;
 
@@ -2458,7 +2458,7 @@ void intel_unmap_sg(struct device *hwdev, struct 
scatterlist *sglist,
if (!iova)
return;
for_each_sg(sglist, sg, nelems, i) {
-   addr = SG_ENT_VIRT_ADDRESS(sg);
+   addr = page_to_phys(sg_page(sg)) + sg-offset;
size += aligned_size((u64)addr, sg-length);
}
 
@@ -2485,7 +2485,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 
for_each_sg(sglist, sg, nelems, i) {
BUG_ON(!sg_page(sg));
-   sg-dma_address = virt_to_bus(SG_ENT_VIRT_ADDRESS(sg));
+   sg-dma_address = page_to_phys(sg_page(sg)) + sg-offset;
sg-dma_length = sg-length;
}
return nelems;
@@ -2494,7 +2494,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int nelems,
 int dir)
 {
-   void *addr;
+   phys_addr_t addr;
int i;
struct pci_dev *pdev = to_pci_dev(hwdev);
struct dmar_domain *domain;
@@ -2518,8 +2518,7 @@ int intel_map_sg(struct device *hwdev, struct scatterlist 
*sglist, int nelems,
iommu = domain_get_iommu(domain);
 
for_each_sg(sglist, sg, nelems, i) {
-   addr = SG_ENT_VIRT_ADDRESS(sg);
-   addr = (void *)virt_to_phys(addr);
+   addr = page_to_phys(sg_page(sg)) + sg-offset;
size += aligned_size((u64)addr, sg-length);
}
 
@@ -2542,8 +2541,7 @@ int intel_map_sg(struct device *hwdev, struct scatterlist 
*sglist, int nelems,
start_addr = iova-pfn_lo  PAGE_SHIFT;
offset = 0

[PATCH v9 0/7] PCI: Linux kernel SR-IOV support

2009-02-16 Thread Yu Zhao
Greetings,

The following patches are intended to support the SR-IOV capability in
the Linux kernel. With these patches, a PCI device that has the
capability can be turned into multiple devices from the software
perspective, which benefits KVM and serves other purposes such as QoS
and security.

SR-IOV specification can be found at:
http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
(it requires membership.)

Devices that support SR-IOV are available from following vendors:
http://download.intel.com/design/network/ProdBrf/320025.pdf
http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf

The Physical Function driver for the Intel 82576 NIC (based on
drivers/net/igb/) will come soon.

Major changes from v8 to v9:
  1, put a might_sleep() into SR-IOV API which sleeps (Andi Kleen)
  2, block user config accesses before clearing VF Enable bit (Matthew Wilcox)

Yu Zhao (7):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |   27 ++
 Documentation/DocBook/kernel-api.tmpl   |1 +
 Documentation/PCI/pci-iov-howto.txt |   99 +
 drivers/pci/Kconfig |   13 +
 drivers/pci/Makefile|3 +
 drivers/pci/iov.c   |  707 +++
 drivers/pci/pci.c   |8 +
 drivers/pci/pci.h   |   53 +++
 drivers/pci/probe.c |7 +
 include/linux/pci.h |   28 ++
 include/linux/pci_regs.h|   33 ++
 11 files changed, 979 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v9 2/7] PCI: restore saved SR-IOV state

2009-02-16 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   25 +
 drivers/pci/pci.c |1 +
 drivers/pci/pci.h |4 
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index e6736d4..1cc879b 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -128,6 +128,21 @@ static void sriov_release(struct pci_dev *dev)
dev-sriov = NULL;
 }
 
+static void sriov_restore_state(struct pci_dev *dev)
+{
+   u16 ctrl;
+   struct pci_sriov *iov = dev-sriov;
+
+   pci_read_config_word(dev, iov-pos + PCI_SRIOV_CTRL, ctrl);
+   if (ctrl  PCI_SRIOV_CTRL_VFE)
+   return;
+
+   pci_write_config_dword(dev, iov-pos + PCI_SRIOV_SYS_PGSIZE, iov-pgsz);
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+   if (iov-ctrl  PCI_SRIOV_CTRL_VFE)
+   msleep(100);
+}
+
 /**
  * pci_iov_init - initialize the IOV capability
  * @dev: the PCI device
@@ -179,3 +194,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
return dev-sriov-pos + PCI_SRIOV_BAR +
4 * (resno - PCI_SRIOV_RESOURCES);
 }
+
+/**
+ * pci_restore_iov_state - restore the state of the IOV capability
+ * @dev: the PCI device
+ */
+void pci_restore_iov_state(struct pci_dev *dev)
+{
+   if (dev-sriov)
+   sriov_restore_state(dev);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c4f14f3..f791dcf 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev)
}
pci_restore_pcix_state(dev);
pci_restore_msi_state(dev);
+   pci_restore_iov_state(dev);
 
return 0;
 }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d2dc6b7..9d76737 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
+extern void pci_restore_iov_state(struct pci_dev *dev);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, 
int resno,
 {
return 0;
 }
+static inline void pci_restore_iov_state(struct pci_dev *dev)
+{
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
-- 
1.5.6.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v9 1/7] PCI: initialize and release SR-IOV capability

2009-02-16 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/Kconfig  |   13 
 drivers/pci/Makefile |3 +
 drivers/pci/iov.c|  181 ++
 drivers/pci/pci.c|7 ++
 drivers/pci/pci.h|   37 ++
 drivers/pci/probe.c  |4 +
 include/linux/pci.h  |8 ++
 include/linux/pci_regs.h |   33 +
 8 files changed, 286 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..e8ea3e8 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,16 @@ config HT_IRQ
   This allows native hypertransport devices to use interrupts.
 
   If unsure say Y.
+
+config PCI_IOV
+   bool PCI IOV support
+   depends on PCI
+   select PCI_MSI
+   default n
+   help
+ PCI-SIG I/O Virtualization (IOV) Specifications support.
+ Single Root IOV: allows the Physical Function driver to enable
+ the hardware capability, so the Virtual Function is accessible
+ via the PCI Configuration Space using its own Bus, Device and
+ Function Numbers. Each Virtual Function also has the PCI Memory
+ Space to map the device specific register set.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba99282 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
 
 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
 
+# PCI IOV support
+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 000..e6736d4
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,181 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include linux/pci.h
+#include linux/mutex.h
+#include linux/string.h
+#include linux/delay.h
+#include pci.h
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+   int i;
+   int rc;
+   int nres;
+   u32 pgsz;
+   u16 ctrl, total, offset, stride;
+   struct pci_sriov *iov;
+   struct resource *res;
+   struct pci_dev *pdev;
+
+   if (dev-pcie_type != PCI_EXP_TYPE_RC_END 
+   dev-pcie_type != PCI_EXP_TYPE_ENDPOINT)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   if (ctrl  PCI_SRIOV_CTRL_VFE) {
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+   ssleep(1);
+   }
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, total);
+   if (!total)
+   return 0;
+
+   list_for_each_entry(pdev, dev-bus-devices, bus_list)
+   if (pdev-sriov)
+   break;
+   if (list_empty(dev-bus-devices) || !pdev-sriov)
+   pdev = NULL;
+
+   ctrl = 0;
+   if (!pdev  pci_ari_enabled(dev-bus))
+   ctrl |= PCI_SRIOV_CTRL_ARI;
+
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, offset);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, stride);
+   if (!offset || (total  1  !stride))
+   return -EIO;
+
+   pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, pgsz);
+   i = PAGE_SHIFT  12 ? PAGE_SHIFT - 12 : 0;
+   pgsz = ~((1  i) - 1);
+   if (!pgsz)
+   return -EIO;
+
+   pgsz = ~(pgsz - 1);
+   pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+   nres = 0;
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = dev->resource + PCI_SRIOV_RESOURCES + i;
+		i += __pci_read_base(dev, pci_bar_unknown, res,
+				     pos + PCI_SRIOV_BAR + i * 4);
+		if (!res->flags)
+			continue;
+		if (resource_size(res) & (PAGE_SIZE - 1)) {
+			rc = -EIO;
+			goto failed;
+		}
+		res->end = res->start + resource_size(res) * total - 1;
+   nres++;
+   }
+
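For illustration (hypothetical numbers, not from the patch): if each VF BAR
reports 16KB per VF and TotalVFs is 8, the loop above grows the PF resource
into a single 16KB * 8 = 128KB window; virtfn_add() later slices that window
back into per-VF pieces of resource_size(res) / iov->total.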
+   iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+   if (!iov) {
+   rc = -ENOMEM;
+   goto failed;
+   }
+
+   iov-pos = pos;
+   iov-nres = nres;
+   iov-ctrl = ctrl;
+   iov-total = total;
+   iov-offset = offset;
+   iov-stride = stride;
+   iov-pgsz = pgsz;
+   iov-self = dev;
+   pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, iov-cap);
+   pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, iov-link);
+
+   if (pdev)
+   iov-pdev = pci_dev_get(pdev);
+   else {
+   iov-pdev = dev

[PATCH v9 3/7] PCI: reserve bus range for SR-IOV device

2009-02-16 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |   34 ++
 drivers/pci/pci.h   |5 +
 drivers/pci/probe.c |3 +++
 3 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 1cc879b..c89fcb1 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -14,6 +14,16 @@
 #include pci.h
 
 
+static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 
*devfn)
+{
+   u16 bdf;
+
+	bdf = (dev->bus->number << 8) + dev->devfn +
+	      dev->sriov->offset + dev->sriov->stride * id;
+	*busnr = bdf >> 8;
+	*devfn = bdf & 0xff;
+}
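A worked example of the Routing ID arithmetic above (all values
hypothetical): with the PF at 0000:00:02.0 (devfn 0x10), First VF Offset
0x80 and VF Stride 2, VF id 3 yields bdf = (0 << 8) + 0x10 + 0x80 + 2 * 3 =
0x96, i.e. bus 0x00, devfn 0x96, which is device 0x12, function 6 (00:12.6).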
+
 static int sriov_init(struct pci_dev *dev, int pos)
 {
int i;
@@ -204,3 +214,27 @@ void pci_restore_iov_state(struct pci_dev *dev)
if (dev-sriov)
sriov_restore_state(dev);
 }
+
+/**
+ * pci_iov_bus_range - find bus range used by Virtual Function
+ * @bus: the PCI bus
+ *
+ * Returns the max number of buses (excluding the current one) used by
+ * Virtual Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+   int max = 0;
+   u8 busnr, devfn;
+   struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		if (!dev->sriov)
+			continue;
+		virtfn_bdf(dev, dev->sriov->total - 1, &busnr, &devfn);
+		if (busnr > max)
+			max = busnr;
+	}
+
+	return max ? max - bus->number : 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9d76737..fdfc476 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
+extern int pci_iov_bus_range(struct pci_bus *bus);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev 
*dev, int resno,
 static inline void pci_restore_iov_state(struct pci_dev *dev)
 {
 }
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+   return 0;
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 03b6f29..4c8abd0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus 
*bus)
for (devfn = 0; devfn  0x100; devfn += 8)
pci_scan_slot(bus, devfn);
 
+   /* Reserve buses for SR-IOV capability. */
+   max += pci_iov_bus_range(bus);
+
/*
 * After performing arch-dependent fixup of the bus, look behind
 * all PCI-to-PCI bridges on this bus.
-- 
1.5.6.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v9 5/7] PCI: handle SR-IOV Virtual Function Migration

2009-02-16 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  119 +++
 drivers/pci/pci.h   |4 ++
 include/linux/pci.h |6 +++
 3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index e4e2dac..127f643 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -206,6 +206,97 @@ static void sriov_release_dev(struct device *dev)
iov-nr_virtfn = 0;
 }
 
+static int sriov_migration(struct pci_dev *dev)
+{
+   u16 status;
+   struct pci_sriov *iov = dev-sriov;
+
+   if (!iov-nr_virtfn)
+   return 0;
+
+   if (!(iov-cap  PCI_SRIOV_CAP_VFM))
+   return 0;
+
+   pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status);
+   if (!(status  PCI_SRIOV_STATUS_VFM))
+   return 0;
+
+   schedule_work(iov-mtask);
+
+   return 1;
+}
+
+static void sriov_migration_task(struct work_struct *work)
+{
+   int i;
+   u8 state;
+   u16 status;
+   struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask);
+
+   for (i = iov-initial; i  iov-nr_virtfn; i++) {
+   state = readb(iov-mstate + i);
+   if (state == PCI_SRIOV_VFM_MI) {
+   writeb(PCI_SRIOV_VFM_AV, iov-mstate + i);
+   state = readb(iov-mstate + i);
+   if (state == PCI_SRIOV_VFM_AV)
+   virtfn_add(iov-self, i, 1);
+   } else if (state == PCI_SRIOV_VFM_MO) {
+   virtfn_remove(iov-self, i, 1);
+   writeb(PCI_SRIOV_VFM_UA, iov-mstate + i);
+   state = readb(iov-mstate + i);
+   if (state == PCI_SRIOV_VFM_AV)
+   virtfn_add(iov-self, i, 0);
+   }
+   }
+
+   pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status);
+   status = ~PCI_SRIOV_STATUS_VFM;
+   pci_write_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status);
+}
+
+static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn)
+{
+   int bir;
+   u32 table;
+   resource_size_t pa;
+   struct pci_sriov *iov = dev-sriov;
+
+   if (nr_virtfn = iov-initial)
+   return 0;
+
+   pci_read_config_dword(dev, iov-pos + PCI_SRIOV_VFM, table);
+   bir = PCI_SRIOV_VFM_BIR(table);
+   if (bir  PCI_STD_RESOURCE_END)
+   return -EIO;
+
+   table = PCI_SRIOV_VFM_OFFSET(table);
+   if (table + nr_virtfn  pci_resource_len(dev, bir))
+   return -EIO;
+
+   pa = pci_resource_start(dev, bir) + table;
+   iov-mstate = ioremap(pa, nr_virtfn);
+   if (!iov-mstate)
+   return -ENOMEM;
+
+   INIT_WORK(iov-mtask, sriov_migration_task);
+
+   iov-ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR;
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+
+   return 0;
+}
+
+static void sriov_disable_migration(struct pci_dev *dev)
+{
+   struct pci_sriov *iov = dev-sriov;
+
+   iov-ctrl = ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR);
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+
+   cancel_work_sync(iov-mtask);
+   iounmap(iov-mstate);
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
int rc;
@@ -294,6 +385,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
goto failed2;
}
 
+   if (iov-cap  PCI_SRIOV_CAP_VFM) {
+   rc = sriov_enable_migration(dev, nr_virtfn);
+   if (rc)
+   goto failed2;
+   }
+
kobject_uevent(dev-dev.kobj, KOBJ_CHANGE);
iov-nr_virtfn = nr_virtfn;
 
@@ -325,6 +422,9 @@ static void sriov_disable(struct pci_dev *dev)
if (!iov-nr_virtfn)
return;
 
+   if (iov-cap  PCI_SRIOV_CAP_VFM)
+   sriov_disable_migration(dev);
+
for (i = 0; i  iov-nr_virtfn; i++)
virtfn_remove(dev, i, 0);
 
@@ -586,3 +686,22 @@ void pci_disable_sriov(struct pci_dev *dev)
sriov_disable(dev);
 }
 EXPORT_SYMBOL_GPL(pci_disable_sriov);
+
+/**
+ * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
+ * @dev: the PCI device
+ *
+ * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
+ *
+ * Physical Function driver is responsible to register IRQ handler using
+ * VF Migration Interrupt Message Number, and call this function when the
+ * interrupt is generated by the hardware.
+ */
+irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+   if (!dev-sriov)
+   return IRQ_NONE;
+
+   return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_migration);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 328a611..51bebb2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1,6 +1,8 @@
 #ifndef

[PATCH v9 4/7] PCI: add SR-IOV API for Physical Function driver

2009-02-16 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  348 +++
 drivers/pci/pci.h   |3 +
 include/linux/pci.h |   14 ++
 3 files changed, 365 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index c89fcb1..e4e2dac 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -13,6 +13,8 @@
 #include linux/delay.h
 #include pci.h
 
+#define VIRTFN_ID_LEN  8
+
 
 static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 
*devfn)
 {
@@ -24,6 +26,319 @@ static inline void virtfn_bdf(struct pci_dev *dev, int id, 
u8 *busnr, u8 *devfn)
*devfn = bdf  0xff;
 }
 
+static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
+{
+   int rc;
+   struct pci_bus *child;
+
+   if (bus-number == busnr)
+   return bus;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   if (child)
+   return child;
+
+   child = pci_add_new_bus(bus, NULL, busnr);
+   if (!child)
+   return NULL;
+
+   child-subordinate = busnr;
+   child-dev.parent = bus-bridge;
+   rc = pci_bus_add_child(child);
+   if (rc) {
+   pci_remove_bus(child);
+   return NULL;
+   }
+
+   return child;
+}
+
+static void virtfn_remove_bus(struct pci_bus *bus, int busnr)
+{
+   struct pci_bus *child;
+
+   if (bus-number == busnr)
+   return;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   BUG_ON(!child);
+
+   if (list_empty(child-devices))
+   pci_remove_bus(child);
+}
+
+static int virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+   int i;
+   int rc;
+   u64 size;
+   u8 busnr, devfn;
+   char buf[VIRTFN_ID_LEN];
+   struct pci_dev *virtfn;
+   struct resource *res;
+   struct pci_sriov *iov = dev-sriov;
+
+   virtfn = alloc_pci_dev();
+   if (!virtfn)
+   return -ENOMEM;
+
+   virtfn_bdf(dev, id, busnr, devfn);
+   mutex_lock(iov-pdev-sriov-lock);
+   virtfn-bus = virtfn_add_bus(dev-bus, busnr);
+   if (!virtfn-bus) {
+   kfree(virtfn);
+   mutex_unlock(iov-pdev-sriov-lock);
+   return -ENOMEM;
+   }
+
+   virtfn-sysdata = dev-bus-sysdata;
+   virtfn-dev.parent = dev-dev.parent;
+   virtfn-dev.bus = dev-dev.bus;
+   virtfn-devfn = devfn;
+   virtfn-hdr_type = PCI_HEADER_TYPE_NORMAL;
+   virtfn-cfg_size = PCI_CFG_SPACE_EXP_SIZE;
+   virtfn-error_state = pci_channel_io_normal;
+   virtfn-current_state = PCI_UNKNOWN;
+   virtfn-is_pcie = 1;
+   virtfn-pcie_type = PCI_EXP_TYPE_ENDPOINT;
+   virtfn-dma_mask = 0x;
+   virtfn-vendor = dev-vendor;
+   virtfn-subsystem_vendor = dev-subsystem_vendor;
+   virtfn-class = dev-class;
+   pci_read_config_word(dev, iov-pos + PCI_SRIOV_VF_DID, virtfn-device);
+   pci_read_config_byte(virtfn, PCI_REVISION_ID, virtfn-revision);
+   pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
+virtfn-subsystem_device);
+
+   dev_set_name(virtfn-dev, %04x:%02x:%02x.%d,
+pci_domain_nr(virtfn-bus), busnr,
+PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+   for (i = 0; i  PCI_SRIOV_NUM_BARS; i++) {
+   res = dev-resource + PCI_SRIOV_RESOURCES + i;
+   if (!res-parent)
+   continue;
+   virtfn-resource[i].name = pci_name(virtfn);
+   virtfn-resource[i].flags = res-flags;
+   size = resource_size(res);
+   do_div(size, iov-total);
+   virtfn-resource[i].start = res-start + size * id;
+   virtfn-resource[i].end = virtfn-resource[i].start + size - 1;
+   rc = request_resource(res, virtfn-resource[i]);
+   BUG_ON(rc);
+   }
+
+   if (reset)
+   pci_execute_reset_function(virtfn);
+
+   pci_device_add(virtfn, virtfn-bus);
+   mutex_unlock(iov-pdev-sriov-lock);
+
+   virtfn-physfn = pci_dev_get(dev);
+
+   rc = pci_bus_add_device(virtfn);
+   if (rc)
+   goto failed1;
+   sprintf(buf, %d, id);
+   rc = sysfs_create_link(iov-dev.kobj, virtfn-dev.kobj, buf);
+   if (rc)
+   goto failed1;
+   rc = sysfs_create_link(virtfn-dev.kobj, dev-dev.kobj, physfn);
+   if (rc)
+   goto failed2;
+
+   kobject_uevent(virtfn-dev.kobj, KOBJ_CHANGE);
+
+   return 0;
+
+failed2:
+   sysfs_remove_link(iov-dev.kobj, buf);
+failed1:
+   pci_dev_put(dev);
+   mutex_lock(iov-pdev-sriov-lock);
+   pci_remove_bus_device(virtfn);
+   virtfn_remove_bus(dev-bus, busnr);
+   mutex_unlock(iov-pdev-sriov-lock);
+
+   return rc;
+}
+
+static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+{
+   u8 busnr, devfn;
+   char buf[VIRTFN_ID_LEN];
+   struct

[PATCH v9 6/7] PCI: document SR-IOV sysfs entries

2009-02-16 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/ABI/testing/sysfs-bus-pci |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index ceddcff..84dc100 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -9,3 +9,30 @@ Description:
that some devices may have malformatted data.  If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.
+
+What:  /sys/bus/pci/devices/.../virtfn/N
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbol link appears when hardware supports SR-IOV
+   capability and Physical Function driver has enabled it.
+   The symbol link points to the PCI device sysfs entry of
+   Virtual Function whose index is N (0...MaxVFs-1).
+
+What:  /sys/bus/pci/devices/.../virtfn/dep_link
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbol link appears when hardware supports SR-IOV
+   capability and Physical Function driver has enabled it,
+   and this device has vendor specific dependencies with
+   others. The symbol link points to the PCI device sysfs
+   entry of Physical Function this device depends on.
+
+What:  /sys/bus/pci/devices/.../physfn
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbol link appears when a device is Virtual Function.
+   The symbol link points to the PCI device sysfs entry of
+   Physical Function this device associates with.
-- 
1.5.6.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v9 7/7] PCI: manual for SR-IOV user and driver developer

2009-02-16 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/DocBook/kernel-api.tmpl |1 +
 Documentation/PCI/pci-iov-howto.txt   |   99 +
 2 files changed, 100 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl 
b/Documentation/DocBook/kernel-api.tmpl
index 5818ff7..506e611 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c
 --
 !Edrivers/pci/probe.c
 !Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
  /sect1
  sect1titlePCI Hotplug Support Library/title
 !Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt 
b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 000..fc73ef5
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,99 @@
+   PCI Express I/O Virtualization Howto
+   Copyright (C) 2009 Intel Corporation
+   Yu Zhao yu.z...@intel.com
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of the VF can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). And each VF also has PCI
+Memory Space, which is used to map its register set. VF device driver
+operates on the register set so it can be functional and appear as a
+real existing PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) will control the enabling and disabling
+of the capability via API provided by SR-IOV core. If the hardware
+has SR-IOV capability, loading its PF driver would enable it and all
+VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+The VF is treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. The VF
+requires device driver that is same as a normal PCI device's.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+   int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+   'nr_virtfn' is number of VFs to be enabled.
+
+To disable SR-IOV capability:
+   void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+   irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+Following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id 
*id)
+{
+   pci_enable_sriov(dev, NR_VIRTFN);
+
+   ...
+
+   return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+   pci_disable_sriov(dev);
+
+   ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+   ...
+
+   return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+   ...
+
+   return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+   ...
+}
+
+static struct pci_driver dev_driver = {
+   .name = SR-IOV Physical Function driver,
+   .id_table = dev_id_table,
+   .probe =dev_probe,
+   .remove =   __devexit_p(dev_remove),
+   .suspend =  dev_suspend,
+   .resume =   dev_resume,
+   .shutdown = dev_shutdown,
+};
-- 
1.5.6.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability

2009-02-13 Thread Yu Zhao
On Sat, Feb 14, 2009 at 12:56:44AM +0800, Andi Kleen wrote:
 Yu Zhao yu.z...@intel.com writes:
  +
  +
  +static int sriov_init(struct pci_dev *dev, int pos)
  +{
  +   int i;
  +   int rc;
  +   int nres;
  +   u32 pgsz;
  +   u16 ctrl, total, offset, stride;
  +   struct pci_sriov *iov;
  +   struct resource *res;
  +   struct pci_dev *pdev;
  +
+	if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+	    dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
+		return -ENODEV;
  +
 
 It would be a good idea to put a might_sleep() here just in 
 case the msleep happens below and drivers call it incorrectly.

Yes, will do.

+	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
+	if (ctrl & PCI_SRIOV_CTRL_VFE) {
+		pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+		msleep(100);
 
 That's really long. Hopefully that's really needed.

It's needed according to the SR-IOV spec; however, these lines only
clear the VF Enable bit if the BIOS or something else has already set
it, so the delay is not always taken.

  +
  +   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
  +   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
+	if (!offset || (total > 1 && !stride))
+		return -EIO;
+
+	pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
+	i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+	pgsz &= ~((1 << i) - 1);
  +   if (!pgsz)
  +   return -EIO;
 
 All the error paths don't seem to undo the config space writes.
 How will the devices behave with half initialized context?

Since the VF Enable bit is cleared before the initialization, setting
the other SR-IOV registers won't change the state of the device. So it
should be OK even without undoing these writes, as long as the VF
Enable bit is not set.

Thanks,
Yu
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability

2009-02-13 Thread Yu Zhao
On Sat, Feb 14, 2009 at 01:49:59AM +0800, Matthew Wilcox wrote:
 On Fri, Feb 13, 2009 at 05:56:44PM +0100, Andi Kleen wrote:
+	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
+	if (ctrl & PCI_SRIOV_CTRL_VFE) {
+		pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+		msleep(100);
  
  That's really long. Hopefully that's really needed.
 
 Yes and no.  The spec says:
 
   To allow components to perform internal initialization, system software
   must wait for at least 100 ms after changing the VF Enable bit from
   a 0 to a 1, before it is permitted to issue Configuration Requests to
   the VFs which are enabled by that VF Enable bit.
 
 So we don't have to wait here, but we do have to wait before exposing
 all these virtual functions to the rest of the system.  Should we add
 more complexity, perhaps spawn a thread to do it asynchronously, or add
 0.1 seconds to device initialisation?  A question without an easy
 answer, iMO.

This clears the VF Enable bit only if the BIOS has set it, so it doesn't
always happen. Actually the `msleep(100)' should be `ssleep(1)' here,
according to the spec you quoted below. I remembered the waiting time
incorrectly: 100ms is the requirement for setting the VF Enable bit,
not for clearing it.

   +
   + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
   + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
+	if (!offset || (total > 1 && !stride))
+		return -EIO;
+
+	pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
+	i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+	pgsz &= ~((1 << i) - 1);
   + if (!pgsz)
   + return -EIO;
  
  All the error paths don't seem to undo the config space writes.
  How will the devices behave with half initialized context?
 
 I think we should clear the VF_ENABLE bit.  That action is also fraught
 with danger:

The VF Enable bit hasn't been set yet :-) Actually the spec forbids
software from writing those registers (NumVFs, Supported Page Size,
etc.) while the enable bit is set.

 
   If software Clears VF Enable, software must allow 1 second after VF
   Enable is Cleared before reading any field in the SR-IOV Extended
   Capability or the VF Migration State Array (see Section 3.3.15.1).
 
 Another msleep(1000) here?  Not pretty, but what else can we do?
 
 Not to mention the danger of something else innocently using lspci -
 to read a field in the extended capability -- I suspect we also need to
 block user config accesses before clearing this bit.

Yes, we should block user config access.
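A minimal sketch of what the clear-VF-Enable path could then look like,
assuming the existing pci_block_user_cfg_access()/pci_unblock_user_cfg_access()
helpers (sketch only, not the final patch):

	pci_block_user_cfg_access(dev);
	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);	/* clear VF Enable */
	ssleep(1);	/* spec: wait 1s before touching the capability again */
	pci_unblock_user_cfg_access(dev);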
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/6] ATS capability support for Intel IOMMU

2009-02-12 Thread Yu Zhao
This patch series implements Address Translation Service support for
the Intel IOMMU. ATS allows a PCI Endpoint to request DMA address
translations from the IOMMU and cache them in the Endpoint, which
alleviates IOMMU TLB pressure and improves hardware performance in an
I/O virtualization environment.


Changelog: v2 - v3
  1, throw error message if VT-d hardware detects invalid descriptor
 on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS
 Invalidate Queue Depth (Matthew Wilcox)
Changelog: v1 - v2
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)


Yu Zhao (6):
  PCI: support the ATS capability
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add queue invalidation fault status support
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c   |  230 ++
 drivers/pci/intel-iommu.c|  135 -
 drivers/pci/intr_remapping.c |   21 ++--
 drivers/pci/pci.c|   72 +
 include/linux/dmar.h |9 ++
 include/linux/intel-iommu.h  |   19 +++-
 include/linux/pci.h  |   16 +++
 include/linux/pci_regs.h |   10 ++
 8 files changed, 457 insertions(+), 55 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 1/6] PCI: support the ATS capability

2009-02-12 Thread Yu Zhao
The ATS spec can be found at http://www.pcisig.com/specifications/iov/ats/
(it requires membership).

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/pci.c|   72 ++
 include/linux/pci.h  |   16 ++
 include/linux/pci_regs.h |   10 ++
 3 files changed, 98 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e3efe6b..87018ab 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1462,6 +1462,78 @@ void pci_enable_ari(struct pci_dev *dev)
 }
 
 /**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @ps: the IOMMU page shift
+ *
+ * Returns 0 on success, or a negative value on error.
+ */
+int pci_enable_ats(struct pci_dev *dev, int ps)
+{
+   int pos;
+   u16 ctrl;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+	if (ps < PCI_ATS_MIN_STU)
+		return -EINVAL;
+
+	ctrl = PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU) | PCI_ATS_CTRL_ENABLE;
+	pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl);
+
+	dev->ats_enabled = 1;
+
+   return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+   int pos;
+   u16 ctrl;
+
+	if (!dev->ats_enabled)
+		return;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+	if (!pos)
+		return;
+
+	pci_read_config_word(dev, pos + PCI_ATS_CTRL, &ctrl);
+	ctrl &= ~PCI_ATS_CTRL_ENABLE;
+   pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl);
+}
+
+/**
+ * pci_ats_queue_depth - query ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or 0 on error.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+   int pos;
+   u16 cap;
+
+	if (dev->ats_qdep)
+		return dev->ats_qdep;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+	if (!pos)
+		return 0;
+
+	pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+	dev->ats_qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+			PCI_ATS_MAX_QDEP;
+	return dev->ats_qdep;
+}
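As a usage sketch (not part of this patch; the caller below is hypothetical,
and 12 stands in for the IOMMU page shift, VTD_PAGE_SHIFT on VT-d), an IOMMU
driver that wants the Endpoint to cache translations might do:

static void example_enable_device_iotlb(struct pci_dev *pdev)
{
	int qdep;

	/* fails if the device has no ATS capability */
	if (pci_enable_ats(pdev, 12))
		return;

	/* remember how many invalidate requests may be in flight */
	qdep = pci_ats_queue_depth(pdev);
	/* ... store qdep for later device IOTLB flushes ... */
}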
+
+/**
  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
  * @dev: the PCI device
  * @pin: the INTx pin (1=INTA, 2=INTB, 3=INTD, 4=INTD)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7bd624b..cab680b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -254,6 +254,7 @@ struct pci_dev {
unsigned intmsi_enabled:1;
unsigned intmsix_enabled:1;
unsigned intari_enabled:1;  /* ARI forwarding */
+   unsigned intats_enabled:1;  /* Address Translation Service */
unsigned intis_managed:1;
unsigned intis_pcie:1;
unsigned intstate_saved:1;
@@ -270,6 +271,7 @@ struct pci_dev {
struct list_head msi_list;
 #endif
struct pci_vpd *vpd;
+   int ats_qdep;   /* ATS Invalidate Queue Depth */
 };
 
 extern struct pci_dev *alloc_pci_dev(void);
@@ -1194,5 +1196,19 @@ int pci_ext_cfg_avail(struct pci_dev *dev);
 
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
+extern int pci_enable_ats(struct pci_dev *dev, int ps);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+	return dev->ats_enabled;
+}
+
 #endif /* __KERNEL__ */
 #endif /* LINUX_PCI_H */
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 027815b..3858b4f 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -498,6 +498,7 @@
 #define PCI_EXT_CAP_ID_DSN 3
 #define PCI_EXT_CAP_ID_PWR 4
 #define PCI_EXT_CAP_ID_ARI 14
+#define PCI_EXT_CAP_ID_ATS 15
 
 /* Advanced Error Reporting */
 #define PCI_ERR_UNCOR_STATUS   4   /* Uncorrectable Error Status */
@@ -615,4 +616,13 @@
 #define  PCI_ARI_CTRL_ACS  0x0002  /* ACS Function Groups Enable */
 #define  PCI_ARI_CTRL_FG(x)(((x)  4)  7) /* Function Group */
 
+/* Address Translation Service */
+#define PCI_ATS_CAP0x04/* ATS Capability Register */
+#define  PCI_ATS_CAP_QDEP(x)   ((x)  0x1f)/* Invalidate Queue Depth */
+#define  PCI_ATS_MAX_QDEP  32  /* Max Invalidate Queue Depth */
+#define PCI_ATS_CTRL   0x06/* ATS Control Register */
+#define  PCI_ATS_CTRL_ENABLE   0x8000  /* ATS Enable */
+#define  PCI_ATS_CTRL_STU(x)   ((x)  0x1f)/* Smallest Translation Unit */
+#define  PCI_ATS_MIN_STU   12  /* shift of minimum STU block */
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.6.1

--
To unsubscribe from

[PATCH v3 3/6] VT-d: add queue invalidation fault status support

2009-02-12 Thread Yu Zhao
Check the fault register after submitting a queue invalidation request.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c   |   63 --
 drivers/pci/intr_remapping.c |   21 --
 include/linux/intel-iommu.h  |4 ++-
 3 files changed, 63 insertions(+), 25 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index bd37b3c..66dda07 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -671,19 +671,53 @@ static inline void reclaim_free_desc(struct q_inval *qi)
}
 }
 
+static int qi_check_fault(struct intel_iommu *iommu, int index)
+{
+   u32 fault;
+   int head;
+	struct q_inval *qi = iommu->qi;
+	int wait_index = (index + 1) % QI_LENGTH;
+
+	fault = readl(iommu->reg + DMAR_FSTS_REG);
+
+	/*
+	 * If IQE happens, the head points to the descriptor associated
+	 * with the error. No new descriptors are fetched until the IQE
+	 * is cleared.
+	 */
+	if (fault & DMA_FSTS_IQE) {
+		head = readl(iommu->reg + DMAR_IQH_REG);
+		if ((head >> DMAR_IQ_OFFSET) == index) {
+			printk(KERN_ERR "VT-d detected invalid descriptor: "
+				"low=%llx, high=%llx\n",
+				(unsigned long long)qi->desc[index].low,
+				(unsigned long long)qi->desc[index].high);
+			memcpy(&qi->desc[index], &qi->desc[wait_index],
+				sizeof(struct qi_desc));
+			__iommu_flush_cache(iommu, &qi->desc[index],
+				sizeof(struct qi_desc));
+			writel(DMA_FSTS_IQE, iommu->reg + DMAR_FSTS_REG);
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
 /*
  * Submit the queued invalidation descriptor to the remapping
  * hardware unit and wait for its completion.
  */
-void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
+int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
+   int rc = 0;
struct q_inval *qi = iommu-qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
unsigned long flags;
 
if (!qi)
-   return;
+   return 0;
 
hw = qi-desc;
 
@@ -701,7 +735,8 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 
hw[index] = *desc;
 
-   wait_desc.low = QI_IWD_STATUS_DATA(2) | QI_IWD_STATUS_WRITE | 
QI_IWD_TYPE;
+   wait_desc.low = QI_IWD_STATUS_DATA(QI_DONE) |
+   QI_IWD_STATUS_WRITE | QI_IWD_TYPE;
wait_desc.high = virt_to_phys(qi-desc_status[wait_index]);
 
hw[wait_index] = wait_desc;
@@ -712,13 +747,11 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
qi-free_head = (qi-free_head + 2) % QI_LENGTH;
qi-free_cnt -= 2;
 
-	spin_lock(&iommu->register_lock);
 	/*
 	 * update the HW tail register indicating the presence of
 	 * new descriptors.
 	 */
-	writel(qi->free_head << 4, iommu->reg + DMAR_IQT_REG);
-	spin_unlock(&iommu->register_lock);
+	writel(qi->free_head << DMAR_IQ_OFFSET, iommu->reg + DMAR_IQT_REG);
 
while (qi-desc_status[wait_index] != QI_DONE) {
/*
@@ -728,6 +761,10 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 * a deadlock where the interrupt context can wait indefinitely
 * for free slots in the queue.
 */
+   rc = qi_check_fault(iommu, index);
+   if (rc)
+   break;
+
spin_unlock(qi-q_lock);
cpu_relax();
spin_lock(qi-q_lock);
@@ -737,6 +774,8 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 
reclaim_free_desc(qi);
spin_unlock_irqrestore(qi-q_lock, flags);
+
+   return rc;
 }
 
 /*
@@ -749,13 +788,13 @@ void qi_global_iec(struct intel_iommu *iommu)
desc.low = QI_IEC_TYPE;
desc.high = 0;
 
+   /* should never fail */
qi_submit_sync(desc, iommu);
 }
 
 int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm,
 u64 type, int non_present_entry_flush)
 {
-
struct qi_desc desc;
 
if (non_present_entry_flush) {
@@ -769,10 +808,7 @@ int qi_flush_context(struct intel_iommu *iommu, u16 did, 
u16 sid, u8 fm,
| QI_CC_GRAN(type) | QI_CC_TYPE;
desc.high = 0;
 
-   qi_submit_sync(desc, iommu);
-
-   return 0;
-
+   return qi_submit_sync(desc, iommu);
 }
 
 int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
@@ -802,10 +838,7 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 
addr,
desc.high = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih)
| QI_IOTLB_AM(size_order

[PATCH v3 4/6] VT-d: add device IOTLB invalidation support

2009-02-12 Thread Yu Zhao
Support device IOTLB invalidation to flush the translations cached in
the Endpoint.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |   63 --
 include/linux/intel-iommu.h |   13 -
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index 66dda07..93b38e7 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -664,7 +664,8 @@ void free_iommu(struct intel_iommu *iommu)
  */
 static inline void reclaim_free_desc(struct q_inval *qi)
 {
-	while (qi->desc_status[qi->free_tail] == QI_DONE) {
+	while (qi->desc_status[qi->free_tail] == QI_DONE ||
+	       qi->desc_status[qi->free_tail] == QI_ABORT) {
 		qi->desc_status[qi->free_tail] = QI_FREE;
qi-free_tail = (qi-free_tail + 1) % QI_LENGTH;
qi-free_cnt++;
@@ -674,10 +675,13 @@ static inline void reclaim_free_desc(struct q_inval *qi)
 static int qi_check_fault(struct intel_iommu *iommu, int index)
 {
u32 fault;
-   int head;
+   int head, tail;
struct q_inval *qi = iommu-qi;
int wait_index = (index + 1) % QI_LENGTH;
 
+   if (qi-desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+
fault = readl(iommu-reg + DMAR_FSTS_REG);
 
/*
@@ -701,6 +705,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
}
}
 
+   /*
+* If ITE happens, all pending wait_desc commands are aborted.
+* No new descriptors are fetched until the ITE is cleared.
+*/
+   if (fault  DMA_FSTS_ITE) {
+   head = readl(iommu-reg + DMAR_IQH_REG);
+   head = ((head  DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH;
+   head |= 1;
+   tail = readl(iommu-reg + DMAR_IQT_REG);
+   tail = ((tail  DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH;
+
+   writel(DMA_FSTS_ITE, iommu-reg + DMAR_FSTS_REG);
+
+   do {
+   if (qi-desc_status[head] == QI_IN_USE)
+   qi-desc_status[head] = QI_ABORT;
+   head = (head - 2 + QI_LENGTH) % QI_LENGTH;
+   } while (head != tail);
+
+   if (qi-desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+   }
+
+   if (fault  DMA_FSTS_ICE)
+   writel(DMA_FSTS_ICE, iommu-reg + DMAR_FSTS_REG);
+
return 0;
 }
 
@@ -710,7 +740,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
  */
 int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
-   int rc = 0;
+   int rc;
struct q_inval *qi = iommu-qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
@@ -721,6 +751,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
 
hw = qi-desc;
 
+restart:
+   rc = 0;
+
spin_lock_irqsave(qi-q_lock, flags);
while (qi-free_cnt  3) {
spin_unlock_irqrestore(qi-q_lock, flags);
@@ -775,6 +808,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
reclaim_free_desc(qi);
spin_unlock_irqrestore(qi-q_lock, flags);
 
+   if (rc == -EAGAIN)
+   goto restart;
+
return rc;
 }
 
@@ -841,6 +877,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 
addr,
return qi_submit_sync(desc, iommu);
 }
 
+int qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
+   u64 addr, unsigned mask)
+{
+   struct qi_desc desc;
+
+   if (mask) {
+   BUG_ON(addr  ((1  (VTD_PAGE_SHIFT + mask)) - 1));
+   addr |= (1  (VTD_PAGE_SHIFT + mask - 1)) - 1;
+   desc.high = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE;
+   } else
+   desc.high = QI_DEV_IOTLB_ADDR(addr);
+
+   if (qdep = QI_DEV_IOTLB_MAX_INVS)
+   qdep = 0;
+
+   desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
+  QI_DIOTLB_TYPE;
+
+   return qi_submit_sync(desc, iommu);
+}
+
 /*
  * Enable Queued Invalidation interface. This is a must to support
  * interrupt-remapping. Also used by DMA-remapping, which replaces
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 0a220c9..d82bdac 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -196,6 +196,8 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
 #define DMA_FSTS_PPF ((u32)2)
 #define DMA_FSTS_PFO ((u32)1)
 #define DMA_FSTS_IQE (1  4)
+#define DMA_FSTS_ICE (1  5)
+#define DMA_FSTS_ITE (1  6)
 #define dma_fsts_fault_record_index(s) (((s)  8)  0xff)
 
 /* FRCD_REG, 32 bits access */
@@ -224,7 +226,8 @@ do {
\
 enum {
QI_FREE,
QI_IN_USE,
-   QI_DONE
+   QI_DONE,
+   QI_ABORT
 };
 
 #define

[PATCH v3 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps

2009-02-12 Thread Yu Zhao
Make iommu_flush_iotlb_psi() and flush_unmaps() easier to read.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c |   46 +---
 1 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f4b7c79..5fdbed3 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -925,30 +925,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
-   unsigned int mask;
+   int rc;
+   unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 
BUG_ON(addr  (~VTD_PAGE_MASK));
BUG_ON(pages == 0);
 
-   /* Fallback to domain selective flush if no PSI support */
-   if (!cap_pgsel_inv(iommu-cap))
-   return iommu-flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH,
-   non_present_entry_flush);
-
/*
+* Fallback to domain selective flush if no PSI support or the size is
+* too big.
 * PSI requires page size to be 2 ^ x, and the base address is naturally
 * aligned to the size
 */
-   mask = ilog2(__roundup_pow_of_two(pages));
-   /* Fallback to domain selective flush if size is too big */
-   if (mask  cap_max_amask_val(iommu-cap))
-   return iommu-flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH, non_present_entry_flush);
-
-   return iommu-flush.flush_iotlb(iommu, did, addr, mask,
-   DMA_TLB_PSI_FLUSH,
-   non_present_entry_flush);
+   if (!cap_pgsel_inv(iommu-cap) || mask  cap_max_amask_val(iommu-cap))
+   rc = iommu-flush.flush_iotlb(iommu, did, 0, 0,
+   DMA_TLB_DSI_FLUSH,
+   non_present_entry_flush);
+   else
+   rc = iommu-flush.flush_iotlb(iommu, did, addr, mask,
+   DMA_TLB_PSI_FLUSH,
+   non_present_entry_flush);
+   return rc;
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2301,15 +2298,16 @@ static void flush_unmaps(void)
if (!iommu)
continue;
 
-   if (deferred_flush[i].next) {
-   iommu-flush.flush_iotlb(iommu, 0, 0, 0,
-DMA_TLB_GLOBAL_FLUSH, 0);
-   for (j = 0; j  deferred_flush[i].next; j++) {
-   __free_iova(deferred_flush[i].domain[j]-iovad,
-   deferred_flush[i].iova[j]);
-   }
-   deferred_flush[i].next = 0;
+   if (!deferred_flush[i].next)
+   continue;
+
+   iommu-flush.flush_iotlb(iommu, 0, 0, 0,
+DMA_TLB_GLOBAL_FLUSH, 0);
+   for (j = 0; j  deferred_flush[i].next; j++) {
+   __free_iova(deferred_flush[i].domain[j]-iovad,
+   deferred_flush[i].iova[j]);
}
+   deferred_flush[i].next = 0;
}
 
list_size = 0;
-- 
1.6.1



[PATCH v3 6/6] VT-d: support the device IOTLB

2009-02-12 Thread Yu Zhao
Support device IOTLB (i.e. ATS) for both native and KVM environments.
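
The key hook is that the page-selective IOTLB flush now fans out to every
ATS-enabled device in the domain. Condensed from the change to
iommu_flush_iotlb_psi() in the diff below, the flow is roughly:

	rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
				      DMA_TLB_PSI_FLUSH,
				      non_present_entry_flush);
	if (!rc && !non_present_entry_flush)
		iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);

The device IOTLB is flushed only after the IOMMU-side invalidation succeeds,
and it is skipped for non-present-entry flushes, where the device cannot have
cached a translation yet.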

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c   |   95 +-
 include/linux/intel-iommu.h |1 +
 2 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 5fdbed3..fe09e7a 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -125,6 +125,7 @@ static inline void context_set_fault_enable(struct 
context_entry *context)
 }
 
 #define CONTEXT_TT_MULTI_LEVEL 0
+#define CONTEXT_TT_DEV_IOTLB   1
 
 static inline void context_set_translation_type(struct context_entry *context,
unsigned long value)
@@ -240,6 +241,7 @@ struct device_domain_info {
struct list_head global; /* link to global list */
u8 bus; /* PCI bus numer */
u8 devfn;   /* PCI devfn number */
+   struct intel_iommu *iommu; /* IOMMU used by this device */
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct dmar_domain *domain; /* pointer to domain */
 };
@@ -922,6 +924,74 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
return 0;
 }
 
+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+   int found = 0;
+   unsigned long flags;
+   struct device_domain_info *info;
+   struct intel_iommu *iommu = device_to_iommu(bus, devfn);
+
+   if (!ecap_dev_iotlb_support(iommu-ecap))
+   return NULL;
+
+   if (!iommu-qi)
+   return NULL;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link)
+   if (info-dev  info-bus == bus  info-devfn == devfn) {
+   found = 1;
+   break;
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+
+   if (!found)
+   return NULL;
+
+   if (!dmar_find_matched_atsr_unit(info-dev))
+   return NULL;
+
+   info-iommu = iommu;
+   if (!pci_ats_queue_depth(info-dev))
+   return NULL;
+
+   return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+   pci_enable_ats(info-dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+   if (info-dev  pci_ats_enabled(info-dev))
+   pci_disable_ats(info-dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+ u64 addr, unsigned mask)
+{
+   int rc;
+   u16 sid, qdep;
+   unsigned long flags;
+   struct device_domain_info *info;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link) {
+   if (!info-dev || !pci_ats_enabled(info-dev))
+   continue;
+
+   sid = info-bus  8 | info-devfn;
+   qdep = pci_ats_queue_depth(info-dev);
+   rc = qi_flush_dev_iotlb(info-iommu, sid, qdep, addr, mask);
+   if (rc)
+   printk(KERN_ERR IOMMU: flush device IOTLB failed\n);
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -945,6 +1015,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu 
*iommu, u16 did,
rc = iommu-flush.flush_iotlb(iommu, did, addr, mask,
DMA_TLB_PSI_FLUSH,
non_present_entry_flush);
+   if (!rc  !non_present_entry_flush)
+   iommu_flush_dev_iotlb(iommu-domains[did], addr, mask);
+
return rc;
 }
 
@@ -1469,6 +1542,7 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
unsigned long ndomains;
int id;
int agaw;
+   struct device_domain_info *info;
 
pr_debug(Set context mapping for %02x:%02x.%d\n,
bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1534,7 +1608,11 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
context_set_domain_id(context, id);
context_set_address_width(context, iommu-agaw);
context_set_address_root(context, virt_to_phys(pgd));
-   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+   info = iommu_support_dev_iotlb(domain, bus, devfn);
+   if (info)
+   context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB);
+   else
+   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
context_set_fault_enable(context);
context_set_present(context);
domain_flush_cache(domain, context, sizeof(*context));
@@ -1546,6

[PATCH v3 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure

2009-02-12 Thread Yu Zhao
Parse the Root Port ATS Capability Reporting Structure in the DMA Remapping
Reporting Structure ACPI table.
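
Once the ATSR units have been collected, deciding whether a given device may
use ATS comes down to a few checks. The sketch below is an illustration only;
example_can_use_ats() is a hypothetical name, and the checks mirror the ones
the device IOTLB patch later in this series performs before enabling ATS
(the helpers come from linux/intel-iommu.h and the dmar.h change here).

#include <linux/pci.h>
#include <linux/dmar.h>
#include <linux/intel-iommu.h>

/* Illustration: gate ATS use on IOMMU and platform support. */
static int example_can_use_ats(struct intel_iommu *iommu, struct pci_dev *pdev)
{
	if (!ecap_dev_iotlb_support(iommu->ecap))
		return 0;	/* IOMMU lacks Device-IOTLB support */
	if (!iommu->qi)
		return 0;	/* queued invalidation is not initialized */
	if (!dmar_find_matched_atsr_unit(pdev))
		return 0;	/* no ATS-capable Root Port above this device */
	return 1;
}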

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |  112 --
 include/linux/dmar.h|9 
 include/linux/intel-iommu.h |1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f5a662a..bd37b3c 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -254,6 +254,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
}
return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+   atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+   if (!atsru)
+   return -ENOMEM;
+
+   atsru-hdr = hdr;
+   atsru-include_all = atsr-flags  0x1;
+
+   list_add(atsru-list, dmar_atsr_units);
+
+   return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+   int rc;
+   struct acpi_dmar_atsr *atsr;
+
+   if (atsru-include_all)
+   return 0;
+
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   rc = dmar_parse_dev_scope((void *)(atsr + 1),
+   (void *)atsr + atsr-header.length,
+   atsru-devices_cnt, atsru-devices,
+   atsr-segment);
+   if (rc || !atsru-devices_cnt) {
+   list_del(atsru-list);
+   kfree(atsru);
+   }
+
+   return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+   int i;
+   struct pci_bus *bus;
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   list_for_each_entry(atsru, dmar_atsr_units, list) {
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   if (atsr-segment == pci_domain_nr(dev-bus))
+   goto found;
+   }
+
+   return 0;
+
+found:
+   for (bus = dev-bus; bus; bus = bus-parent) {
+   struct pci_dev *bridge = bus-self;
+
+   if (!bridge || !bridge-is_pcie ||
+   bridge-pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+   return 0;
+
+   if (bridge-pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   for (i = 0; i  atsru-devices_cnt; i++)
+   if (atsru-devices[i] == bridge)
+   return 1;
+   break;
+   }
+   }
+
+   if (atsru-include_all)
+   return 1;
+
+   return 0;
+}
 #endif
 
 static void __init
@@ -261,22 +339,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
 {
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_reserved_memory *rmrr;
+   struct acpi_dmar_atsr *atsr;
 
switch (header-type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-   drhd = (struct acpi_dmar_hardware_unit *)header;
+   drhd = container_of(header, struct acpi_dmar_hardware_unit,
+   header);
printk (KERN_INFO PREFIX
-   DRHD (flags: 0x%08x)base: 0x%016Lx\n,
-   drhd-flags, (unsigned long long)drhd-address);
+   DRHD base: %#016Lx flags: %#x\n,
+   (unsigned long long)drhd-address, drhd-flags);
break;
case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-   rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+   rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+   header);
printk (KERN_INFO PREFIX
-   RMRR base: 0x%016Lx end: 0x%016Lx\n,
+   RMRR base: %#016Lx end: %#016Lx\n,
(unsigned long long)rmrr-base_address,
(unsigned long long)rmrr-end_address);
break;
+   case ACPI_DMAR_TYPE_ATSR:
+   atsr = container_of(header, struct acpi_dmar_atsr, header);
+   printk(KERN_INFO PREFIX ATSR flags: %#x\n, atsr-flags);
+   break;
}
 }
 
@@ -341,6 +425,11 @@ parse_dmar_table(void)
ret = dmar_parse_one_rmrr(entry_header);
 #endif
break;
+   case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+   ret = dmar_parse_one_atsr(entry_header);
+#endif
+   break;
default:
printk(KERN_WARNING PREFIX
Unknown DMAR structure type\n);
@@ -409,11 +498,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
{
struct

[PATCH v8 0/7] PCI: Linux kernel SR-IOV support

2009-02-10 Thread Yu Zhao
Greetings,

The following patches are intended to support the SR-IOV capability in the
Linux kernel. With these patches, a PCI device that has the capability can
be turned into multiple devices from the software perspective, which
benefits KVM and serves other purposes such as QoS and security.

The SR-IOV specification can be found at:
http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
(it requires membership.)

Devices that support SR-IOV are available from the following vendors:
http://download.intel.com/design/network/ProdBrf/320025.pdf
http://www.neterion.com/products/x3100.html

A Physical Function driver for the Intel 82576 NIC (based on drivers/net/igb/)
will come in a few weeks.

Major changes from v7 to v8:
1, simplified the API for the PF driver
2, split the code and respin them against the latest tree

Yu Zhao (7):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |   27 ++
 Documentation/DocBook/kernel-api.tmpl   |1 +
 Documentation/PCI/pci-iov-howto.txt |  106 +
 drivers/pci/Kconfig |   13 +
 drivers/pci/Makefile|3 +
 drivers/pci/iov.c   |  692 +++
 drivers/pci/pci.c   |8 +
 drivers/pci/pci.h   |   53 +++
 drivers/pci/probe.c |7 +
 include/linux/pci.h |   28 ++
 include/linux/pci_regs.h|   33 ++
 11 files changed, 971 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c



[PATCH v8 1/7] PCI: initialize and release SR-IOV capability

2009-02-10 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/Kconfig  |   13 
 drivers/pci/Makefile |3 +
 drivers/pci/iov.c|  178 ++
 drivers/pci/pci.c|7 ++
 drivers/pci/pci.h|   37 ++
 drivers/pci/probe.c  |4 +
 include/linux/pci.h  |8 ++
 include/linux/pci_regs.h |   33 +
 8 files changed, 283 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..2d0ca01 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,16 @@ config HT_IRQ
   This allows native hypertransport devices to use interrupts.
 
   If unsure say Y.
+
+config PCI_IOV
+   bool PCI IOV support
+   depends on PCI
+   select PCI_MSI
+   default n
+   help
+ PCI-SIG I/O Virtualization (IOV) Specifications support.
+ Single Root IOV: allows the Physical Function driver to enable
+ the hardware capability, so the Virtual Function is accessible
+ via the PCI Configuration Space using its own Bus, Device and
+ Function Numbers. Each Virtual Function also has the PCI Memory
+ Space to map the device-specific register set.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba99282 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
 
 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
 
+# PCI IOV support
+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 000..9a1fabd
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,178 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao yu.z...@intel.com
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include linux/pci.h
+#include pci.h
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+   int i;
+   int rc;
+   int nres;
+   u32 pgsz;
+   u16 ctrl, total, offset, stride;
+   struct pci_sriov *iov;
+   struct resource *res;
+   struct pci_dev *pdev;
+
+   if (dev-pcie_type != PCI_EXP_TYPE_RC_END 
+   dev-pcie_type != PCI_EXP_TYPE_ENDPOINT)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   if (ctrl  PCI_SRIOV_CTRL_VFE) {
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+   msleep(100);
+   }
+
+   pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, total);
+   if (!total)
+   return 0;
+
+   list_for_each_entry(pdev, dev-bus-devices, bus_list)
+   if (pdev-sriov)
+   break;
+   if (list_empty(dev-bus-devices) || !pdev-sriov)
+   pdev = NULL;
+
+   ctrl = 0;
+   if (!pdev  pci_ari_enabled(dev-bus))
+   ctrl |= PCI_SRIOV_CTRL_ARI;
+
+   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, offset);
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, stride);
+   if (!offset || (total  1  !stride))
+   return -EIO;
+
+   pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, pgsz);
+   i = PAGE_SHIFT  12 ? PAGE_SHIFT - 12 : 0;
+   pgsz = ~((1  i) - 1);
+   if (!pgsz)
+   return -EIO;
+
+   pgsz = ~(pgsz - 1);
+   pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+   nres = 0;
+   for (i = 0; i  PCI_SRIOV_NUM_BARS; i++) {
+   res = dev-resource + PCI_SRIOV_RESOURCES + i;
+   i += __pci_read_base(dev, pci_bar_unknown, res,
+pos + PCI_SRIOV_BAR + i * 4);
+   if (!res-flags)
+   continue;
+   if (resource_size(res)  (PAGE_SIZE - 1)) {
+   rc = -EIO;
+   goto failed;
+   }
+   res-end = res-start + resource_size(res) * total - 1;
+   nres++;
+   }
+
+   iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+   if (!iov) {
+   rc = -ENOMEM;
+   goto failed;
+   }
+
+   iov-pos = pos;
+   iov-nres = nres;
+   iov-ctrl = ctrl;
+   iov-total = total;
+   iov-offset = offset;
+   iov-stride = stride;
+   iov-pgsz = pgsz;
+   iov-self = dev;
+   pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, iov-cap);
+   pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, iov-link);
+
+   if (pdev)
+   iov-pdev = pci_dev_get(pdev);
+   else {
+   iov-pdev = dev;
+   mutex_init(iov-lock);
+   }
+
+   dev-sriov = iov

[PATCH v8 3/7] PCI: reserve bus range for SR-IOV device

2009-02-10 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |   34 ++
 drivers/pci/pci.h   |5 +
 drivers/pci/probe.c |3 +++
 3 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index bd389b4..1cf13be 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -11,6 +11,16 @@
 #include pci.h
 
 
+static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 
*devfn)
+{
+   u16 bdf;
+
+   bdf = (dev-bus-number  8) + dev-devfn +
+ dev-sriov-offset + dev-sriov-stride * id;
+   *busnr = bdf  8;
+   *devfn = bdf  0xff;
+}
+
 static int sriov_init(struct pci_dev *dev, int pos)
 {
int i;
@@ -201,3 +211,27 @@ void pci_restore_iov_state(struct pci_dev *dev)
if (dev-sriov)
sriov_restore_state(dev);
 }
+
+/**
+ * pci_iov_bus_range - find bus range used by Virtual Function
+ * @bus: the PCI bus
+ *
+ * Returns the max number of buses (excluding the current one) used by
+ * Virtual Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+   int max = 0;
+   u8 busnr, devfn;
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, bus-devices, bus_list) {
+   if (!dev-sriov)
+   continue;
+   virtfn_bdf(dev, dev-sriov-total - 1, busnr, devfn);
+   if (busnr  max)
+   max = busnr;
+   }
+
+   return max ? max - bus-number : 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9d76737..fdfc476 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
+extern int pci_iov_bus_range(struct pci_bus *bus);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev 
*dev, int resno,
 static inline void pci_restore_iov_state(struct pci_dev *dev)
 {
 }
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+   return 0;
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 03b6f29..4c8abd0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus 
*bus)
for (devfn = 0; devfn  0x100; devfn += 8)
pci_scan_slot(bus, devfn);
 
+   /* Reserve buses for SR-IOV capability. */
+   max += pci_iov_bus_range(bus);
+
/*
 * After performing arch-dependent fixup of the bus, look behind
 * all PCI-to-PCI bridges on this bus.
-- 
1.5.6.4



[PATCH v8 2/7] PCI: restore saved SR-IOV state

2009-02-10 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   25 +
 drivers/pci/pci.c |1 +
 drivers/pci/pci.h |4 
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 9a1fabd..bd389b4 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -125,6 +125,21 @@ static void sriov_release(struct pci_dev *dev)
dev-sriov = NULL;
 }
 
+static void sriov_restore_state(struct pci_dev *dev)
+{
+   u16 ctrl;
+   struct pci_sriov *iov = dev-sriov;
+
+   pci_read_config_word(dev, iov-pos + PCI_SRIOV_CTRL, ctrl);
+   if (ctrl  PCI_SRIOV_CTRL_VFE)
+   return;
+
+   pci_write_config_dword(dev, iov-pos + PCI_SRIOV_SYS_PGSIZE, iov-pgsz);
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+   if (iov-ctrl  PCI_SRIOV_CTRL_VFE)
+   msleep(100);
+}
+
 /**
  * pci_iov_init - initialize the IOV capability
  * @dev: the PCI device
@@ -176,3 +191,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
return dev-sriov-pos + PCI_SRIOV_BAR +
4 * (resno - PCI_SRIOV_RESOURCES);
 }
+
+/**
+ * pci_restore_iov_state - restore the state of the IOV capability
+ * @dev: the PCI device
+ */
+void pci_restore_iov_state(struct pci_dev *dev)
+{
+   if (dev-sriov)
+   sriov_restore_state(dev);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c4f14f3..f791dcf 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev)
}
pci_restore_pcix_state(dev);
pci_restore_msi_state(dev);
+   pci_restore_iov_state(dev);
 
return 0;
 }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d2dc6b7..9d76737 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
+extern void pci_restore_iov_state(struct pci_dev *dev);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, 
int resno,
 {
return 0;
 }
+static inline void pci_restore_iov_state(struct pci_dev *dev)
+{
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
-- 
1.5.6.4



[PATCH v8 7/7] PCI: manual for SR-IOV user and driver developer

2009-02-10 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/DocBook/kernel-api.tmpl |1 +
 Documentation/PCI/pci-iov-howto.txt   |  106 +
 2 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl 
b/Documentation/DocBook/kernel-api.tmpl
index 5818ff7..506e611 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c
 --
 !Edrivers/pci/probe.c
 !Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
  /sect1
  sect1titlePCI Hotplug Support Library/title
 !Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt 
b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 000..9029369
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,106 @@
+   PCI Express I/O Virtualization Howto
+   Copyright (C) 2009 Intel Corporation
+   Yu Zhao yu.z...@intel.com
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of the VF can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as a traditional PCIe device. Once it is
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). Each VF also has PCI
+Memory Space, which is used to map its register set. The VF device
+driver operates on the register set so the VF can be functional and
+appear as a real PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) will control the enabling and disabling
+of the capability via the API provided by the SR-IOV core. If the
+hardware has SR-IOV capability, loading its PF driver will enable it
+and all VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+VFs are treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. A VF
+requires a device driver, just like a normal PCI device.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+   int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+   'nr_virtfn' is the number of VFs to be enabled.
+
+To disable SR-IOV capability:
+   void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+   irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+Following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id 
*id)
+{
+
+   dev-current_state = PCI_D0;
+
+   pci_enable_sriov(dev, NR_VIRTFN);
+
+   ...
+
+   return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+   pci_disable_sriov(dev);
+
+   ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+   ...
+
+   return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+   pci_restore_state(dev);
+
+   ...
+
+   return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+   ...
+}
+
+static struct pci_driver dev_driver = {
+   .name = SR-IOV Physical Function driver,
+   .id_table = dev_id_table,
+   .probe =dev_probe,
+   .remove =   __devexit_p(dev_remove),
+#ifdef CONFIG_PM
+   .suspend =  dev_suspend,
+   .resume =   dev_resume,
+#endif
+   .shutdown = dev_shutdown,
+};
-- 
1.5.6.4



[PATCH v8 6/7] PCI: document SR-IOV sysfs entries

2009-02-10 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 Documentation/ABI/testing/sysfs-bus-pci |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index ceddcff..84dc100 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -9,3 +9,30 @@ Description:
that some devices may have malformatted data.  If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.
+
+What:  /sys/bus/pci/devices/.../virtfn/N
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbolic link appears when the hardware supports the SR-IOV
+   capability and the Physical Function driver has enabled it.
+   The symbolic link points to the PCI device sysfs entry of the
+   Virtual Function whose index is N (0...MaxVFs-1).
+
+What:  /sys/bus/pci/devices/.../virtfn/dep_link
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbolic link appears when the hardware supports the SR-IOV
+   capability, the Physical Function driver has enabled it, and
+   this device has vendor-specific dependencies on other devices.
+   The symbolic link points to the PCI device sysfs entry of the
+   Physical Function this device depends on.
+
+What:  /sys/bus/pci/devices/.../physfn
+Date:  February 2009
+Contact:   Yu Zhao yu.z...@intel.com
+Description:
+   This symbolic link appears when a device is a Virtual Function.
+   The symbolic link points to the PCI device sysfs entry of the
+   Physical Function this device is associated with.
-- 
1.5.6.4



[PATCH v8 5/7] PCI: handle SR-IOV Virtual Function Migration

2009-02-10 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  119 +++
 drivers/pci/pci.h   |4 ++
 include/linux/pci.h |6 +++
 3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index d576160..d622167 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -203,6 +203,97 @@ static void sriov_release_dev(struct device *dev)
iov-nr_virtfn = 0;
 }
 
+static int sriov_migration(struct pci_dev *dev)
+{
+   u16 status;
+   struct pci_sriov *iov = dev-sriov;
+
+   if (!iov-nr_virtfn)
+   return 0;
+
+   if (!(iov-cap  PCI_SRIOV_CAP_VFM))
+   return 0;
+
+   pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status);
+   if (!(status  PCI_SRIOV_STATUS_VFM))
+   return 0;
+
+   schedule_work(iov-mtask);
+
+   return 1;
+}
+
+static void sriov_migration_task(struct work_struct *work)
+{
+   int i;
+   u8 state;
+   u16 status;
+   struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask);
+
+   for (i = iov-initial; i  iov-nr_virtfn; i++) {
+   state = readb(iov-mstate + i);
+   if (state == PCI_SRIOV_VFM_MI) {
+   writeb(PCI_SRIOV_VFM_AV, iov-mstate + i);
+   state = readb(iov-mstate + i);
+   if (state == PCI_SRIOV_VFM_AV)
+   virtfn_add(iov-self, i, 1);
+   } else if (state == PCI_SRIOV_VFM_MO) {
+   virtfn_remove(iov-self, i, 1);
+   writeb(PCI_SRIOV_VFM_UA, iov-mstate + i);
+   state = readb(iov-mstate + i);
+   if (state == PCI_SRIOV_VFM_AV)
+   virtfn_add(iov-self, i, 0);
+   }
+   }
+
+   pci_read_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status);
+   status = ~PCI_SRIOV_STATUS_VFM;
+   pci_write_config_word(iov-self, iov-pos + PCI_SRIOV_STATUS, status);
+}
+
+static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn)
+{
+   int bir;
+   u32 table;
+   resource_size_t pa;
+   struct pci_sriov *iov = dev-sriov;
+
+   if (nr_virtfn = iov-initial)
+   return 0;
+
+   pci_read_config_dword(dev, iov-pos + PCI_SRIOV_VFM, table);
+   bir = PCI_SRIOV_VFM_BIR(table);
+   if (bir  PCI_STD_RESOURCE_END)
+   return -EIO;
+
+   table = PCI_SRIOV_VFM_OFFSET(table);
+   if (table + nr_virtfn  pci_resource_len(dev, bir))
+   return -EIO;
+
+   pa = pci_resource_start(dev, bir) + table;
+   iov-mstate = ioremap(pa, nr_virtfn);
+   if (!iov-mstate)
+   return -ENOMEM;
+
+   INIT_WORK(iov-mtask, sriov_migration_task);
+
+   iov-ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR;
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+
+   return 0;
+}
+
+static void sriov_disable_migration(struct pci_dev *dev)
+{
+   struct pci_sriov *iov = dev-sriov;
+
+   iov-ctrl = ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR);
+   pci_write_config_word(dev, iov-pos + PCI_SRIOV_CTRL, iov-ctrl);
+
+   cancel_work_sync(iov-mtask);
+   iounmap(iov-mstate);
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
int rc;
@@ -287,6 +378,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
goto failed2;
}
 
+   if (iov-cap  PCI_SRIOV_CAP_VFM) {
+   rc = sriov_enable_migration(dev, nr_virtfn);
+   if (rc)
+   goto failed2;
+   }
+
kobject_uevent(dev-dev.kobj, KOBJ_CHANGE);
iov-nr_virtfn = nr_virtfn;
 
@@ -316,6 +413,9 @@ static void sriov_disable(struct pci_dev *dev)
if (!iov-nr_virtfn)
return;
 
+   if (iov-cap  PCI_SRIOV_CAP_VFM)
+   sriov_disable_migration(dev);
+
for (i = 0; i  iov-nr_virtfn; i++)
virtfn_remove(dev, i, 0);
 
@@ -571,3 +671,22 @@ void pci_disable_sriov(struct pci_dev *dev)
sriov_disable(dev);
 }
 EXPORT_SYMBOL_GPL(pci_disable_sriov);
+
+/**
+ * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
+ * @dev: the PCI device
+ *
+ * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
+ *
+ * The Physical Function driver is responsible for registering an IRQ
+ * handler using the VF Migration Interrupt Message Number, and for
+ * calling this function when the interrupt is generated by the hardware.
+ */
+irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+   if (!dev-sriov)
+   return IRQ_NONE;
+
+   return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_migration);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 328a611..51bebb2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1,6 +1,8 @@
 #ifndef
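
To make the pci_sriov_migration() kernel-doc concrete: a PF driver that
supports VF migration is expected to register an interrupt handler for the
migration vector and simply forward the interrupt to the SR-IOV core. The
sketch below is an illustration only; migration_irq, example_migration_irq()
and example_setup_migration() are hypothetical names, and how the driver
derives the vector from the VF Migration Interrupt Message Number during its
MSI-X setup is device specific.

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Illustration: forward the migration interrupt to the SR-IOV core. */
static irqreturn_t example_migration_irq(int irq, void *data)
{
	struct pci_dev *pdev = data;

	return pci_sriov_migration(pdev);
}

/* Called from the PF driver's probe path, after MSI-X setup. */
static int example_setup_migration(struct pci_dev *pdev,
				   unsigned int migration_irq)
{
	return request_irq(migration_irq, example_migration_irq, 0,
			   "vf-migration", pdev);
}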

[PATCH v8 4/7] PCI: add SR-IOV API for Physical Function driver

2009-02-10 Thread Yu Zhao
Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c   |  336 +++
 drivers/pci/pci.h   |3 +
 include/linux/pci.h |   14 ++
 3 files changed, 353 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 1cf13be..d576160 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -10,6 +10,8 @@
 #include linux/pci.h
 #include pci.h
 
+#define VIRTFN_ID_LEN  8
+
 
 static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 
*devfn)
 {
@@ -21,6 +23,311 @@ static inline void virtfn_bdf(struct pci_dev *dev, int id, 
u8 *busnr, u8 *devfn)
*devfn = bdf  0xff;
 }
 
+static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
+{
+   int rc;
+   struct pci_bus *child;
+
+   if (bus-number == busnr)
+   return bus;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   if (child)
+   return child;
+
+   child = pci_add_new_bus(bus, NULL, busnr);
+   if (!child)
+   return NULL;
+
+   child-subordinate = busnr;
+   child-dev.parent = bus-bridge;
+   rc = pci_bus_add_child(child);
+   if (rc) {
+   pci_remove_bus(child);
+   return NULL;
+   }
+
+   return child;
+}
+
+static void virtfn_remove_bus(struct pci_bus *bus, int busnr)
+{
+   struct pci_bus *child;
+
+   if (bus-number == busnr)
+   return;
+
+   child = pci_find_bus(pci_domain_nr(bus), busnr);
+   BUG_ON(!child);
+
+   if (list_empty(child-devices))
+   pci_remove_bus(child);
+}
+
+static int virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+   int i;
+   int rc;
+   u64 size;
+   u8 busnr, devfn;
+   char buf[VIRTFN_ID_LEN];
+   struct pci_dev *virtfn;
+   struct resource *res;
+   struct pci_sriov *iov = dev-sriov;
+
+   virtfn = alloc_pci_dev();
+   if (!virtfn)
+   return -ENOMEM;
+
+   virtfn_bdf(dev, id, busnr, devfn);
+   mutex_lock(iov-pdev-sriov-lock);
+   virtfn-bus = virtfn_add_bus(dev-bus, busnr);
+   if (!virtfn-bus) {
+   kfree(virtfn);
+   mutex_unlock(iov-pdev-sriov-lock);
+   return -ENOMEM;
+   }
+
+   virtfn-sysdata = dev-bus-sysdata;
+   virtfn-dev.parent = dev-dev.parent;
+   virtfn-dev.bus = dev-dev.bus;
+   virtfn-devfn = devfn;
+   virtfn-hdr_type = PCI_HEADER_TYPE_NORMAL;
+   virtfn-cfg_size = PCI_CFG_SPACE_EXP_SIZE;
+   virtfn-error_state = pci_channel_io_normal;
+   virtfn-current_state = PCI_UNKNOWN;
+   virtfn-is_pcie = 1;
+   virtfn-pcie_type = PCI_EXP_TYPE_ENDPOINT;
+   virtfn-dma_mask = 0x;
+   virtfn-vendor = dev-vendor;
+   virtfn-subsystem_vendor = dev-subsystem_vendor;
+   virtfn-class = dev-class;
+   pci_read_config_word(dev, iov-pos + PCI_SRIOV_VF_DID, virtfn-device);
+   pci_read_config_byte(virtfn, PCI_REVISION_ID, virtfn-revision);
+   pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
+virtfn-subsystem_device);
+
+   dev_set_name(virtfn-dev, %04x:%02x:%02x.%d,
+pci_domain_nr(virtfn-bus), busnr,
+PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+   for (i = 0; i  PCI_SRIOV_NUM_BARS; i++) {
+   res = dev-resource + PCI_SRIOV_RESOURCES + i;
+   if (!res-parent)
+   continue;
+   virtfn-resource[i].name = pci_name(virtfn);
+   virtfn-resource[i].flags = res-flags;
+   size = resource_size(res);
+   do_div(size, iov-total);
+   virtfn-resource[i].start = res-start + size * id;
+   virtfn-resource[i].end = virtfn-resource[i].start + size - 1;
+   rc = request_resource(res, virtfn-resource[i]);
+   BUG_ON(rc);
+   }
+
+   if (reset)
+   pci_execute_reset_function(virtfn);
+
+   pci_device_add(virtfn, virtfn-bus);
+   mutex_unlock(iov-pdev-sriov-lock);
+
+   virtfn-physfn = pci_dev_get(dev);
+
+   rc = pci_bus_add_device(virtfn);
+   if (rc)
+   goto failed1;
+   sprintf(buf, %d, id);
+   rc = sysfs_create_link(iov-dev.kobj, virtfn-dev.kobj, buf);
+   if (rc)
+   goto failed1;
+   rc = sysfs_create_link(virtfn-dev.kobj, dev-dev.kobj, physfn);
+   if (rc)
+   goto failed2;
+
+   kobject_uevent(virtfn-dev.kobj, KOBJ_CHANGE);
+
+   return 0;
+
+failed2:
+   sysfs_remove_link(iov-dev.kobj, buf);
+failed1:
+   pci_dev_put(dev);
+   mutex_lock(iov-pdev-sriov-lock);
+   pci_remove_bus_device(virtfn);
+   virtfn_remove_bus(dev-bus, busnr);
+   mutex_unlock(iov-pdev-sriov-lock);
+
+   return rc;
+}
+
+static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+{
+   u8 busnr, devfn;
+   char buf[VIRTFN_ID_LEN];
+   struct

[PATCH v2 0/6] ATS capability support for Intel IOMMU

2009-01-17 Thread Yu Zhao
This patch series implements Address Translation Service support for
the Intel IOMMU. ATS allows the PCI Endpoint to request DMA address
translations from the IOMMU and cache them in the Endpoint, thus
alleviating IOMMU TLB pressure and improving hardware performance in
the I/O virtualization environment.

Changelog: v1 - v2
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)


Yu Zhao (6):
  PCI: support the ATS capability
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add queue invalidation fault status support
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c   |  226 ++
 drivers/pci/intel-iommu.c|  137 +-
 drivers/pci/intr_remapping.c |   21 +++--
 drivers/pci/pci.c|   68 +
 include/linux/dmar.h |9 ++
 include/linux/intel-iommu.h  |   19 +++-
 include/linux/pci.h  |   15 +++
 include/linux/pci_regs.h |   10 ++
 8 files changed, 450 insertions(+), 55 deletions(-)



[PATCH v2 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure

2009-01-17 Thread Yu Zhao
Parse the Root Port ATS Capability Reporting Structure in the DMA Remapping
Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |  112 --
 include/linux/dmar.h|9 
 include/linux/intel-iommu.h |1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f5a662a..bd37b3c 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -254,6 +254,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
}
return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+   atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+   if (!atsru)
+   return -ENOMEM;
+
+   atsru-hdr = hdr;
+   atsru-include_all = atsr-flags  0x1;
+
+   list_add(atsru-list, dmar_atsr_units);
+
+   return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+   int rc;
+   struct acpi_dmar_atsr *atsr;
+
+   if (atsru-include_all)
+   return 0;
+
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   rc = dmar_parse_dev_scope((void *)(atsr + 1),
+   (void *)atsr + atsr-header.length,
+   atsru-devices_cnt, atsru-devices,
+   atsr-segment);
+   if (rc || !atsru-devices_cnt) {
+   list_del(atsru-list);
+   kfree(atsru);
+   }
+
+   return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+   int i;
+   struct pci_bus *bus;
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   list_for_each_entry(atsru, dmar_atsr_units, list) {
+   atsr = container_of(atsru-hdr, struct acpi_dmar_atsr, header);
+   if (atsr-segment == pci_domain_nr(dev-bus))
+   goto found;
+   }
+
+   return 0;
+
+found:
+   for (bus = dev-bus; bus; bus = bus-parent) {
+   struct pci_dev *bridge = bus-self;
+
+   if (!bridge || !bridge-is_pcie ||
+   bridge-pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+   return 0;
+
+   if (bridge-pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   for (i = 0; i  atsru-devices_cnt; i++)
+   if (atsru-devices[i] == bridge)
+   return 1;
+   break;
+   }
+   }
+
+   if (atsru-include_all)
+   return 1;
+
+   return 0;
+}
 #endif
 
 static void __init
@@ -261,22 +339,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
 {
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_reserved_memory *rmrr;
+   struct acpi_dmar_atsr *atsr;
 
switch (header-type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-   drhd = (struct acpi_dmar_hardware_unit *)header;
+   drhd = container_of(header, struct acpi_dmar_hardware_unit,
+   header);
printk (KERN_INFO PREFIX
-   DRHD (flags: 0x%08x)base: 0x%016Lx\n,
-   drhd-flags, (unsigned long long)drhd-address);
+   DRHD base: %#016Lx flags: %#x\n,
+   (unsigned long long)drhd-address, drhd-flags);
break;
case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-   rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+   rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+   header);
printk (KERN_INFO PREFIX
-   RMRR base: 0x%016Lx end: 0x%016Lx\n,
+   RMRR base: %#016Lx end: %#016Lx\n,
(unsigned long long)rmrr-base_address,
(unsigned long long)rmrr-end_address);
break;
+   case ACPI_DMAR_TYPE_ATSR:
+   atsr = container_of(header, struct acpi_dmar_atsr, header);
+   printk(KERN_INFO PREFIX ATSR flags: %#x\n, atsr-flags);
+   break;
}
 }
 
@@ -341,6 +425,11 @@ parse_dmar_table(void)
ret = dmar_parse_one_rmrr(entry_header);
 #endif
break;
+   case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+   ret = dmar_parse_one_atsr(entry_header);
+#endif
+   break;
default:
printk(KERN_WARNING PREFIX
Unknown DMAR structure type\n);
@@ -409,11 +498,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
{
struct

[PATCH v2 3/6] VT-d: add queue invalidation fault status support

2009-01-17 Thread Yu Zhao
Check the fault register after submitting a queued invalidation request.
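
With this change qi_submit_sync() returns the fault status instead of void,
so its wrappers can propagate errors to their callers. A minimal sketch of
the new calling convention (illustration only; the descriptor fields are
elided, and the code is assumed to sit in drivers/pci/dmar.c where struct
qi_desc is visible):

	struct qi_desc desc;
	int rc;

	/* fill in desc.low / desc.high for the wanted invalidation ... */

	rc = qi_submit_sync(&desc, iommu);
	if (rc)	/* non-zero if the hardware reported an invalidation error */
		printk(KERN_ERR "IOMMU: queued invalidation failed (%d)\n", rc);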

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c   |   59 +++--
 drivers/pci/intr_remapping.c |   21 --
 include/linux/intel-iommu.h  |4 ++-
 3 files changed, 59 insertions(+), 25 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index bd37b3c..0c87ebd 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -671,19 +671,49 @@ static inline void reclaim_free_desc(struct q_inval *qi)
}
 }
 
+static int qi_check_fault(struct intel_iommu *iommu, int index)
+{
+   u32 fault;
+   int head;
+   struct q_inval *qi = iommu-qi;
+   int wait_index = (index + 1) % QI_LENGTH;
+
+   fault = readl(iommu-reg + DMAR_FSTS_REG);
+
+   /*
+* If IQE happens, the head points to the descriptor associated
+* with the error. No new descriptors are fetched until the IQE
+* is cleared.
+*/
+   if (fault  DMA_FSTS_IQE) {
+   head = readl(iommu-reg + DMAR_IQH_REG);
+   if ((head  DMAR_IQ_OFFSET) == index) {
+   memcpy(qi-desc[index], qi-desc[wait_index],
+   sizeof(struct qi_desc));
+   __iommu_flush_cache(iommu, qi-desc[index],
+   sizeof(struct qi_desc));
+   writel(DMA_FSTS_IQE, iommu-reg + DMAR_FSTS_REG);
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
 /*
  * Submit the queued invalidation descriptor to the remapping
  * hardware unit and wait for its completion.
  */
-void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
+int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
+   int rc = 0;
struct q_inval *qi = iommu-qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
unsigned long flags;
 
if (!qi)
-   return;
+   return 0;
 
hw = qi-desc;
 
@@ -701,7 +731,8 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 
hw[index] = *desc;
 
-   wait_desc.low = QI_IWD_STATUS_DATA(2) | QI_IWD_STATUS_WRITE | 
QI_IWD_TYPE;
+   wait_desc.low = QI_IWD_STATUS_DATA(QI_DONE) |
+   QI_IWD_STATUS_WRITE | QI_IWD_TYPE;
wait_desc.high = virt_to_phys(qi-desc_status[wait_index]);
 
hw[wait_index] = wait_desc;
@@ -712,13 +743,11 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
qi-free_head = (qi-free_head + 2) % QI_LENGTH;
qi-free_cnt -= 2;
 
-   spin_lock(iommu-register_lock);
/*
 * update the HW tail register indicating the presence of
 * new descriptors.
 */
-   writel(qi-free_head  4, iommu-reg + DMAR_IQT_REG);
-   spin_unlock(iommu-register_lock);
+   writel(qi-free_head  DMAR_IQ_OFFSET, iommu-reg + DMAR_IQT_REG);
 
while (qi-desc_status[wait_index] != QI_DONE) {
/*
@@ -728,6 +757,10 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 * a deadlock where the interrupt context can wait indefinitely
 * for free slots in the queue.
 */
+   rc = qi_check_fault(iommu, index);
+   if (rc)
+   break;
+
spin_unlock(qi-q_lock);
cpu_relax();
spin_lock(qi-q_lock);
@@ -737,6 +770,8 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 
reclaim_free_desc(qi);
spin_unlock_irqrestore(qi-q_lock, flags);
+
+   return rc;
 }
 
 /*
@@ -749,13 +784,13 @@ void qi_global_iec(struct intel_iommu *iommu)
desc.low = QI_IEC_TYPE;
desc.high = 0;
 
+   /* should never fail */
qi_submit_sync(desc, iommu);
 }
 
 int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm,
 u64 type, int non_present_entry_flush)
 {
-
struct qi_desc desc;
 
if (non_present_entry_flush) {
@@ -769,10 +804,7 @@ int qi_flush_context(struct intel_iommu *iommu, u16 did, 
u16 sid, u8 fm,
| QI_CC_GRAN(type) | QI_CC_TYPE;
desc.high = 0;
 
-   qi_submit_sync(desc, iommu);
-
-   return 0;
-
+   return qi_submit_sync(desc, iommu);
 }
 
 int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
@@ -802,10 +834,7 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 
addr,
desc.high = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih)
| QI_IOTLB_AM(size_order);
 
-   qi_submit_sync(desc, iommu);
-
-   return 0;
-
+   return qi_submit_sync(desc, iommu);
 }
 
 /*
diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c
index f78371b..45effc5 100644
--- a/drivers/pci/intr_remapping.c
+++ b/drivers/pci/intr_remapping.c

[PATCH v2 6/6] VT-d: support the device IOTLB

2009-01-17 Thread Yu Zhao
Support device IOTLB (i.e. ATS) for both native and KVM environments.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c   |   97 +-
 include/linux/intel-iommu.h |1 +
 2 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index df92764..fb84d82 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -125,6 +125,7 @@ static inline void context_set_fault_enable(struct 
context_entry *context)
 }
 
 #define CONTEXT_TT_MULTI_LEVEL 0
+#define CONTEXT_TT_DEV_IOTLB   1
 
 static inline void context_set_translation_type(struct context_entry *context,
unsigned long value)
@@ -240,6 +241,8 @@ struct device_domain_info {
struct list_head global; /* link to global list */
u8 bus; /* PCI bus numer */
u8 devfn;   /* PCI devfn number */
+   int qdep;   /* invalidate queue depth */
+   struct intel_iommu *iommu; /* IOMMU used by this device */
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct dmar_domain *domain; /* pointer to domain */
 };
@@ -914,6 +917,75 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
return 0;
 }
 
+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+   int found = 0;
+   unsigned long flags;
+   struct device_domain_info *info;
+   struct intel_iommu *iommu = device_to_iommu(bus, devfn);
+
+   if (!ecap_dev_iotlb_support(iommu-ecap))
+   return NULL;
+
+   if (!iommu-qi)
+   return NULL;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link)
+   if (info-dev  info-bus == bus  info-devfn == devfn) {
+   found = 1;
+   break;
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+
+   if (!found)
+   return NULL;
+
+   if (!dmar_find_matched_atsr_unit(info-dev))
+   return NULL;
+
+   info-iommu = iommu;
+   info-qdep = pci_ats_qdep(info-dev);
+   if (!info-qdep)
+   return NULL;
+
+   return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+   pci_enable_ats(info-dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+   if (info-dev  pci_ats_enabled(info-dev))
+   pci_disable_ats(info-dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+ u64 addr, unsigned int mask)
+{
+   int rc;
+   u16 sid;
+   unsigned long flags;
+   struct device_domain_info *info;
+
+   spin_lock_irqsave(device_domain_lock, flags);
+   list_for_each_entry(info, domain-devices, link) {
+   if (!info-dev || !pci_ats_enabled(info-dev))
+   continue;
+
+   sid = info-bus  8 | info-devfn;
+   rc = qi_flush_dev_iotlb(info-iommu, sid,
+   info-qdep, addr, mask);
+   if (rc)
+   printk(KERN_ERR IOMMU: flush device IOTLB failed\n);
+   }
+   spin_unlock_irqrestore(device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -937,6 +1009,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu 
*iommu, u16 did,
rc = iommu-flush.flush_iotlb(iommu, did, addr, mask,
DMA_TLB_PSI_FLUSH,
non_present_entry_flush);
+   if (!rc  !non_present_entry_flush)
+   iommu_flush_dev_iotlb(iommu-domains[did], addr, mask);
+
return rc;
 }
 
@@ -1461,6 +1536,7 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
unsigned long ndomains;
int id;
int agaw;
+   struct device_domain_info *info;
 
pr_debug(Set context mapping for %02x:%02x.%d\n,
bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1526,7 +1602,11 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
context_set_domain_id(context, id);
context_set_address_width(context, iommu-agaw);
context_set_address_root(context, virt_to_phys(pgd));
-   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+   info = iommu_support_dev_iotlb(domain, bus, devfn);
+   if (info)
+   context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB);
+   else
+   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
context_set_fault_enable(context);
context_set_present(context

[PATCH v2 4/6] VT-d: add device IOTLB invalidation support

2009-01-17 Thread Yu Zhao
Support device IOTLB invalidation to flush the translation cached in the
Endpoint.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |   63 --
 include/linux/intel-iommu.h |   13 -
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index 0c87ebd..4fea360 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -664,7 +664,8 @@ void free_iommu(struct intel_iommu *iommu)
  */
 static inline void reclaim_free_desc(struct q_inval *qi)
 {
-   while (qi-desc_status[qi-free_tail] == QI_DONE) {
+   while (qi-desc_status[qi-free_tail] == QI_DONE ||
+  qi-desc_status[qi-free_tail] == QI_ABORT) {
qi-desc_status[qi-free_tail] = QI_FREE;
qi-free_tail = (qi-free_tail + 1) % QI_LENGTH;
qi-free_cnt++;
@@ -674,10 +675,13 @@ static inline void reclaim_free_desc(struct q_inval *qi)
 static int qi_check_fault(struct intel_iommu *iommu, int index)
 {
u32 fault;
-   int head;
+   int head, tail;
struct q_inval *qi = iommu-qi;
int wait_index = (index + 1) % QI_LENGTH;
 
+   if (qi-desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+
fault = readl(iommu-reg + DMAR_FSTS_REG);
 
/*
@@ -697,6 +701,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
}
}
 
+   /*
+* If ITE happens, all pending wait_desc commands are aborted.
+* No new descriptors are fetched until the ITE is cleared.
+*/
+   if (fault  DMA_FSTS_ITE) {
+   head = readl(iommu-reg + DMAR_IQH_REG);
+   head = ((head  DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH;
+   head |= 1;
+   tail = readl(iommu-reg + DMAR_IQT_REG);
+   tail = ((tail  DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH;
+
+   writel(DMA_FSTS_ITE, iommu-reg + DMAR_FSTS_REG);
+
+   do {
+   if (qi-desc_status[head] == QI_IN_USE)
+   qi-desc_status[head] = QI_ABORT;
+   head = (head - 2 + QI_LENGTH) % QI_LENGTH;
+   } while (head != tail);
+
+   if (qi-desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+   }
+
+   if (fault  DMA_FSTS_ICE)
+   writel(DMA_FSTS_ICE, iommu-reg + DMAR_FSTS_REG);
+
return 0;
 }
 
@@ -706,7 +736,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
  */
 int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
-   int rc = 0;
+   int rc;
struct q_inval *qi = iommu-qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
@@ -717,6 +747,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
 
hw = qi-desc;
 
+restart:
+   rc = 0;
+
spin_lock_irqsave(qi-q_lock, flags);
while (qi-free_cnt  3) {
spin_unlock_irqrestore(qi-q_lock, flags);
@@ -771,6 +804,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
reclaim_free_desc(qi);
spin_unlock_irqrestore(qi-q_lock, flags);
 
+   if (rc == -EAGAIN)
+   goto restart;
+
return rc;
 }
 
@@ -837,6 +873,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 
addr,
return qi_submit_sync(desc, iommu);
 }
 
+int qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, int qdep,
+   u64 addr, unsigned int mask)
+{
+   struct qi_desc desc;
+
+   if (mask) {
+   BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
+   addr |= (1 << (VTD_PAGE_SHIFT + mask - 1)) - 1;
+   desc.high = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE;
+   } else
+   desc.high = QI_DEV_IOTLB_ADDR(addr);
+
+   if (qdep >= QI_DEV_IOTLB_MAX_INVS)
+   qdep = 0;
+
+   desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
+  QI_DIOTLB_TYPE;
+
+   return qi_submit_sync(&desc, iommu);
+}
+
 /*
  * Enable Queued Invalidation interface. This is a must to support
  * interrupt-remapping. Also used by DMA-remapping, which replaces
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 0a220c9..d82bdac 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -196,6 +196,8 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
 #define DMA_FSTS_PPF ((u32)2)
 #define DMA_FSTS_PFO ((u32)1)
 #define DMA_FSTS_IQE (1 << 4)
+#define DMA_FSTS_ICE (1 << 5)
+#define DMA_FSTS_ITE (1 << 6)
 #define dma_fsts_fault_record_index(s) (((s) >> 8) & 0xff)
 
 /* FRCD_REG, 32 bits access */
@@ -224,7 +226,8 @@ do {
\
 enum {
QI_FREE,
QI_IN_USE,
-   QI_DONE
+   QI_DONE,
+   QI_ABORT
 };
 
 #define

[PATCH v2 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps

2009-01-17 Thread Yu Zhao
Make iommu_flush_iotlb_psi() and flush_unmaps() easier to read.
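
The mask the rewritten iommu_flush_iotlb_psi() computes up front is just the
order of the smallest power-of-two region that covers the request. A tiny
stand-alone illustration (the page count is hypothetical; roundup_pow_of_two()
and ilog2() below are userspace stand-ins for the kernel helpers):

#include <stdio.h>

static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long r = 1;

	while (r < n)
		r <<= 1;
	return r;
}

static unsigned int ilog2(unsigned long n)
{
	unsigned int l = 0;

	while (n >>= 1)
		l++;
	return l;
}

int main(void)
{
	unsigned int pages = 5;	/* hypothetical number of pages to flush */
	unsigned int mask = ilog2(roundup_pow_of_two(pages));

	/* 5 pages round up to 8, so mask = 3: PSI flushes an 8-page region,
	   or falls back to a domain-selective flush if 3 > max_amask */
	printf("pages=%u -> mask=%u\n", pages, mask);
	return 0;
}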

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c |   46 +---
 1 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 3dfecb2..df92764 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -917,30 +917,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
-   unsigned int mask;
+   int rc;
+   unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 
BUG_ON(addr & (~VTD_PAGE_MASK));
BUG_ON(pages == 0);
 
-   /* Fallback to domain selective flush if no PSI support */
-   if (!cap_pgsel_inv(iommu->cap))
-   return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH,
-   non_present_entry_flush);
-
/*
+* Fallback to domain selective flush if no PSI support or the size is
+* too big.
 * PSI requires page size to be 2 ^ x, and the base address is naturally
 * aligned to the size
 */
-   mask = ilog2(__roundup_pow_of_two(pages));
-   /* Fallback to domain selective flush if size is too big */
-   if (mask > cap_max_amask_val(iommu->cap))
-   return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH, non_present_entry_flush);
-
-   return iommu->flush.flush_iotlb(iommu, did, addr, mask,
-   DMA_TLB_PSI_FLUSH,
-   non_present_entry_flush);
+   if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
+   rc = iommu->flush.flush_iotlb(iommu, did, 0, 0,
+   DMA_TLB_DSI_FLUSH,
+   non_present_entry_flush);
+   else
+   rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
+   DMA_TLB_PSI_FLUSH,
+   non_present_entry_flush);
+   return rc;
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2293,15 +2290,16 @@ static void flush_unmaps(void)
if (!iommu)
continue;
 
-   if (deferred_flush[i].next) {
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0,
-DMA_TLB_GLOBAL_FLUSH, 0);
-   for (j = 0; j < deferred_flush[i].next; j++) {
-   __free_iova(&deferred_flush[i].domain[j]->iovad,
-   deferred_flush[i].iova[j]);
-   }
-   deferred_flush[i].next = 0;
+   if (!deferred_flush[i].next)
+   continue;
+
+   iommu->flush.flush_iotlb(iommu, 0, 0, 0,
+DMA_TLB_GLOBAL_FLUSH, 0);
+   for (j = 0; j < deferred_flush[i].next; j++) {
+   __free_iova(&deferred_flush[i].domain[j]->iovad,
+   deferred_flush[i].iova[j]);
}
+   deferred_flush[i].next = 0;
}
 
list_size = 0;
-- 
1.5.6.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/6] ATS capability support for Intel IOMMU

2009-01-07 Thread Yu Zhao
This patch series implements Address Translation Service support for
the Intel IOMMU. ATS provides the ability for the PCI Endpoint to request
the DMA address translation from the IOMMU and cache the translation
in the Endpoint to alleviate IOMMU pressure and improve the hardware
performance in the I/O virtualization environment.

[PATCH 1/6] PCI: support the ATS capability
[PATCH 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure
[PATCH 3/6] VT-d: add queue invalidation fault status support
[PATCH 4/6] VT-d: add device IOTLB invalidation support
[PATCH 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
[PATCH 6/6] VT-d: support the device IOTLB
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] PCI: support the ATS capability

2009-01-07 Thread Yu Zhao
The ATS spec can be found at http://www.pcisig.com/specifications/iov/ats/
(it requires membership).
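
For reference, this is the control word pci_enable_ats() ends up writing for
the common case of 4KB IOMMU pages; a stand-alone sketch using the register
macros added below (the page shift value is simply the one the VT-d code
passes later in the series):

#include <stdint.h>
#include <stdio.h>

/* values from the pci_regs.h hunk in this patch */
#define PCI_ATS_CTRL_ENABLE	0x8000		/* ATS Enable */
#define PCI_ATS_CTRL_STU(x)	((x) & 0x1f)	/* Smallest Translation Unit */
#define PCI_ATS_MIN_STU		12		/* shift of minimum STU block */

int main(void)
{
	int ps = 12;	/* IOMMU page shift: 2^12 = 4KB pages */
	uint16_t ctrl = PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU) | PCI_ATS_CTRL_ENABLE;

	/* STU field 0 means the smallest translation unit is 4KB; the enable
	   bit turns ATS on when written to the ATS control register */
	printf("ATS control word: %#06x\n", ctrl);	/* prints 0x8000 */
	return 0;
}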

Signed-off-by: Yu Zhao yu.z...@intel.com

---
 drivers/pci/pci.c|   68 ++
 include/linux/pci.h  |   15 ++
 include/linux/pci_regs.h |   10 +++
 3 files changed, 93 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 061d1ee..5abab14 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1337,6 +1337,74 @@ void pci_enable_ari(struct pci_dev *dev)
bridge->ari_enabled = 1;
 }
 
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @ps: the IOMMU page shift
+ *
+ * Returns 0 on success, or a negative value on error.
+ */
+int pci_enable_ats(struct pci_dev *dev, int ps)
+{
+   int pos;
+   u16 ctrl;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   if (ps < PCI_ATS_MIN_STU)
+   return -EINVAL;
+
+   ctrl = PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU) | PCI_ATS_CTRL_ENABLE;
+   pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl);
+
+   dev->ats_enabled = 1;
+
+   return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+   int pos;
+   u16 ctrl;
+
+   if (!dev->ats_enabled)
+   return;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return;
+
+   pci_read_config_word(dev, pos + PCI_ATS_CTRL, &ctrl);
+   ctrl &= ~PCI_ATS_CTRL_ENABLE;
+   pci_write_config_word(dev, pos + PCI_ATS_CTRL, ctrl);
+}
+
+/**
+ * pci_ats_qdep - query ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or 0 on error.
+ */
+int pci_ats_qdep(struct pci_dev *dev)
+{
+   int pos;
+   u16 cap;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return 0;
+
+   pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+
+   return PCI_ATS_CAP_QDEP(cap) ? : PCI_ATS_MAX_QDEP;
+}
+
 int
 pci_get_interrupt_pin(struct pci_dev *dev, struct pci_dev **bridge)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 4bb156b..e6a1b5a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -227,6 +227,7 @@ struct pci_dev {
unsigned int	msi_enabled:1;
unsigned int	msix_enabled:1;
unsigned int	ari_enabled:1;	/* ARI forwarding */
+   unsigned int	ats_enabled:1;	/* Address Translation Service */
unsigned int	is_managed:1;
unsigned int	is_pcie:1;
pci_dev_flags_t dev_flags;
@@ -1155,5 +1156,19 @@ static inline void __iomem *pci_ioremap_bar(struct 
pci_dev *pdev, int bar)
 }
 #endif
 
+extern int pci_enable_ats(struct pci_dev *dev, int ps);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_qdep(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+   return dev->ats_enabled;
+}
+
 #endif /* __KERNEL__ */
 #endif /* LINUX_PCI_H */
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index e5effd4..00c9db5 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -436,6 +436,7 @@
 #define PCI_EXT_CAP_ID_DSN 3
 #define PCI_EXT_CAP_ID_PWR 4
 #define PCI_EXT_CAP_ID_ARI 14
+#define PCI_EXT_CAP_ID_ATS 15
 
 /* Advanced Error Reporting */
 #define PCI_ERR_UNCOR_STATUS   4   /* Uncorrectable Error Status */
@@ -553,4 +554,13 @@
 #define  PCI_ARI_CTRL_ACS  0x0002  /* ACS Function Groups Enable */
#define  PCI_ARI_CTRL_FG(x)	(((x) >> 4) & 7) /* Function Group */
 
+/* Address Translation Service */
+#define PCI_ATS_CAP		0x04	/* ATS Capability Register */
+#define  PCI_ATS_CAP_QDEP(x)	((x) & 0x1f)	/* Invalidate Queue Depth */
+#define  PCI_ATS_MAX_QDEP	32	/* Max Invalidate Queue Depth */
+#define PCI_ATS_CTRL		0x06	/* ATS Control Register */
+#define  PCI_ATS_CTRL_ENABLE	0x8000	/* ATS Enable */
+#define  PCI_ATS_CTRL_STU(x)	((x) & 0x1f)	/* Smallest Translation Unit */
+#define  PCI_ATS_MIN_STU   12  /* shift of minimum STU block */
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.5.6.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] VT-d: parse ATSR in DMA Remapping Reporting Structure

2009-01-07 Thread Yu Zhao
Parse the Root Port ATS Capability Reporting Structure in DMA Remapping
Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com

---
 drivers/pci/dmar.c  |  114 --
 include/linux/dmar.h|9 +++
 include/linux/intel-iommu.h |1 +
 3 files changed, 118 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f5a662a..f2859d1 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -254,6 +254,86 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
}
return ret;
 }
+
+LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+   atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+   if (!atsru)
+   return -ENOMEM;
+
+   atsru->hdr = hdr;
+   atsru->include_all = atsr->flags & 0x1;
+
+   if (atsru->include_all)
+   list_add_tail(&atsru->list, &dmar_atsr_units);
+   else
+   list_add(&atsru->list, &dmar_atsr_units);
+
+   return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+   int ret = 0;
+   struct acpi_dmar_atsr *atsr;
+
+   atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+   if (!atsru->include_all)
+   ret = dmar_parse_dev_scope((void *)(atsr + 1),
+   (void *)atsr + atsr->header.length,
+   &atsru->devices_cnt, &atsru->devices,
+   atsr->segment);
+
+   if (ret || !(atsru->include_all || atsru->devices_cnt)) {
+   list_del(&atsru->list);
+   kfree(atsru);
+   }
+
+   return ret;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+   int i;
+   struct pci_bus *bus;
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   list_for_each_entry(atsru, &dmar_atsr_units, list) {
+   atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+   if (atsr->segment == pci_domain_nr(dev->bus))
+   goto found;
+   }
+
+   return 0;
+
+found:
+   for (bus = dev->bus; bus; bus = bus->parent) {
+   struct pci_dev *bridge = bus->self;
+
+   if (!bridge || !bridge->is_pcie ||
+   bridge->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+   return 0;
+
+   if (bridge->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   for (i = 0; i < atsru->devices_cnt; i++)
+   if (atsru->devices[i] == bridge)
+   return 1;
+   break;
+   }
+   }
+
+   if (atsru->include_all)
+   return 1;
+
+   return 0;
+}
 #endif
 
 static void __init
@@ -261,22 +341,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
 {
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_reserved_memory *rmrr;
+   struct acpi_dmar_atsr *atsr;
 
switch (header->type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-   drhd = (struct acpi_dmar_hardware_unit *)header;
+   drhd = container_of(header, struct acpi_dmar_hardware_unit,
+   header);
printk (KERN_INFO PREFIX
-   "DRHD (flags: 0x%08x)base: 0x%016Lx\n",
-   drhd->flags, (unsigned long long)drhd->address);
+   "DRHD base: %#016Lx flags: %#x\n",
+   (unsigned long long)drhd->address, drhd->flags);
break;
case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-   rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+   rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+   header);
printk (KERN_INFO PREFIX
-   "RMRR base: 0x%016Lx end: 0x%016Lx\n",
+   "RMRR base: %#016Lx end: %#016Lx\n",
(unsigned long long)rmrr->base_address,
(unsigned long long)rmrr->end_address);
break;
+   case ACPI_DMAR_TYPE_ATSR:
+   atsr = container_of(header, struct acpi_dmar_atsr, header);
+   printk(KERN_INFO PREFIX "ATSR flags: %#x\n", atsr->flags);
+   break;
+   break;
}
 }
 
@@ -341,6 +427,11 @@ parse_dmar_table(void)
ret = dmar_parse_one_rmrr(entry_header);
 #endif
break;
+   case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+   ret = dmar_parse_one_atsr(entry_header);
+#endif
+   break;
default:
printk(KERN_WARNING PREFIX
Unknown DMAR structure type\n

[PATCH 3/6] VT-d: add queue invalidation fault status support

2009-01-07 Thread Yu Zhao
Check the fault register after submitting a queue invalidation request.
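
When an Invalidation Queue Error is flagged, the head register tells which
descriptor the hardware choked on. A stand-alone sketch of the index
computation qi_check_fault() performs (DMAR_IQ_OFFSET is assumed to be 4 here,
matching the "<< 4" shift this patch replaces; the register value is made up):

#include <stdio.h>

#define DMAR_IQ_OFFSET	4	/* assumed: the queue registers carry byte
				   offsets, one 16-byte descriptor per slot */

int main(void)
{
	unsigned int head_reg = 0x50;	/* hypothetical DMAR_IQH_REG value */
	unsigned int index = head_reg >> DMAR_IQ_OFFSET;

	/* byte offset 0x50 / 16 = slot 5; qi_check_fault() compares this slot
	   with the descriptor it just submitted before clearing the fault */
	printf("faulting descriptor index: %u\n", index);
	return 0;
}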

Signed-off-by: Yu Zhao yu.z...@intel.com

---
 drivers/pci/dmar.c   |   59 +++--
 drivers/pci/intr_remapping.c |   21 --
 include/linux/intel-iommu.h  |4 ++-
 3 files changed, 59 insertions(+), 25 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f2859d1..eb77258 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -673,19 +673,49 @@ static inline void reclaim_free_desc(struct q_inval *qi)
}
 }
 
+static int qi_check_fault(struct intel_iommu *iommu, int index)
+{
+   u32 fault;
+   int head;
+   struct q_inval *qi = iommu->qi;
+   int wait_index = (index + 1) % QI_LENGTH;
+
+   fault = readl(iommu->reg + DMAR_FSTS_REG);
+
+   /*
+* If IQE happens, the head points to the descriptor associated
+* with the error. No new descriptors are fetched until the IQE
+* is cleared.
+*/
+   if (fault & DMA_FSTS_IQE) {
+   head = readl(iommu->reg + DMAR_IQH_REG);
+   if ((head >> DMAR_IQ_OFFSET) == index) {
+   memcpy(&qi->desc[index], &qi->desc[wait_index],
+   sizeof(struct qi_desc));
+   __iommu_flush_cache(iommu, &qi->desc[index],
+   sizeof(struct qi_desc));
+   writel(DMA_FSTS_IQE, iommu->reg + DMAR_FSTS_REG);
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
 /*
  * Submit the queued invalidation descriptor to the remapping
  * hardware unit and wait for its completion.
  */
-void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
+int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
+   int rc = 0;
struct q_inval *qi = iommu-qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
unsigned long flags;
 
if (!qi)
-   return;
+   return 0;
 
hw = qi-desc;
 
@@ -703,7 +733,8 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 
hw[index] = *desc;
 
-   wait_desc.low = QI_IWD_STATUS_DATA(2) | QI_IWD_STATUS_WRITE | 
QI_IWD_TYPE;
+   wait_desc.low = QI_IWD_STATUS_DATA(QI_DONE) |
+   QI_IWD_STATUS_WRITE | QI_IWD_TYPE;
wait_desc.high = virt_to_phys(&qi->desc_status[wait_index]);
 
hw[wait_index] = wait_desc;
@@ -714,13 +745,11 @@ void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
qi->free_head = (qi->free_head + 2) % QI_LENGTH;
qi->free_cnt -= 2;
 
-   spin_lock(&iommu->register_lock);
/*
 * update the HW tail register indicating the presence of
 * new descriptors.
 */
-   writel(qi->free_head << 4, iommu->reg + DMAR_IQT_REG);
-   spin_unlock(&iommu->register_lock);
+   writel(qi->free_head << DMAR_IQ_OFFSET, iommu->reg + DMAR_IQT_REG);
 
while (qi->desc_status[wait_index] != QI_DONE) {
/*
@@ -730,6 +759,10 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 * a deadlock where the interrupt context can wait indefinitely
 * for free slots in the queue.
 */
+   rc = qi_check_fault(iommu, index);
+   if (rc)
+   break;
+
spin_unlock(&qi->q_lock);
cpu_relax();
spin_lock(&qi->q_lock);
@@ -739,6 +772,8 @@ void qi_submit_sync(struct qi_desc *desc, struct 
intel_iommu *iommu)
 
reclaim_free_desc(qi);
spin_unlock_irqrestore(&qi->q_lock, flags);
+
+   return rc;
 }
 
 /*
@@ -751,13 +786,13 @@ void qi_global_iec(struct intel_iommu *iommu)
desc.low = QI_IEC_TYPE;
desc.high = 0;
 
+   /* should never fail */
qi_submit_sync(&desc, iommu);
 }
 
 int qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm,
 u64 type, int non_present_entry_flush)
 {
-
struct qi_desc desc;
 
if (non_present_entry_flush) {
@@ -771,10 +806,7 @@ int qi_flush_context(struct intel_iommu *iommu, u16 did, 
u16 sid, u8 fm,
| QI_CC_GRAN(type) | QI_CC_TYPE;
desc.high = 0;
 
-   qi_submit_sync(&desc, iommu);
-
-   return 0;
-
+   return qi_submit_sync(&desc, iommu);
 }
 
 int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
@@ -804,10 +836,7 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 
addr,
desc.high = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih)
| QI_IOTLB_AM(size_order);
 
-   qi_submit_sync(&desc, iommu);
-
-   return 0;
-
+   return qi_submit_sync(&desc, iommu);
 }
 
 /*
diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c
index f78371b..45effc5 100644
--- a/drivers/pci/intr_remapping.c
+++ b/drivers/pci/intr_remapping.c

[PATCH 4/6] VT-d: add device IOTLB invalidation support

2009-01-07 Thread Yu Zhao
Support device IOTLB invalidation to flush the translation cached in the
Endpoint.

Signed-off-by: Yu Zhao yu.z...@intel.com

---
 drivers/pci/dmar.c  |   63 --
 include/linux/intel-iommu.h |   13 -
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index eb77258..88f6b1f 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -666,7 +666,8 @@ void free_iommu(struct intel_iommu *iommu)
  */
 static inline void reclaim_free_desc(struct q_inval *qi)
 {
-   while (qi->desc_status[qi->free_tail] == QI_DONE) {
+   while (qi->desc_status[qi->free_tail] == QI_DONE ||
+  qi->desc_status[qi->free_tail] == QI_ABORT) {
qi->desc_status[qi->free_tail] = QI_FREE;
qi->free_tail = (qi->free_tail + 1) % QI_LENGTH;
qi->free_cnt++;
@@ -676,10 +677,13 @@ static inline void reclaim_free_desc(struct q_inval *qi)
 static int qi_check_fault(struct intel_iommu *iommu, int index)
 {
u32 fault;
-   int head;
+   int head, tail;
struct q_inval *qi = iommu->qi;
int wait_index = (index + 1) % QI_LENGTH;
 
+   if (qi->desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+
fault = readl(iommu->reg + DMAR_FSTS_REG);
 
/*
@@ -699,6 +703,32 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
}
}
 
+   /*
+* If ITE happens, all pending wait_desc commands are aborted.
+* No new descriptors are fetched until the ITE is cleared.
+*/
+   if (fault & DMA_FSTS_ITE) {
+   head = readl(iommu->reg + DMAR_IQH_REG);
+   head = ((head >> DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH;
+   head |= 1;
+   tail = readl(iommu->reg + DMAR_IQT_REG);
+   tail = ((tail >> DMAR_IQ_OFFSET) - 1 + QI_LENGTH) % QI_LENGTH;
+
+   writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
+
+   do {
+   if (qi->desc_status[head] == QI_IN_USE)
+   qi->desc_status[head] = QI_ABORT;
+   head = (head - 2 + QI_LENGTH) % QI_LENGTH;
+   } while (head != tail);
+
+   if (qi->desc_status[wait_index] == QI_ABORT)
+   return -EAGAIN;
+   }
+
+   if (fault & DMA_FSTS_ICE)
+   writel(DMA_FSTS_ICE, iommu->reg + DMAR_FSTS_REG);
+
return 0;
 }
 
@@ -708,7 +738,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index)
  */
 int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
 {
-   int rc = 0;
+   int rc;
struct q_inval *qi = iommu-qi;
struct qi_desc *hw, wait_desc;
int wait_index, index;
@@ -719,6 +749,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
 
hw = qi-desc;
 
+restart:
+   rc = 0;
+
spin_lock_irqsave(&qi->q_lock, flags);
while (qi->free_cnt < 3) {
spin_unlock_irqrestore(&qi->q_lock, flags);
@@ -773,6 +806,9 @@ int qi_submit_sync(struct qi_desc *desc, struct intel_iommu 
*iommu)
reclaim_free_desc(qi);
spin_unlock_irqrestore(&qi->q_lock, flags);
 
+   if (rc == -EAGAIN)
+   goto restart;
+
return rc;
 }
 
@@ -839,6 +875,27 @@ int qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 
addr,
return qi_submit_sync(desc, iommu);
 }
 
+int qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, int qdep,
+   u64 addr, unsigned int mask)
+{
+   struct qi_desc desc;
+
+   if (mask) {
+   BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
+   addr |= (1 << (VTD_PAGE_SHIFT + mask - 1)) - 1;
+   desc.high = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE;
+   } else
+   desc.high = QI_DEV_IOTLB_ADDR(addr);
+
+   if (qdep >= QI_DEV_IOTLB_MAX_INVS)
+   qdep = 0;
+
+   desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
+  QI_DIOTLB_TYPE;
+
+   return qi_submit_sync(&desc, iommu);
+}
+
 /*
  * Enable Queued Invalidation interface. This is a must to support
  * interrupt-remapping. Also used by DMA-remapping, which replaces
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 0a220c9..d82bdac 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -196,6 +196,8 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
 #define DMA_FSTS_PPF ((u32)2)
 #define DMA_FSTS_PFO ((u32)1)
 #define DMA_FSTS_IQE (1 << 4)
+#define DMA_FSTS_ICE (1 << 5)
+#define DMA_FSTS_ITE (1 << 6)
 #define dma_fsts_fault_record_index(s) (((s) >> 8) & 0xff)
 
 /* FRCD_REG, 32 bits access */
@@ -224,7 +226,8 @@ do {
\
 enum {
QI_FREE,
QI_IN_USE,
-   QI_DONE
+   QI_DONE,
+   QI_ABORT
 };
 
 #define

[PATCH 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps

2009-01-07 Thread Yu Zhao
Make iommu_flush_iotlb_psi() and flush_unmaps() easier to read.

Signed-off-by: Yu Zhao yu.z...@intel.com

---
 drivers/pci/intel-iommu.c |   46 +---
 1 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 235fb7a..261b6bd 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -916,30 +916,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
-   unsigned int mask;
+   int rc;
+   unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 
BUG_ON(addr & (~VTD_PAGE_MASK));
BUG_ON(pages == 0);
 
-   /* Fallback to domain selective flush if no PSI support */
-   if (!cap_pgsel_inv(iommu->cap))
-   return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH,
-   non_present_entry_flush);
-
/*
+* Fallback to domain selective flush if no PSI support or the size is
+* too big.
 * PSI requires page size to be 2 ^ x, and the base address is naturally
 * aligned to the size
 */
-   mask = ilog2(__roundup_pow_of_two(pages));
-   /* Fallback to domain selective flush if size is too big */
-   if (mask > cap_max_amask_val(iommu->cap))
-   return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH, non_present_entry_flush);
-
-   return iommu->flush.flush_iotlb(iommu, did, addr, mask,
-   DMA_TLB_PSI_FLUSH,
-   non_present_entry_flush);
+   if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
+   rc = iommu->flush.flush_iotlb(iommu, did, 0, 0,
+   DMA_TLB_DSI_FLUSH,
+   non_present_entry_flush);
+   else
+   rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
+   DMA_TLB_PSI_FLUSH,
+   non_present_entry_flush);
+   return rc;
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2292,15 +2289,16 @@ static void flush_unmaps(void)
if (!iommu)
continue;
 
-   if (deferred_flush[i].next) {
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0,
-DMA_TLB_GLOBAL_FLUSH, 0);
-   for (j = 0; j < deferred_flush[i].next; j++) {
-   __free_iova(&deferred_flush[i].domain[j]->iovad,
-   deferred_flush[i].iova[j]);
-   }
-   deferred_flush[i].next = 0;
+   if (!deferred_flush[i].next)
+   continue;
+
+   iommu->flush.flush_iotlb(iommu, 0, 0, 0,
+DMA_TLB_GLOBAL_FLUSH, 0);
+   for (j = 0; j < deferred_flush[i].next; j++) {
+   __free_iova(&deferred_flush[i].domain[j]->iovad,
+   deferred_flush[i].iova[j]);
}
+   deferred_flush[i].next = 0;
}
 
list_size = 0;
-- 
1.5.6.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] VT-d: support the device IOTLB

2009-01-07 Thread Yu Zhao
Support device IOTLB (i.e. ATS) for both native and KVM environments.
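
The per-device flush added here identifies the endpoint by its requester ID,
built exactly as in iommu_flush_dev_iotlb() below. A stand-alone example of
that encoding (the bus/device/function numbers are hypothetical):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical endpoint 03:02.0 attached to the domain */
	uint8_t bus = 0x03;
	uint8_t devfn = (2 << 3) | 0;	/* device 2, function 0 */

	/* source-id placed in the device-IOTLB invalidate descriptor */
	uint16_t sid = bus << 8 | devfn;

	printf("sid = %#06x\n", sid);	/* prints 0x0310 */
	return 0;
}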

Signed-off-by: Yu Zhao yu.z...@intel.com

---
 drivers/pci/intel-iommu.c   |   97 +-
 include/linux/intel-iommu.h |1 +
 2 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 261b6bd..a7ff7cb 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -125,6 +125,7 @@ static inline void context_set_fault_enable(struct 
context_entry *context)
 }
 
 #define CONTEXT_TT_MULTI_LEVEL 0
+#define CONTEXT_TT_DEV_IOTLB   1
 
 static inline void context_set_translation_type(struct context_entry *context,
unsigned long value)
@@ -240,6 +241,8 @@ struct device_domain_info {
struct list_head global; /* link to global list */
u8 bus; /* PCI bus numer */
u8 devfn;   /* PCI devfn number */
+   int qdep;   /* invalidate queue depth */
+   struct intel_iommu *iommu; /* IOMMU used by this device */
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct dmar_domain *domain; /* pointer to domain */
 };
@@ -913,6 +916,75 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, 
u16 did,
return 0;
 }
 
+static struct device_domain_info *
+iommu_support_dev_iotlb(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+   int found = 0;
+   unsigned long flags;
+   struct device_domain_info *info;
+   struct intel_iommu *iommu = device_to_iommu(bus, devfn);
+
+   if (!ecap_dev_iotlb_support(iommu->ecap))
+   return NULL;
+
+   if (!iommu->qi)
+   return NULL;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   list_for_each_entry(info, &domain->devices, link)
+   if (info->dev && info->bus == bus && info->devfn == devfn) {
+   found = 1;
+   break;
+   }
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+
+   if (!found)
+   return NULL;
+
+   if (!dmar_find_matched_atsr_unit(info->dev))
+   return NULL;
+
+   info->iommu = iommu;
+   info->qdep = pci_ats_qdep(info->dev);
+   if (!info->qdep)
+   return NULL;
+
+   return info;
+}
+
+static void iommu_enable_dev_iotlb(struct device_domain_info *info)
+{
+   pci_enable_ats(info->dev, VTD_PAGE_SHIFT);
+}
+
+static void iommu_disable_dev_iotlb(struct device_domain_info *info)
+{
+   if (info->dev && pci_ats_enabled(info->dev))
+   pci_disable_ats(info->dev);
+}
+
+static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
+ u64 addr, unsigned int mask)
+{
+   int rc;
+   u16 sid;
+   unsigned long flags;
+   struct device_domain_info *info;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   list_for_each_entry(info, &domain->devices, link) {
+   if (!info->dev || !pci_ats_enabled(info->dev))
+   continue;
+
+   sid = info->bus << 8 | info->devfn;
+   rc = qi_flush_dev_iotlb(info->iommu, sid,
+   info->qdep, addr, mask);
+   if (rc)
+   printk(KERN_ERR "IOMMU: flush device IOTLB failed\n");
+   }
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
@@ -936,6 +1008,9 @@ static int iommu_flush_iotlb_psi(struct intel_iommu 
*iommu, u16 did,
rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
DMA_TLB_PSI_FLUSH,
non_present_entry_flush);
+   if (!rc && !non_present_entry_flush)
+   iommu_flush_dev_iotlb(iommu->domains[did], addr, mask);
+
return rc;
 }
 
@@ -1460,6 +1535,7 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
unsigned long ndomains;
int id;
int agaw;
+   struct device_domain_info *info;
 
pr_debug("Set context mapping for %02x:%02x.%d\n",
bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -1525,7 +1601,11 @@ static int domain_context_mapping_one(struct dmar_domain 
*domain,
context_set_domain_id(context, id);
context_set_address_width(context, iommu->agaw);
context_set_address_root(context, virt_to_phys(pgd));
-   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
+   info = iommu_support_dev_iotlb(domain, bus, devfn);
+   if (info)
+   context_set_translation_type(context, CONTEXT_TT_DEV_IOTLB);
+   else
+   context_set_translation_type(context, CONTEXT_TT_MULTI_LEVEL);
context_set_fault_enable(context);
context_set_present(context

Re: [SR-IOV driver example 0/3] introduction

2008-12-01 Thread Yu Zhao
On Thu, Nov 27, 2008 at 04:14:48AM +0800, Jeff Garzik wrote:
 Yu Zhao wrote:
  SR-IOV drivers of Intel 82576 NIC are available. There are two parts
  of the drivers: Physical Function driver and Virtual Function driver.
  The PF driver is based on the IGB driver and is used to control PF to
  allocate hardware specific resources and interface with the SR-IOV core.
  The VF driver is a new NIC driver that is same as the traditional PCI
  device driver. It works in both the host and the guest (Xen and KVM)
  environment.
  
  These two drivers are testing versions and they are *only* intended to
  show how to use SR-IOV API.
  
  Intel 82576 NIC specification can be found at:
  http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
  
  [SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
  [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
  [SR-IOV driver example 3/3] VF driver tar ball
 
 Please copy [EMAIL PROTECTED] on all network-related patches.  This 
 is where the network developers live, and all patches on this list are 
 automatically archived for review and handling at 
 http://patchwork.ozlabs.org/project/netdev/list/

Will do.

Thanks,
Yu
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

2008-12-01 Thread Yu Zhao
On Thu, Nov 27, 2008 at 12:58:59AM +0800, Greg KH wrote:
 On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
  +   my_mac_addr[5] = (unsigned char)i;
  +   igb_set_vf_mac(netdev, i, my_mac_addr);
  +   igb_set_vf_vmolr(adapter, i);
  +   }
  +   } else
  +   printk(KERN_INFO "SR-IOV is disabled\n");
 
 Is that really true?  (oh, use dev_info as well.)  What happens if you
 had called this with 5 and then later with 0, you never destroyed
 those existing virtual functions, yet the code does:
 
  +   adapter-vfs_allocated_count = nr_virtfn;
 
 Which makes the driver think they are not present.  What happens when
 the driver later goes to shut down?  Are those resources freed up
 properly?

For now we hard-code the tx/rx queues allocation so this doesn't
matter. Eventually this will become dynamic allocation: when number
of VFs changes the corresponding resources need to be freed.

I'll put more comments here.

Thanks,
Yu
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SR-IOV driver example 0/3] introduction

2008-12-01 Thread Yu Zhao
On Thu, Nov 27, 2008 at 12:59:33AM +0800, Greg KH wrote:
 On Wed, Nov 26, 2008 at 10:03:03PM +0800, Yu Zhao wrote:
  SR-IOV drivers of Intel 82576 NIC are available. There are two parts
  of the drivers: Physical Function driver and Virtual Function driver.
  The PF driver is based on the IGB driver and is used to control PF to
  allocate hardware specific resources and interface with the SR-IOV core.
  The VF driver is a new NIC driver that is same as the traditional PCI
  device driver. It works in both the host and the guest (Xen and KVM)
  environment.
  
  These two drivers are testing versions and they are *only* intended to
  show how to use SR-IOV API.
 
 That's funny, as some distros are already shipping this driver.  You
 might want to tell them that this is an example only driver and not to
 be used for real... :(

Maybe they are shipping another version, not this one. This one is really
an experimental patch; it was created just a week ago...
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[SR-IOV driver example 0/3 resend] introduction

2008-12-01 Thread Yu Zhao
SR-IOV drivers of Intel 82576 NIC are available. There are two parts
of the drivers: Physical Function driver and Virtual Function driver.
The PF driver is based on the IGB driver and is used to control PF to
allocate hardware specific resources and interface with the SR-IOV core.
The VF driver is a new NIC driver that is the same as a traditional PCI
device driver. It works in both the host and the guest (Xen and KVM)
environment.

These two drivers are testing versions and they are *only* intended to
show how to use SR-IOV API.

Intel 82576 NIC specification can be found at:
http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

[SR-IOV driver example 0/3 resend] introduction
[SR-IOV driver example 1/3 resend] PF driver: hardware specific operations
[SR-IOV driver example 2/3 resend] PF driver: integrate with SR-IOV core
[SR-IOV driver example 3/3 resend] VF driver: an independent PCI NIC driver
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

