Re: [PATCH v1 2/2] mm/memory_hotplug: remove is_mem_section_removable()

2020-04-07 Thread Wei Yang
On Tue, Apr 07, 2020 at 03:54:16PM +0200, David Hildenbrand wrote:
>Fortunately, all users of is_mem_section_removable() are gone. Get rid of
>it, including some now unnecessary functions.
>
>Cc: Michael Ellerman 
>Cc: Benjamin Herrenschmidt 
>Cc: Michal Hocko 
>Cc: Andrew Morton 
>Cc: Oscar Salvador 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

Nice.

Reviewed-by: Wei Yang 

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Wei Yang
On Thu, Mar 26, 2020 at 07:02:35PM +0530, Aneesh Kumar K.V wrote:
>Fixes the below crash
>
>BUG: Kernel NULL pointer dereference on read at 0x
>Faulting instruction address: 0xc0c3447c
>Oops: Kernel access of bad area, sig: 11 [#1]
>LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
>...
>NIP [c0c3447c] vmemmap_populated+0x98/0xc0
>LR [c0088354] vmemmap_free+0x144/0x320
>Call Trace:
> section_deactivate+0x220/0x240
> __remove_pages+0x118/0x170
> arch_remove_memory+0x3c/0x150
> memunmap_pages+0x1cc/0x2f0
> devm_action_release+0x30/0x50
> release_nodes+0x2f8/0x3e0
> device_release_driver_internal+0x168/0x270
> unbind_store+0x130/0x170
> drv_attr_store+0x44/0x60
> sysfs_kf_write+0x68/0x80
> kernfs_fop_write+0x100/0x290
> __vfs_write+0x3c/0x70
> vfs_write+0xcc/0x240
> ksys_write+0x7c/0x140
> system_call+0x5c/0x68
>
>The crash is due to a NULL pointer dereference in pfn_section_valid() at
>
>test_bit(idx, ms->usage->subsection_map);
>
>because ms->usage is NULL.
>
>With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
>SPARSEMEM|!VMEMMAP case"), section_mem_map is set to NULL after
>depopulate_section_memmap(). This was done so that pfn_to_page() works
>correctly with a kernel config that disables SPARSEMEM_VMEMMAP. With that
>config, pfn_to_page() does
>
>   __section_mem_map_addr(__sec) + __pfn;
>where
>
>static inline struct page *__section_mem_map_addr(struct mem_section *section)
>{
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
>}
>
>Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used
>to check pfn validity (pfn_valid()). Since section_deactivate() releases
>mem_section->usage once a section is fully deactivated, a pfn_valid() check
>after such a deactivation causes a kernel crash.
>
>static inline int pfn_valid(unsigned long pfn)
>{
>...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
>}
>
>where
>
>static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>{
>   int idx = subsection_map_index(pfn);
>
>   return test_bit(idx, ms->usage->subsection_map);
>}
>
>Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
>On architectures like ppc64, where large pages (16MB) are used for the
>vmemmap mapping, a single vmemmap mapping can cover multiple sections. Hence,
>before a vmemmap mapping page can be freed, the kernel needs to make sure
>there are no valid sections within that mapping. Clearing the section valid
>bit before depopulate_section_memmap() enables this.
>
>Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
>Reported-by: Sachin Sant 
>Tested-by: Sachin Sant 
>Cc: Baoquan He 
>Cc: Michael Ellerman 
>Cc: Dan Williams 
>Cc: Pankaj Gupta 
>Cc: David Hildenbrand 
>Cc: Michal Hocko 
>Cc: Wei Yang 
>Cc: Oscar Salvador 
>Cc: Mike Rapoport 
>Cc: 
>Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Wei Yang 
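
For readers following the crash analysis, here is a minimal user-space sketch
of the ordering the fix relies on. It is not the kernel code: the struct
layouts, the flag value, the subsection count, and the model_* names are
simplified assumptions made for illustration. The point it shows is that once
the "has mem map" flag is cleared together with dropping usage, the validity
check returns before usage is ever dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; values are illustrative. */
#define SUBSECTIONS_PER_SECTION 64
#define SECTION_HAS_MEM_MAP (1UL << 1)

struct mem_section_usage {
	bool subsection_map[SUBSECTIONS_PER_SECTION];
};

struct mem_section {
	unsigned long section_mem_map;   /* flag bits live in the low bits */
	struct mem_section_usage *usage; /* NULL once fully deactivated */
};

static bool valid_section(const struct mem_section *ms)
{
	return ms && (ms->section_mem_map & SECTION_HAS_MEM_MAP);
}

static bool pfn_section_valid(const struct mem_section *ms, unsigned int idx)
{
	/* This is the dereference that crashed when usage was NULL. */
	assert(ms->usage != NULL && idx < SUBSECTIONS_PER_SECTION);
	return ms->usage->subsection_map[idx];
}

/* Model of the lookup: the flag test runs before usage is touched. */
static bool model_pfn_valid(const struct mem_section *ms, unsigned int idx)
{
	if (!valid_section(ms))
		return false;
	return pfn_section_valid(ms, idx);
}

/* Model of full deactivation: clear the flag when usage goes away. */
static void model_section_deactivate(struct mem_section *ms)
{
	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
	ms->usage = NULL;
}
```

With the flag cleared in model_section_deactivate(), a later
model_pfn_valid() call simply reports "not valid" instead of walking into the
freed usage pointer.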

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2 6/8] mm/memory_hotplug: unexport memhp_auto_online

2020-03-17 Thread Wei Yang
On Tue, Mar 17, 2020 at 11:49:40AM +0100, David Hildenbrand wrote:
>All in-tree users except the mm-core are gone. Let's drop the export.
>
>Cc: Andrew Morton 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: "Rafael J. Wysocki" 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 

>---
> mm/memory_hotplug.c | 1 -
> 1 file changed, 1 deletion(-)
>
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index 1a00b5a37ef6..2d2aae830b92 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -71,7 +71,6 @@ bool memhp_auto_online;
> #else
> bool memhp_auto_online = true;
> #endif
>-EXPORT_SYMBOL_GPL(memhp_auto_online);
> 
> static int __init setup_memhp_default_state(char *str)
> {
>-- 
>2.24.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2 4/8] powernv/memtrace: always online added memory blocks

2020-03-17 Thread Wei Yang
On Tue, Mar 17, 2020 at 11:49:38AM +0100, David Hildenbrand wrote:
>Let's always try to online the re-added memory blocks. In case add_memory()
>already onlined the added memory blocks, the first device_online() call
>will fail and stop processing the remaining memory blocks.
>
>This avoids manually having to check memhp_auto_online.
>
>Note: PPC always onlines all hotplugged memory directly from the kernel
>as well - something that is handled by user space on other
>architectures.
>
>Cc: Benjamin Herrenschmidt 
>Cc: Paul Mackerras 
>Cc: Michael Ellerman 
>Cc: Andrew Morton 
>Cc: Greg Kroah-Hartman 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: "Rafael J. Wysocki" 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Cc: linuxppc-dev@lists.ozlabs.org
>Signed-off-by: David Hildenbrand 

Looks good.

Reviewed-by: Wei Yang 
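
A small user-space model of why the unconditional walk is harmless, under the
assumption that the walker stops at the first nonzero return and that onlining
an already-online block returns nonzero. The names walk_blocks() and
online_block() are hypothetical stand-ins, not the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

enum block_state { BLOCK_OFFLINE, BLOCK_ONLINE };

typedef int (*block_fn)(enum block_state *block);

/* Visit each block in order and stop at the first nonzero return value. */
static int walk_blocks(enum block_state *blocks, size_t n, block_fn fn)
{
	assert(fn != NULL);
	for (size_t i = 0; i < n; i++) {
		int ret = fn(&blocks[i]);

		if (ret)
			return ret;
	}
	return 0;
}

/* Online one block; an already-online block ends the walk early. */
static int online_block(enum block_state *block)
{
	if (*block == BLOCK_ONLINE)
		return 1;
	*block = BLOCK_ONLINE;
	return 0;
}
```

If the blocks were added offline, the walk onlines all of them. If they were
already onlined when added, the first callback returns nonzero and the rest
are skipped, so no separate check of the default policy is needed.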

>---
> arch/powerpc/platforms/powernv/memtrace.c | 14 --
> 1 file changed, 4 insertions(+), 10 deletions(-)
>
>diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
>index d6d64f8718e6..13b369d2cc45 100644
>--- a/arch/powerpc/platforms/powernv/memtrace.c
>+++ b/arch/powerpc/platforms/powernv/memtrace.c
>@@ -231,16 +231,10 @@ static int memtrace_online(void)
>   continue;
>   }
> 
>-  /*
>-   * If kernel isn't compiled with the auto online option
>-   * we need to online the memory ourselves.
>-   */
>-  if (!memhp_auto_online) {
>-  lock_device_hotplug();
>-  walk_memory_blocks(ent->start, ent->size, NULL,
>- online_mem_block);
>-  unlock_device_hotplug();
>-  }
>+  lock_device_hotplug();
>+  walk_memory_blocks(ent->start, ent->size, NULL,
>+ online_mem_block);
>+  unlock_device_hotplug();
> 
>   /*
>* Memory was added successfully so clean up references to it
>-- 
>2.24.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v1 3/5] drivers/base/memory: store mapping between MMOP_* and string in an array

2020-03-11 Thread Wei Yang
On Wed, Mar 11, 2020 at 02:20:02PM +, Wei Yang wrote:
>On Wed, Mar 11, 2020 at 01:30:24PM +0100, David Hildenbrand wrote:
>>Let's use a simple array which we can reuse soon. While at it, move the
>>string->mmop conversion out of the device hotplug lock.
>>
>>Cc: Greg Kroah-Hartman 
>>Cc: Andrew Morton 
>>Cc: Michal Hocko 
>>Cc: Oscar Salvador 
>>Cc: "Rafael J. Wysocki" 
>>Cc: Baoquan He 
>>Cc: Wei Yang 
>>Signed-off-by: David Hildenbrand 

Ok, I got the reason.

Reviewed-by: Wei Yang 
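
The table-driven conversion from the patch can be tried out in user space
with the sketch below. The enum values and array follow the quoted diff;
streq_nl() is a hypothetical stand-in that approximates sysfs_streq() (equal
up to one trailing newline), since that helper is not available outside the
kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *const online_type_to_str[] = {
	[MMOP_OFFLINE] = "offline",
	[MMOP_ONLINE] = "online",
	[MMOP_ONLINE_KERNEL] = "online_kernel",
	[MMOP_ONLINE_MOVABLE] = "online_movable",
};

/* The array index is the online_type, so the table must cover every value. */
static_assert(ARRAY_SIZE(online_type_to_str) == MMOP_ONLINE_MOVABLE + 1,
	      "one string per online_type");

/* Approximation of sysfs_streq(): strings match up to one trailing newline. */
static bool streq_nl(const char *a, const char *b)
{
	while (*a && *a == *b) {
		a++;
		b++;
	}
	if (*a == *b)
		return true;
	if (!*a && *b == '\n' && !b[1])
		return true;
	if (*a == '\n' && !a[1] && !*b)
		return true;
	return false;
}

static int memhp_online_type_from_str(const char *str)
{
	for (size_t i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
		if (streq_nl(str, online_type_to_str[i]))
			return (int)i;
	}
	return -EINVAL;
}
```

Because the lookup needs no shared state, it can run before the device
hotplug lock is taken, which is what allows the early "return -EINVAL"
without an unlock path.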

>>---
>> drivers/base/memory.c | 38 +++---
>> 1 file changed, 23 insertions(+), 15 deletions(-)
>>
>>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>index e7e77cafef80..8a7f29c0bf97 100644
>>--- a/drivers/base/memory.c
>>+++ b/drivers/base/memory.c
>>@@ -28,6 +28,24 @@
>> 
>> #define MEMORY_CLASS_NAME"memory"
>> 
>>+static const char *const online_type_to_str[] = {
>>+ [MMOP_OFFLINE] = "offline",
>>+ [MMOP_ONLINE] = "online",
>>+ [MMOP_ONLINE_KERNEL] = "online_kernel",
>>+ [MMOP_ONLINE_MOVABLE] = "online_movable",
>>+};
>>+
>>+static int memhp_online_type_from_str(const char *str)
>>+{
>>+ int i;
>>+
>>+ for (i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
>>+ if (sysfs_streq(str, online_type_to_str[i]))
>>+ return i;
>>+ }
>>+ return -EINVAL;
>>+}
>>+
>> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>> 
>> static int sections_per_block;
>>@@ -236,26 +254,17 @@ static int memory_subsys_offline(struct device *dev)
>> static ssize_t state_store(struct device *dev, struct device_attribute *attr,
>> const char *buf, size_t count)
>> {
>>+ const int online_type = memhp_online_type_from_str(buf);
>
>In your following patch, you did the same conversion. Is it possible to merge
>them into this one?
>
>>  struct memory_block *mem = to_memory_block(dev);
>>- int ret, online_type;
>>+ int ret;
>>+
>>+ if (online_type < 0)
>>+ return -EINVAL;
>> 
>>  ret = lock_device_hotplug_sysfs();
>>  if (ret)
>>  return ret;
>> 
>>- if (sysfs_streq(buf, "online_kernel"))
>>- online_type = MMOP_ONLINE_KERNEL;
>>- else if (sysfs_streq(buf, "online_movable"))
>>- online_type = MMOP_ONLINE_MOVABLE;
>>- else if (sysfs_streq(buf, "online"))
>>- online_type = MMOP_ONLINE;
>>- else if (sysfs_streq(buf, "offline"))
>>- online_type = MMOP_OFFLINE;
>>- else {
>>- ret = -EINVAL;
>>- goto err;
>>- }
>>-
>>  switch (online_type) {
>>  case MMOP_ONLINE_KERNEL:
>>  case MMOP_ONLINE_MOVABLE:
>>@@ -271,7 +280,6 @@ static ssize_t state_store(struct device *dev, struct device_attribute *attr,
>>  ret = -EINVAL; /* should never happen */
>>  }
>> 
>>-err:
>>  unlock_device_hotplug();
>> 
>>  if (ret < 0)
>>-- 
>>2.24.1
>
>-- 
>Wei Yang
>Help you, Help me

-- 
Wei Yang
Help you, Help me


Re: [PATCH v1 5/5] mm/memory_hotplug: allow to specify a default online_type

2020-03-11 Thread Wei Yang
On Wed, Mar 11, 2020 at 01:30:26PM +0100, David Hildenbrand wrote:
>For now, distributions implement advanced udev rules to essentially
>- Don't online any hotplugged memory (s390x)
>- Online all memory to ZONE_NORMAL (e.g., most virt environments like
>  hyperv)
>- Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
>  care of (e.g., bare metal, special virt environments)
>
>In summary: All memory is usually onlined the same way, however, the
>kernel always has to ask userspace to come up with the same answer.
>E.g., HyperV always waits for a memory block to get onlined before
>continuing, otherwise it might end up adding memory faster than
>onlining it, which can result in strange OOM situations.
>
>Let's allow to specify a default online_type, not just "online" and
>"offline". This allows distributions to configure the default online_type
>when booting up and be done with it.
>
>We can now specify "offline", "online", "online_movable" and
>"online_kernel" via
>- "memhp_default_state=" on the kernel cmdline
>- /sys/devices/system/memory/auto_online_blocks
>just like we are able to specify for a single memory block via
>/sys/devices/system/memory/memoryX/state
>
>Cc: Greg Kroah-Hartman 
>Cc: Andrew Morton 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: "Rafael J. Wysocki" 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

OK, I see why the string-compare change was left for this patch.

Reviewed-by: Wei Yang 

>---
> drivers/base/memory.c  | 11 +--
> include/linux/memory_hotplug.h |  2 ++
> mm/memory_hotplug.c|  8 
> 3 files changed, 11 insertions(+), 10 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index 8d3e16dab69f..2b09b68b9f78 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -35,7 +35,7 @@ static const char *const online_type_to_str[] = {
>   [MMOP_ONLINE_MOVABLE] = "online_movable",
> };
> 
>-static int memhp_online_type_from_str(const char *str)
>+int memhp_online_type_from_str(const char *str)
> {
>   int i;
> 
>@@ -394,13 +394,12 @@ static ssize_t auto_online_blocks_store(struct device *dev,
>   struct device_attribute *attr,
>   const char *buf, size_t count)
> {
>-  if (sysfs_streq(buf, "online"))
>-  memhp_default_online_type = MMOP_ONLINE;
>-  else if (sysfs_streq(buf, "offline"))
>-  memhp_default_online_type = MMOP_OFFLINE;
>-  else
>+  const int online_type = memhp_online_type_from_str(buf);
>+
>+  if (online_type < 0)
>   return -EINVAL;
> 
>+  memhp_default_online_type = online_type;
>   return count;
> }
> 
>diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>index c6e090b34c4b..ef55115320fb 100644
>--- a/include/linux/memory_hotplug.h
>+++ b/include/linux/memory_hotplug.h
>@@ -117,6 +117,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
>   struct mhp_restrictions *restrictions);
> extern u64 max_mem_size;
> 
>+extern int memhp_online_type_from_str(const char *str);
>+
> /* Default online_type (MMOP_*) when new memory blocks are added. */
> extern int memhp_default_online_type;
> /* If movable_node boot option specified */
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index 01443c70aa27..4a96273eafa7 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -75,10 +75,10 @@ EXPORT_SYMBOL_GPL(memhp_default_online_type);
> 
> static int __init setup_memhp_default_state(char *str)
> {
>-      if (!strcmp(str, "online"))
>-  memhp_default_online_type = MMOP_ONLINE;
>-  else if (!strcmp(str, "offline"))
>-  memhp_default_online_type = MMOP_OFFLINE;
>+  const int online_type = memhp_online_type_from_str(str);
>+
>+  if (online_type >= 0)
>+  memhp_default_online_type = online_type;
> 
>   return 1;
> }
>-- 
>2.24.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v1 4/5] mm/memory_hotplug: convert memhp_auto_online to store an online_type

2020-03-11 Thread Wei Yang
On Wed, Mar 11, 2020 at 01:30:25PM +0100, David Hildenbrand wrote:
>... and rename it to memhp_default_online_type. This is a preparation
>for more detailed default online behavior.
>
>Cc: Greg Kroah-Hartman 
>Cc: Andrew Morton 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: "Rafael J. Wysocki" 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Cc: Benjamin Herrenschmidt 
>Cc: Paul Mackerras 
>Cc: Michael Ellerman 
>Cc: "K. Y. Srinivasan" 
>Cc: Haiyang Zhang 
>Cc: Stephen Hemminger 
>Cc: Wei Liu 
>Cc: Thomas Gleixner 
>Cc: linuxppc-dev@lists.ozlabs.org
>Cc: linux-hyp...@vger.kernel.org
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 

>---
> arch/powerpc/platforms/powernv/memtrace.c |  2 +-
> drivers/base/memory.c | 10 --
> drivers/hv/hv_balloon.c   |  2 +-
> include/linux/memory_hotplug.h|  3 ++-
> mm/memory_hotplug.c   | 13 +++--
> 5 files changed, 15 insertions(+), 15 deletions(-)
>
>diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
>index d6d64f8718e6..e15a600cfa4d 100644
>--- a/arch/powerpc/platforms/powernv/memtrace.c
>+++ b/arch/powerpc/platforms/powernv/memtrace.c
>@@ -235,7 +235,7 @@ static int memtrace_online(void)
>* If kernel isn't compiled with the auto online option
>* we need to online the memory ourselves.
>*/
>-  if (!memhp_auto_online) {
>+  if (memhp_default_online_type == MMOP_OFFLINE) {
>   lock_device_hotplug();
>   walk_memory_blocks(ent->start, ent->size, NULL,
>  online_mem_block);
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index 8a7f29c0bf97..8d3e16dab69f 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -386,10 +386,8 @@ static DEVICE_ATTR_RO(block_size_bytes);
> static ssize_t auto_online_blocks_show(struct device *dev,
>  struct device_attribute *attr, char *buf)
> {
>-  if (memhp_auto_online)
>-  return sprintf(buf, "online\n");
>-  else
>-  return sprintf(buf, "offline\n");
>+  return sprintf(buf, "%s\n",
>+ online_type_to_str[memhp_default_online_type]);
> }
> 
> static ssize_t auto_online_blocks_store(struct device *dev,
>@@ -397,9 +395,9 @@ static ssize_t auto_online_blocks_store(struct device *dev,
>   const char *buf, size_t count)
> {
>   if (sysfs_streq(buf, "online"))
>-  memhp_auto_online = true;
>+  memhp_default_online_type = MMOP_ONLINE;
>   else if (sysfs_streq(buf, "offline"))
>-  memhp_auto_online = false;
>+  memhp_default_online_type = MMOP_OFFLINE;
>   else
>   return -EINVAL;
> 
>diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
>index a02ce43d778d..3b90fd12e0c5 100644
>--- a/drivers/hv/hv_balloon.c
>+++ b/drivers/hv/hv_balloon.c
>@@ -727,7 +727,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
>   spin_unlock_irqrestore(&dm_device.ha_lock, flags);
> 
>   init_completion(&dm_device.ol_waitevent);
>-  dm_device.ha_waiting = !memhp_auto_online;
>+  dm_device.ha_waiting = memhp_default_online_type == MMOP_OFFLINE;
> 
>   nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
>   ret = add_memory(nid, PFN_PHYS((start_pfn)),
>diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>index c2e06ed5e0e9..c6e090b34c4b 100644
>--- a/include/linux/memory_hotplug.h
>+++ b/include/linux/memory_hotplug.h
>@@ -117,7 +117,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
>   struct mhp_restrictions *restrictions);
> extern u64 max_mem_size;
> 
>-extern bool memhp_auto_online;
>+/* Default online_type (MMOP_*) when new memory blocks are added. */
>+extern int memhp_default_online_type;
> /* If movable_node boot option specified */
> extern bool movable_node_enabled;
> static inline bool movable_node_is_enabled(void)
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index 1a00b5a37ef6..01443c70aa27 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -67,18 +67,18 @@ void put_online_mems(void)
> bool movable_node_enabled = false;
> 
> #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
>-bool memhp_auto_online;
>+int memhp_default_online_type = MMOP_OFFLINE;
> #else
>

Re: [PATCH v1 3/5] drivers/base/memory: store mapping between MMOP_* and string in an array

2020-03-11 Thread Wei Yang
On Wed, Mar 11, 2020 at 01:30:24PM +0100, David Hildenbrand wrote:
>Let's use a simple array which we can reuse soon. While at it, move the
>string->mmop conversion out of the device hotplug lock.
>
>Cc: Greg Kroah-Hartman 
>Cc: Andrew Morton 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: "Rafael J. Wysocki" 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 
>---
> drivers/base/memory.c | 38 +++---
> 1 file changed, 23 insertions(+), 15 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index e7e77cafef80..8a7f29c0bf97 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -28,6 +28,24 @@
> 
> #define MEMORY_CLASS_NAME "memory"
> 
>+static const char *const online_type_to_str[] = {
>+  [MMOP_OFFLINE] = "offline",
>+  [MMOP_ONLINE] = "online",
>+  [MMOP_ONLINE_KERNEL] = "online_kernel",
>+  [MMOP_ONLINE_MOVABLE] = "online_movable",
>+};
>+
>+static int memhp_online_type_from_str(const char *str)
>+{
>+  int i;
>+
>+  for (i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
>+  if (sysfs_streq(str, online_type_to_str[i]))
>+  return i;
>+  }
>+  return -EINVAL;
>+}
>+
> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
> 
> static int sections_per_block;
>@@ -236,26 +254,17 @@ static int memory_subsys_offline(struct device *dev)
> static ssize_t state_store(struct device *dev, struct device_attribute *attr,
>  const char *buf, size_t count)
> {
>+  const int online_type = memhp_online_type_from_str(buf);

In your following patch, you did the same conversion. Is it possible to merge
them into this one?

>   struct memory_block *mem = to_memory_block(dev);
>-  int ret, online_type;
>+  int ret;
>+
>+  if (online_type < 0)
>+  return -EINVAL;
> 
>   ret = lock_device_hotplug_sysfs();
>   if (ret)
>   return ret;
> 
>-  if (sysfs_streq(buf, "online_kernel"))
>-  online_type = MMOP_ONLINE_KERNEL;
>-  else if (sysfs_streq(buf, "online_movable"))
>-  online_type = MMOP_ONLINE_MOVABLE;
>-  else if (sysfs_streq(buf, "online"))
>-  online_type = MMOP_ONLINE;
>-  else if (sysfs_streq(buf, "offline"))
>-  online_type = MMOP_OFFLINE;
>-  else {
>-  ret = -EINVAL;
>-  goto err;
>-  }
>-
>   switch (online_type) {
>   case MMOP_ONLINE_KERNEL:
>   case MMOP_ONLINE_MOVABLE:
>@@ -271,7 +280,6 @@ static ssize_t state_store(struct device *dev, struct device_attribute *attr,
>   ret = -EINVAL; /* should never happen */
>   }
> 
>-err:
>   unlock_device_hotplug();
> 
>   if (ret < 0)
>-- 
>2.24.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v1 2/5] drivers/base/memory: map MMOP_OFFLINE to 0

2020-03-11 Thread Wei Yang
On Wed, Mar 11, 2020 at 01:30:23PM +0100, David Hildenbrand wrote:
>I have no idea why we have to start at -1. Just treat 0 as the special
>case. Clarify a comment (which was wrong, when we come via
>device_online() the first time, the online_type would have been 0 /
>MEM_ONLINE). The default is now always MMOP_OFFLINE.
>
>This is a preparation to use the online_type as an array index.
>
>Cc: Greg Kroah-Hartman 
>Cc: Andrew Morton 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: "Rafael J. Wysocki" 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 
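
To make the "0 as the special case" idea concrete, here is a minimal
user-space sketch. The struct memory_block_model type, its last_onlined_as
field, and model_subsys_online() are hypothetical; they only model the two
lines the patch changes in memory_subsys_online().

```c
#include <assert.h>
#include <stddef.h>

enum {
	MMOP_OFFLINE = 0,	/* also means "no online_type requested" */
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

struct memory_block_model {
	int online_type;	/* request set by the caller, if any */
	int last_onlined_as;	/* stands in for the actual state change */
};

static int model_subsys_online(struct memory_block_model *mem)
{
	assert(mem != NULL);

	/* Called without a configured online_type: default to MMOP_ONLINE. */
	if (mem->online_type == MMOP_OFFLINE)
		mem->online_type = MMOP_ONLINE;

	mem->last_onlined_as = mem->online_type;

	/* Reset so the next caller starts from the default again. */
	mem->online_type = MMOP_OFFLINE;
	return 0;
}
```

With the enum starting at 0, a zero-initialized block already carries the
default, and every online_type is a valid non-negative array index.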

>---
> drivers/base/memory.c  | 11 ---
> include/linux/memory_hotplug.h |  2 +-
> 2 files changed, 5 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index 8c5ce42c0fc3..e7e77cafef80 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -211,17 +211,14 @@ static int memory_subsys_online(struct device *dev)
>   return 0;
> 
>   /*
>-   * If we are called from state_store(), online_type will be
>-   * set >= 0 Otherwise we were called from the device online
>-   * attribute and need to set the online_type.
>+   * When called via device_online() without configuring the online_type,
>+   * we want to default to MMOP_ONLINE.
>*/
>-  if (mem->online_type < 0)
>+  if (mem->online_type == MMOP_OFFLINE)
>   mem->online_type = MMOP_ONLINE;
> 
>   ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
>-
>-  /* clear online_type */
>-  mem->online_type = -1;
>+  mem->online_type = MMOP_OFFLINE;
> 
>   return ret;
> }
>diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>index 261dbf010d5d..c2e06ed5e0e9 100644
>--- a/include/linux/memory_hotplug.h
>+++ b/include/linux/memory_hotplug.h
>@@ -48,7 +48,7 @@ enum {
> /* Types for control the zone type of onlined and offlined memory */
> enum {
>   /* Offline the memory. */
>-  MMOP_OFFLINE = -1,
>+  MMOP_OFFLINE = 0,
>   /* Online the memory. Zone depends, see default_zone_for_pfn(). */
>   MMOP_ONLINE,
>   /* Online the memory to ZONE_NORMAL. */
>-- 
>2.24.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v1 1/5] drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE

2020-03-11 Thread Wei Yang
On Wed, Mar 11, 2020 at 01:30:22PM +0100, David Hildenbrand wrote:
>The name is misleading. Let's just name it like the online_type name we
>expose to user space ("online").
>
>Add some documentation to the types.
>
>Cc: Greg Kroah-Hartman 
>Cc: Andrew Morton 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: "Rafael J. Wysocki" 
>Cc: Baoquan He 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 

>---
> drivers/base/memory.c  | 9 +
> include/linux/memory_hotplug.h | 6 +-
> 2 files changed, 10 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index 6448c9ece2cb..8c5ce42c0fc3 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -216,7 +216,7 @@ static int memory_subsys_online(struct device *dev)
>* attribute and need to set the online_type.
>*/
>   if (mem->online_type < 0)
>-  mem->online_type = MMOP_ONLINE_KEEP;
>+  mem->online_type = MMOP_ONLINE;
> 
>   ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
> 
>@@ -251,7 +251,7 @@ static ssize_t state_store(struct device *dev, struct device_attribute *attr,
>   else if (sysfs_streq(buf, "online_movable"))
>   online_type = MMOP_ONLINE_MOVABLE;
>   else if (sysfs_streq(buf, "online"))
>-  online_type = MMOP_ONLINE_KEEP;
>+  online_type = MMOP_ONLINE;
>   else if (sysfs_streq(buf, "offline"))
>   online_type = MMOP_OFFLINE;
>   else {
>@@ -262,7 +262,7 @@ static ssize_t state_store(struct device *dev, struct device_attribute *attr,
>   switch (online_type) {
>   case MMOP_ONLINE_KERNEL:
>   case MMOP_ONLINE_MOVABLE:
>-  case MMOP_ONLINE_KEEP:
>+  case MMOP_ONLINE:
>   /* mem->online_type is protected by device_hotplug_lock */
>   mem->online_type = online_type;
>   ret = device_online(&mem->dev);
>@@ -342,7 +342,8 @@ static ssize_t valid_zones_show(struct device *dev,
>   }
> 
>   nid = mem->nid;
>-  default_zone = zone_for_pfn_range(MMOP_ONLINE_KEEP, nid, start_pfn, nr_pages);
>+  default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, start_pfn,
>+nr_pages);
>   strcat(buf, default_zone->name);
> 
>   print_allowed_zone(buf, nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL,
>diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>index f4d59155f3d4..261dbf010d5d 100644
>--- a/include/linux/memory_hotplug.h
>+++ b/include/linux/memory_hotplug.h
>@@ -47,9 +47,13 @@ enum {
> 
> /* Types for control the zone type of onlined and offlined memory */
> enum {
>+  /* Offline the memory. */
>   MMOP_OFFLINE = -1,
>-  MMOP_ONLINE_KEEP,
>+  /* Online the memory. Zone depends, see default_zone_for_pfn(). */
>+  MMOP_ONLINE,
>+  /* Online the memory to ZONE_NORMAL. */
>   MMOP_ONLINE_KERNEL,
>+  /* Online the memory to ZONE_MOVABLE. */
>   MMOP_ONLINE_MOVABLE,
> };
> 
>-- 
>2.24.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Wei Yang
On Thu, Feb 06, 2020 at 07:30:51AM +0800, Baoquan He wrote:
>On 02/06/20 at 07:26am, Wei Yang wrote:
>> On Thu, Feb 06, 2020 at 07:08:26AM +0800, Baoquan He wrote:
>> >On 02/06/20 at 06:56am, Wei Yang wrote:
>> >> On Wed, Feb 05, 2020 at 10:48:11PM +0800, Baoquan He wrote:
>> >> >Hi Wei Yang,
>> >> >
>> >> >On 02/05/20 at 05:59pm, Wei Yang wrote:
>> >> >> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> >> >> >index f294918f7211..8dafa1ba8d9f 100644
>> >> >> >--- a/mm/memory_hotplug.c
>> >> >> >+++ b/mm/memory_hotplug.c
>> >> >> >@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>> >> >> >  if (pfn) {
>> >> >> >  zone->zone_start_pfn = pfn;
>> >> >> >  zone->spanned_pages = zone_end_pfn - pfn;
>> >> >> >+ } else {
>> >> >> >+ zone->zone_start_pfn = 0;
>> >> >> >+ zone->spanned_pages = 0;
>> >> >> >  }
>> >> >> >  } else if (zone_end_pfn == end_pfn) {
>> >> >> >  /*
>> >> >> >@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>> >> >> > start_pfn);
>> >> >> >  if (pfn)
>> >> >> >  zone->spanned_pages = pfn - zone_start_pfn + 1;
>> >> >> >+ else {
>> >> >> >+ zone->zone_start_pfn = 0;
>> >> >> >+ zone->spanned_pages = 0;
>> >> >> >+ }
>> >> >> >  }
>> >> >> 
>> >> >> If it were me, I would factor out these two similar pieces of logic.
>> >> >
>> >> >I also like this style. 
>> >> >> 
>> >> >> For example:
>> >> >> 
>> >> >>if () {
>> >> >>} else if () {
>> >> >>} else {
>> >> >>goto out;
>> >> >Here the last else is unnecessary, right?
>> >> >
>> >> 
>> >> I am afraid not.
>> >> 
>> >> If the range is neither the first nor the last, pfn would be left uninitialized.
>> >
>> >Ah, you are right. I forgot that one. Then pfn can be initialized to
>> >zone_start_pfn, as in the old code. The following logic is then the same
>> >as the original code: find_smallest_section_pfn()/find_biggest_section_pfn()
>> >already do the iteration the old for loop was doing.
>> >
>> >unsigned long pfn = zone_start_pfn; 
>> >if () {
>> >} else if () {
>> >} 
>> >
>> >/* The zone has no valid section */
>> >if (!pfn) {
>> >zone->zone_start_pfn = 0;
>> >zone->spanned_pages = 0;
>> >}
>> 
>> This one looks better :-)
>
>Thanks for your confirmation, I will make one patch like this and post.

Sure :-)
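
For reference, the control flow agreed on above can be modelled in user space
as follows. This is a sketch under stated assumptions, not the eventual
patch: struct zone_model, the present[] bitmap, and the helper names are
hypothetical, and the model assumes zone_start_pfn is nonzero, because 0 is
used as the "nothing found" value.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_PFN 64

struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
	bool present[MODEL_MAX_PFN];	/* which pfns still have memory */
};

/* First present pfn in [start, end), or 0 if there is none. */
static unsigned long find_smallest(const struct zone_model *z,
				   unsigned long start, unsigned long end)
{
	for (unsigned long pfn = start; pfn < end; pfn++)
		if (z->present[pfn])
			return pfn;
	return 0;
}

/* Last present pfn in [start, end), or 0 if there is none. */
static unsigned long find_biggest(const struct zone_model *z,
				  unsigned long start, unsigned long end)
{
	for (unsigned long pfn = end; pfn > start; pfn--)
		if (z->present[pfn - 1])
			return pfn - 1;
	return 0;
}

/* Shrink the span after [start_pfn, end_pfn) was removed from the zone. */
static void model_shrink_zone_span(struct zone_model *z,
				   unsigned long start_pfn,
				   unsigned long end_pfn)
{
	unsigned long zone_start_pfn = z->zone_start_pfn;
	unsigned long zone_end_pfn = zone_start_pfn + z->spanned_pages;
	/* Nonzero default covers a range in the middle of the zone. */
	unsigned long pfn = zone_start_pfn;

	assert(zone_start_pfn != 0 && zone_end_pfn <= MODEL_MAX_PFN);

	if (zone_start_pfn == start_pfn) {
		pfn = find_smallest(z, end_pfn, zone_end_pfn);
		if (pfn) {
			z->zone_start_pfn = pfn;
			z->spanned_pages = zone_end_pfn - pfn;
		}
	} else if (zone_end_pfn == end_pfn) {
		pfn = find_biggest(z, zone_start_pfn, start_pfn);
		if (pfn)
			z->spanned_pages = pfn - zone_start_pfn + 1;
	}

	/* The zone has no valid section left. */
	if (!pfn) {
		z->zone_start_pfn = 0;
		z->spanned_pages = 0;
	}
}
```

The single "if (!pfn)" block replaces the two duplicated else branches. The
price is the nonzero-start assumption: a zone that really starts at pfn 0
would need a different sentinel.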

-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Wei Yang
On Thu, Feb 06, 2020 at 07:08:26AM +0800, Baoquan He wrote:
>On 02/06/20 at 06:56am, Wei Yang wrote:
>> On Wed, Feb 05, 2020 at 10:48:11PM +0800, Baoquan He wrote:
>> >Hi Wei Yang,
>> >
>> >On 02/05/20 at 05:59pm, Wei Yang wrote:
>> >> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> >> >index f294918f7211..8dafa1ba8d9f 100644
>> >> >--- a/mm/memory_hotplug.c
>> >> >+++ b/mm/memory_hotplug.c
>> >> >@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>> >> > if (pfn) {
>> >> > zone->zone_start_pfn = pfn;
>> >> > zone->spanned_pages = zone_end_pfn - pfn;
>> >> >+} else {
>> >> >+zone->zone_start_pfn = 0;
>> >> >+zone->spanned_pages = 0;
>> >> > }
>> >> > } else if (zone_end_pfn == end_pfn) {
>> >> > /*
>> >> >@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>> >> >start_pfn);
>> >> > if (pfn)
>> >> > zone->spanned_pages = pfn - zone_start_pfn + 1;
>> >> >+else {
>> >> >+zone->zone_start_pfn = 0;
>> >> >+zone->spanned_pages = 0;
>> >> >+}
>> >> > }
>> >> 
>> >> If it were me, I would factor out these two similar pieces of logic.
>> >
>> >I also like this style. 
>> >> 
>> >> For example:
>> >> 
>> >>   if () {
>> >>   } else if () {
>> >>   } else {
>> >>   goto out;
>> >Here the last else is unnecessary, right?
>> >
>> 
>> I am afraid not.
>> 
>> If the range is neither the first nor the last, pfn would be left uninitialized.
>
>Ah, you are right. I forgot that one. Then pfn can be initialized to
>zone_start_pfn, as in the old code. The following logic is then the same
>as the original code: find_smallest_section_pfn()/find_biggest_section_pfn()
>already do the iteration the old for loop was doing.
>
>   unsigned long pfn = zone_start_pfn; 
>   if () {
>   } else if () {
>   } 
>
>   /* The zone has no valid section */
>   if (!pfn) {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
>   }

This one looks better :-)

Thanks

-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Wei Yang
On Wed, Feb 05, 2020 at 10:48:11PM +0800, Baoquan He wrote:
>Hi Wei Yang,
>
>On 02/05/20 at 05:59pm, Wei Yang wrote:
>> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> >index f294918f7211..8dafa1ba8d9f 100644
>> >--- a/mm/memory_hotplug.c
>> >+++ b/mm/memory_hotplug.c
>> >@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>> >if (pfn) {
>> >zone->zone_start_pfn = pfn;
>> >zone->spanned_pages = zone_end_pfn - pfn;
>> >+   } else {
>> >+   zone->zone_start_pfn = 0;
>> >+   zone->spanned_pages = 0;
>> >}
>> >} else if (zone_end_pfn == end_pfn) {
>> >/*
>> >@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>> >   start_pfn);
>> >if (pfn)
>> >zone->spanned_pages = pfn - zone_start_pfn + 1;
>> >+   else {
>> >+   zone->zone_start_pfn = 0;
>> >+   zone->spanned_pages = 0;
>> >+   }
>> >}
>> 
>> If it were me, I would factor out these two similar pieces of logic.
>
>I also like this style. 
>> 
>> For example:
>> 
>>  if () {
>>  } else if () {
>>  } else {
>>  goto out;
>Here the last else is unnecessary, right?
>

I am afraid not.

If the range is neither the first nor the last, pfn would be left uninitialized.


-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 10/10] mm/memory_hotplug: Cleanup __remove_pages()

2020-02-05 Thread Wei Yang
On Sun, Oct 06, 2019 at 10:56:46AM +0200, David Hildenbrand wrote:
>Let's drop the basically unused section stuff and simplify.
>
>Also, let's use a shorter variant to calculate the number of pages to
>the next section boundary.
>
>Cc: Andrew Morton 
>Cc: Oscar Salvador 
>Cc: Michal Hocko 
>Cc: Pavel Tatashin 
>Cc: Dan Williams 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

Finally understand the code.

Reviewed-by: Wei Yang 
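
One compact way to compute "pages up to the next section boundary" is sketched
below. The quoted message does not show the expression, so treat the exact
form as an assumption. PAGES_PER_SECTION is set to an illustrative value, and
pages_to_section_end() and count_chunks() are hypothetical helper names.

```c
#include <assert.h>

/* Illustrative value; the real one depends on architecture and page size. */
#define PAGES_PER_SECTION	(1UL << 15)
#define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION - 1))
#define SECTION_ALIGN_UP(pfn) \
	(((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Pages from pfn to the next section boundary, capped at end_pfn.
 * Aligning pfn + 1 (not pfn) makes an already aligned pfn yield a full
 * section instead of zero pages.
 */
static unsigned long pages_to_section_end(unsigned long pfn,
					  unsigned long end_pfn)
{
	assert(pfn < end_pfn);
	return min_ul(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
}

/* Number of per-section chunks that [start_pfn, end_pfn) is split into. */
static unsigned long count_chunks(unsigned long start_pfn,
				  unsigned long end_pfn)
{
	unsigned long chunks = 0;

	for (unsigned long pfn = start_pfn; pfn < end_pfn;
	     pfn += pages_to_section_end(pfn, end_pfn))
		chunks++;
	return chunks;
}
```

Each loop iteration then handles exactly the part of one section that lies
inside the range, without tracking section numbers separately.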

-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 09/10] mm/memory_hotplug: Drop local variables in shrink_zone_span()

2020-02-05 Thread Wei Yang
On Sun, Oct 06, 2019 at 10:56:45AM +0200, David Hildenbrand wrote:
>Get rid of the unnecessary local variables.
>
>Cc: Andrew Morton 
>Cc: Oscar Salvador 
>Cc: David Hildenbrand 
>Cc: Michal Hocko 
>Cc: Pavel Tatashin 
>Cc: Dan Williams 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

Looks reasonable.

Reviewed-by: Wei Yang 

>---
> mm/memory_hotplug.c | 15 ++-
> 1 file changed, 6 insertions(+), 9 deletions(-)
>
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index 8dafa1ba8d9f..843481bd507d 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, 
>struct zone *zone,
> static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>unsigned long end_pfn)
> {
>-  unsigned long zone_start_pfn = zone->zone_start_pfn;
>-  unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
>-  unsigned long zone_end_pfn = z;
>   unsigned long pfn;
>   int nid = zone_to_nid(zone);
> 
>   zone_span_writelock(zone);
>-  if (zone_start_pfn == start_pfn) {
>+  if (zone->zone_start_pfn == start_pfn) {
>   /*
>* If the section is smallest section in the zone, it need
>* shrink zone->zone_start_pfn and zone->zone_spanned_pages.
>@@ -389,25 +386,25 @@ static void shrink_zone_span(struct zone *zone, unsigned 
>long start_pfn,
>* for shrinking zone.
>*/
>   pfn = find_smallest_section_pfn(nid, zone, end_pfn,
>-  zone_end_pfn);
>+  zone_end_pfn(zone));
>   if (pfn) {
>+  zone->spanned_pages = zone_end_pfn(zone) - pfn;
>   zone->zone_start_pfn = pfn;
>-  zone->spanned_pages = zone_end_pfn - pfn;
>   } else {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
>   }
>-  } else if (zone_end_pfn == end_pfn) {
>+  } else if (zone_end_pfn(zone) == end_pfn) {
>   /*
>* If the section is biggest section in the zone, it need
>* shrink zone->spanned_pages.
>* In this case, we find second biggest valid mem_section for
>* shrinking zone.
>*/
>-  pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
>+  pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
>  start_pfn);
>   if (pfn)
>-  zone->spanned_pages = pfn - zone_start_pfn + 1;
>+  zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
>   else {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
>-- 
>2.21.0

-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Wei Yang
On Sun, Oct 06, 2019 at 10:56:44AM +0200, David Hildenbrand wrote:
>If we have holes, the holes will automatically get detected and removed
>once we remove the next bigger/smaller section. The extra checks can
>go.
>
>Cc: Andrew Morton 
>Cc: Oscar Salvador 
>Cc: Michal Hocko 
>Cc: David Hildenbrand 
>Cc: Pavel Tatashin 
>Cc: Dan Williams 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 
>---
> mm/memory_hotplug.c | 34 +++---
> 1 file changed, 7 insertions(+), 27 deletions(-)
>
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index f294918f7211..8dafa1ba8d9f 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned 
>long start_pfn,
>   if (pfn) {
>   zone->zone_start_pfn = pfn;
>   zone->spanned_pages = zone_end_pfn - pfn;
>+  } else {
>+  zone->zone_start_pfn = 0;
>+  zone->spanned_pages = 0;
>   }
>   } else if (zone_end_pfn == end_pfn) {
>   /*
>@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned 
>long start_pfn,
>  start_pfn);
>   if (pfn)
>   zone->spanned_pages = pfn - zone_start_pfn + 1;
>+  else {
>+  zone->zone_start_pfn = 0;
>+  zone->spanned_pages = 0;
>+  }
>   }

If it were me, I would factor out these two similar pieces of logic.

For example:

if () {
} else if () {
} else {
goto out;
}


/* The zone has no valid section */
if (!pfn) {
zone->zone_start_pfn = 0;
zone->spanned_pages = 0;
}

out:
zone_span_writeunlock(zone);

Well, this is just my personal taste :-)
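
To make the suggestion concrete, below is a self-contained user-space sketch.
It is not the kernel code: struct toy_zone, the present[] array and the toy_*
helpers are hypothetical stand-ins for struct zone and the
find_(smallest|biggest)_section_pfn() helpers, and locking is left out. It only
shows the control flow with the "zone is empty" handling done in one place.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PFN 64UL

/* Hypothetical stand-in for struct zone. */
struct toy_zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
	bool present[TOY_MAX_PFN];	/* which pfns still belong to the zone */
};

static unsigned long toy_zone_end_pfn(const struct toy_zone *z)
{
	return z->zone_start_pfn + z->spanned_pages;
}

/* Lowest present pfn in [start, end), or 0 if there is none. */
static unsigned long toy_find_smallest(const struct toy_zone *z,
				       unsigned long start, unsigned long end)
{
	for (unsigned long pfn = start; pfn < end; pfn++)
		if (z->present[pfn])
			return pfn;
	return 0;
}

/* Highest present pfn in [start, end), or 0 if there is none. */
static unsigned long toy_find_biggest(const struct toy_zone *z,
				      unsigned long start, unsigned long end)
{
	for (unsigned long pfn = end; pfn > start; pfn--)
		if (z->present[pfn - 1])
			return pfn - 1;
	return 0;
}

static void toy_shrink_zone_span(struct toy_zone *z, unsigned long start_pfn,
				 unsigned long end_pfn)
{
	unsigned long pfn;

	if (z->zone_start_pfn == start_pfn) {
		/* Removing the lowest range: search upwards for a new start. */
		pfn = toy_find_smallest(z, end_pfn, toy_zone_end_pfn(z));
		if (pfn) {
			z->spanned_pages = toy_zone_end_pfn(z) - pfn;
			z->zone_start_pfn = pfn;
		}
	} else if (toy_zone_end_pfn(z) == end_pfn) {
		/* Removing the highest range: search downwards for a new end. */
		pfn = toy_find_biggest(z, z->zone_start_pfn, start_pfn);
		if (pfn)
			z->spanned_pages = pfn - z->zone_start_pfn + 1;
	} else {
		/* Range in the middle: only a hole appears, pfn is never set. */
		goto out;
	}

	/* The zone has no valid section left. */
	if (!pfn) {
		z->zone_start_pfn = 0;
		z->spanned_pages = 0;
	}
out:
	return;	/* zone_span_writeunlock(zone) would go here */
}
```

The final else branch is what keeps the common !pfn check from reading an
unset pfn when the removed range is neither at the start nor at the end.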

>-
>-  /*
>-   * The section is not biggest or smallest mem_section in the zone, it
>-   * only creates a hole in the zone. So in this case, we need not
>-   * change the zone. But perhaps, the zone has only hole data. Thus
>-   * it check the zone has only hole or not.
>-   */
>-  pfn = zone_start_pfn;
>-  for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
>-  if (unlikely(!pfn_to_online_page(pfn)))
>-  continue;
>-
>-  if (page_zone(pfn_to_page(pfn)) != zone)
>-  continue;
>-
>-  /* Skip range to be removed */
>-  if (pfn >= start_pfn && pfn < end_pfn)
>-  continue;
>-
>-  /* If we find valid section, we have nothing to do */
>-  zone_span_writeunlock(zone);
>-  return;
>-  }
>-
>-  /* The zone has no valid section */
>-  zone->zone_start_pfn = 0;
>-  zone->spanned_pages = 0;
>   zone_span_writeunlock(zone);
> }
> 
>-- 
>2.21.0

-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn

2020-02-05 Thread Wei Yang
On Wed, Feb 05, 2020 at 09:59:41AM +0100, David Hildenbrand wrote:
>On 05.02.20 09:57, Wei Yang wrote:
>> On Sun, Oct 06, 2019 at 10:56:43AM +0200, David Hildenbrand wrote:
>>> With shrink_pgdat_span() out of the way, we now always have a valid
>>> zone.
>>>
>>> Cc: Andrew Morton 
>>> Cc: Oscar Salvador 
>>> Cc: David Hildenbrand 
>>> Cc: Michal Hocko 
>>> Cc: Pavel Tatashin 
>>> Cc: Dan Williams 
>>> Cc: Wei Yang 
>>> Signed-off-by: David Hildenbrand 
>> 
>> Reviewed-by: Wei Yang 
>
>Just FYI, the patches are now upstream, so the rb's can no longer be
>applied. (but we can send fixes if we find that something is broken ;)
>). Thanks!
>

Thanks for the reminder. :-)

>-- 
>Thanks,
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me


Re: [PATCH v6 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn

2020-02-05 Thread Wei Yang
On Sun, Oct 06, 2019 at 10:56:43AM +0200, David Hildenbrand wrote:
>With shrink_pgdat_span() out of the way, we now always have a valid
>zone.
>
>Cc: Andrew Morton 
>Cc: Oscar Salvador 
>Cc: David Hildenbrand 
>Cc: Michal Hocko 
>Cc: Pavel Tatashin 
>Cc: Dan Williams 
>Cc: Wei Yang 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 


-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 07/11] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-06-05 Thread Wei Yang
On Wed, Jun 05, 2019 at 12:58:46PM +0200, David Hildenbrand wrote:
>On 05.06.19 10:58, David Hildenbrand wrote:
>>>> /*
>>>>  * For now, we have a linear search to go find the appropriate
>>>>  * memory_block corresponding to a particular phys_index. If
>>>> @@ -658,6 +670,11 @@ static int init_memory_block(struct memory_block 
>>>> **memory, int block_id,
>>>>unsigned long start_pfn;
>>>>int ret = 0;
>>>>
>>>> +  mem = find_memory_block_by_id(block_id, NULL);
>>>> +  if (mem) {
>>>> +  put_device(&mem->dev);
>>>> +  return -EEXIST;
>>>> +  }
>>>
>>> find_memory_block_by_id() is not closely related to the main idea of this patch.
>>> Would it be better to split this part out?
>> 
>> I played with that but didn't like the temporary results (e.g. having to
>> export find_memory_block_by_id()). I'll stick to this for now.
>> 
>>>
>>>>mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>>>>if (!mem)
>>>>return -ENOMEM;
>>>> @@ -699,44 +716,53 @@ static int add_memory_block(int base_section_nr)
>>>>return 0;
>>>> }
>>>>
>>>> +static void unregister_memory(struct memory_block *memory)
>>>> +{
>>>> +  if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
>>>> +  return;
>>>> +
>>>> +  /* drop the ref. we got via find_memory_block() */
>>>> +  put_device(&memory->dev);
>>>> +  device_unregister(&memory->dev);
>>>> +}
>>>> +
>>>> /*
>>>> - * need an interface for the VM to add new memory regions,
>>>> - * but without onlining it.
>>>> + * Create memory block devices for the given memory area. Start and size
>>>> + * have to be aligned to memory block granularity. Memory block devices
>>>> + * will be initialized as offline.
>>>>  */
>>>> -int hotplug_memory_register(int nid, struct mem_section *section)
>>>> +int create_memory_block_devices(unsigned long start, unsigned long size)
>>>> {
>>>> -  int block_id = base_memory_block_id(__section_nr(section));
>>>> -  int ret = 0;
>>>> +  const int start_block_id = pfn_to_block_id(PFN_DOWN(start));
>>>> +  int end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
>>>>struct memory_block *mem;
>>>> +  unsigned long block_id;
>>>> +  int ret = 0;
>>>>
>>>> -  mutex_lock(&mem_sysfs_mutex);
>>>> +  if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
>>>> +   !IS_ALIGNED(size, memory_block_size_bytes())))
>>>> +  return -EINVAL;
>>>>
>>>> -  mem = find_memory_block(section);
>>>> -  if (mem) {
>>>> -  mem->section_count++;
>>>> -  put_device(&mem->dev);
>>>> -  } else {
>>>> +  mutex_lock(&mem_sysfs_mutex);
>>>> +  for (block_id = start_block_id; block_id != end_block_id; block_id++) {
>>>>ret = init_memory_block(&mem, block_id, MEM_OFFLINE);
>>>>if (ret)
>>>> -  goto out;
>>>> -  mem->section_count++;
>>>> +  break;
>>>> +  mem->section_count = sections_per_block;
>>>> +  }
>>>> +  if (ret) {
>>>> +  end_block_id = block_id;
>>>> +  for (block_id = start_block_id; block_id != end_block_id;
>>>> +   block_id++) {
>>>> +  mem = find_memory_block_by_id(block_id, NULL);
>>>> +  mem->section_count = 0;
>>>> +  unregister_memory(mem);
>>>> +  }
>>>>}
>>>
>>> Would it be better to do this in reverse order?
>>>
>>> And unregister_memory() would free mem, so is it still necessary to set
>>> section_count to 0?
>> 
>> 1. I kept the existing behavior (setting it to 0) for now. I am planning
>> to eventually remove the section count completely (it could be
>> beneficial to detect removing of partially populated memory blocks).
>
>Correction: We already use it to block offlining of partially populated
>memory blocks \o/

Would you mind letting me know where we leverage this?

>
>> 
>> 2. Reverse order: We would have to start with "block_id - 1", I don't
>> like that better.
>> 
>> Thanks for having a look!
>> 
>
>
>-- 
>
>Thanks,
>
>David / dhildenb
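
To illustrate the two rollback orders discussed above, here is a user-space
sketch. Everything in it is hypothetical (the toy_* names, the fault-injection
variable, the array standing in for memory block devices); it only contrasts
the forward unwind used in the patch with a reverse unwind that starts at the
failing id minus one.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_BLOCKS 16

static bool toy_block_exists[TOY_NR_BLOCKS];
static int toy_fail_at = -1;	/* hypothetical fault injection: id that fails */

static int toy_init_block(int id)
{
	if (id == toy_fail_at)
		return -1;
	toy_block_exists[id] = true;
	return 0;
}

static void toy_unregister_block(int id)
{
	toy_block_exists[id] = false;
}

/* Forward rollback: same loop shape as creation, bounded by the failing id. */
static int toy_create_blocks_forward(int start_id, int end_id)
{
	int id, ret = 0;

	for (id = start_id; id != end_id; id++) {
		ret = toy_init_block(id);
		if (ret)
			break;
	}
	if (ret) {
		end_id = id;	/* first block that was not created */
		for (id = start_id; id != end_id; id++)
			toy_unregister_block(id);
	}
	return ret;
}

/* Reverse rollback: undo in the opposite order of creation. */
static int toy_create_blocks_reverse(int start_id, int end_id)
{
	int id, ret = 0;

	for (id = start_id; id != end_id; id++) {
		ret = toy_init_block(id);
		if (ret)
			break;
	}
	if (ret) {
		/* The first id touched here is the failing id minus one. */
		while (id-- > start_id)
			toy_unregister_block(id);
	}
	return ret;
}
```

Both variants leave no block behind on failure. The reverse form mirrors the
creation order more strictly, at the cost of the slightly less obvious loop
condition.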

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 11/11] mm/memory_hotplug: Remove "zone" parameter from sparse_remove_one_section

2019-06-05 Thread Wei Yang
On Mon, May 27, 2019 at 01:11:52PM +0200, David Hildenbrand wrote:
>The parameter is unused, so let's drop it. Memory removal paths should
>never care about zones. This is the job of memory offlining and will
>require more refactorings.
>
>Reviewed-by: Dan Williams 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 

>---
> include/linux/memory_hotplug.h | 2 +-
> mm/memory_hotplug.c| 2 +-
> mm/sparse.c| 4 ++--
> 3 files changed, 4 insertions(+), 4 deletions(-)
>
>diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>index 2f1f87e13baa..1a4257c5f74c 100644
>--- a/include/linux/memory_hotplug.h
>+++ b/include/linux/memory_hotplug.h
>@@ -346,7 +346,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, 
>unsigned long start_pfn,
> extern bool is_memblock_offlined(struct memory_block *mem);
> extern int sparse_add_one_section(int nid, unsigned long start_pfn,
> struct vmem_altmap *altmap);
>-extern void sparse_remove_one_section(struct zone *zone, struct mem_section 
>*ms,
>+extern void sparse_remove_one_section(struct mem_section *ms,
>   unsigned long map_offset, struct vmem_altmap *altmap);
> extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
> unsigned long pnum);
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index 82136c5b4c5f..e48ec7b9dee2 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -524,7 +524,7 @@ static void __remove_section(struct zone *zone, struct 
>mem_section *ms,
>   start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
>   __remove_zone(zone, start_pfn);
> 
>-  sparse_remove_one_section(zone, ms, map_offset, altmap);
>+  sparse_remove_one_section(ms, map_offset, altmap);
> }
> 
> /**
>diff --git a/mm/sparse.c b/mm/sparse.c
>index d1d5e05f5b8d..1552c855d62a 100644
>--- a/mm/sparse.c
>+++ b/mm/sparse.c
>@@ -800,8 +800,8 @@ static void free_section_usemap(struct page *memmap, 
>unsigned long *usemap,
>   free_map_bootmem(memmap);
> }
> 
>-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
>-  unsigned long map_offset, struct vmem_altmap *altmap)
>+void sparse_remove_one_section(struct mem_section *ms, unsigned long 
>map_offset,
>+ struct vmem_altmap *altmap)
> {
>   struct page *memmap = NULL;
>   unsigned long *usemap = NULL;
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail

2019-06-05 Thread Wei Yang
On Mon, May 27, 2019 at 01:11:51PM +0200, David Hildenbrand wrote:
>We really don't want anything during memory hotunplug to fail.
>We always pass a valid memory block device, that check can go. Avoid
>allocating memory and eventually failing. As we are always called under
>lock, we can use a static piece of memory. This avoids having to put
>the structure onto the stack, having to guess about the stack size
>of callers.
>
>Patch inspired by a patch from Oscar Salvador.
>
>In the future, there might be no need to iterate over nodes at all.
>mem->nid should tell us exactly what to remove. Memory block devices
>with mixed nodes (added during boot) should be properly fenced off and never
>removed.
>
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Cc: Alex Deucher 
>Cc: "David S. Miller" 
>Cc: Mark Brown 
>Cc: Chris Wilson 
>Cc: David Hildenbrand 
>Cc: Oscar Salvador 
>Cc: Andrew Morton 
>Cc: Jonathan Cameron 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 
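
A user-space sketch of the pattern, for the archive: a static scratch buffer
replaces the allocation, which is only safe because every caller holds the
same lock. All names here are hypothetical; a plain flag stands in for
mem_sysfs_mutex, a bool array stands in for nodemask_t, and the function
returns a count purely so the behaviour can be checked (the patched kernel
function returns void).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_NODES 8

static bool toy_lock_held;	/* stand-in for holding mem_sysfs_mutex */

/* Counts the distinct valid nodes it would unlink; it has no failure path. */
static int toy_unlink_nodes(const int *nids, int n)
{
	/*
	 * Static scratch space: no allocation that could fail and no stack
	 * cost. This is only safe because callers serialize on the lock.
	 */
	static bool unlinked[TOY_MAX_NODES];
	int i, count = 0;

	assert(toy_lock_held);
	memset(unlinked, 0, sizeof(unlinked));	/* like nodes_clear() */

	for (i = 0; i < n; i++) {
		int nid = nids[i];

		if (nid < 0 || nid >= TOY_MAX_NODES)
			continue;
		if (unlinked[nid])	/* like node_test_and_set() */
			continue;
		unlinked[nid] = true;
		count++;	/* the sysfs links would be removed here */
	}
	return count;
}
```

Clearing the buffer on entry matters more than in the allocating version,
since the static contents survive from one call to the next.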

>---
> drivers/base/node.c  | 18 +-
> include/linux/node.h |  5 ++---
> 2 files changed, 7 insertions(+), 16 deletions(-)
>
>diff --git a/drivers/base/node.c b/drivers/base/node.c
>index 04fdfa99b8bc..9be88fd05147 100644
>--- a/drivers/base/node.c
>+++ b/drivers/base/node.c
>@@ -803,20 +803,14 @@ int register_mem_sect_under_node(struct memory_block 
>*mem_blk, void *arg)
> 
> /*
>  * Unregister memory block device under all nodes that it spans.
>+ * Has to be called with mem_sysfs_mutex held (due to unlinked_nodes).
>  */
>-int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>+void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> {
>-  NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
>   unsigned long pfn, sect_start_pfn, sect_end_pfn;
>+  static nodemask_t unlinked_nodes;
> 
>-  if (!mem_blk) {
>-  NODEMASK_FREE(unlinked_nodes);
>-  return -EFAULT;
>-  }
>-  if (!unlinked_nodes)
>-  return -ENOMEM;
>-  nodes_clear(*unlinked_nodes);
>-
>+  nodes_clear(unlinked_nodes);
>   sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
>   sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
>   for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>@@ -827,15 +821,13 @@ int unregister_memory_block_under_nodes(struct 
>memory_block *mem_blk)
>   continue;
>   if (!node_online(nid))
>   continue;
>-  if (node_test_and_set(nid, *unlinked_nodes))
>+  if (node_test_and_set(nid, unlinked_nodes))
>   continue;
>   sysfs_remove_link(&node_devices[nid]->dev.kobj,
>kobject_name(&mem_blk->dev.kobj));
>   sysfs_remove_link(&mem_blk->dev.kobj,
>kobject_name(&node_devices[nid]->dev.kobj));
>   }
>-  NODEMASK_FREE(unlinked_nodes);
>-  return 0;
> }
> 
> int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
>diff --git a/include/linux/node.h b/include/linux/node.h
>index 02a29e71b175..548c226966a2 100644
>--- a/include/linux/node.h
>+++ b/include/linux/node.h
>@@ -139,7 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, 
>unsigned int nid);
> extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
> extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>   void *arg);
>-extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
>+extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
> 
> extern int register_memory_node_under_compute_node(unsigned int mem_nid,
>  unsigned int cpu_nid,
>@@ -175,9 +175,8 @@ static inline int register_mem_sect_under_node(struct 
>memory_block *mem_blk,
> {
>   return 0;
> }
>-static inline int unregister_memory_block_under_nodes(struct memory_block 
>*mem_blk)
>+static inline void unregister_memory_block_under_nodes(struct memory_block 
>*mem_blk)
> {
>-  return 0;
> }
> 
> static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 09/11] mm/memory_hotplug: Remove memory block devices before arch_remove_memory()

2019-06-04 Thread Wei Yang
; extern int register_memory_isolate_notifier(struct notifier_block *nb);
> extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
> int create_memory_block_devices(unsigned long start, unsigned long size);
>-extern void unregister_memory_section(struct mem_section *);
>+void remove_memory_block_devices(unsigned long start, unsigned long size);
> extern int memory_dev_init(void);
> extern int memory_notify(unsigned long val, void *v);
> extern int memory_isolate_notify(unsigned long val, void *v);
>diff --git a/include/linux/node.h b/include/linux/node.h
>index 1a557c589ecb..02a29e71b175 100644
>--- a/include/linux/node.h
>+++ b/include/linux/node.h
>@@ -139,8 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, 
>unsigned int nid);
> extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
> extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>   void *arg);
>-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>- unsigned long phys_index);
>+extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
> 
> extern int register_memory_node_under_compute_node(unsigned int mem_nid,
>  unsigned int cpu_nid,
>@@ -176,8 +175,7 @@ static inline int register_mem_sect_under_node(struct 
>memory_block *mem_blk,
> {
>   return 0;
> }
>-static inline int unregister_mem_sect_under_nodes(struct memory_block 
>*mem_blk,
>-unsigned long phys_index)
>+static inline int unregister_memory_block_under_nodes(struct memory_block 
>*mem_blk)
> {
>   return 0;
> }
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index 9a92549ef23b..82136c5b4c5f 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -520,8 +520,6 @@ static void __remove_section(struct zone *zone, struct 
>mem_section *ms,
>   if (WARN_ON_ONCE(!valid_section(ms)))
>   return;
> 
>-      unregister_memory_section(ms);
>-
>   scn_nr = __section_nr(ms);
>   start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
>   __remove_zone(zone, start_pfn);
>@@ -1845,6 +1843,9 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>   memblock_free(start, size);
>   memblock_remove(start, size);
> 
>+  /* remove memory block devices before removing memory */
>+  remove_memory_block_devices(start, size);
>+
>   arch_remove_memory(nid, start, size, NULL);
>   __release_memory_resource(start, size);
> 
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 08/11] mm/memory_hotplug: Drop MHP_MEMBLOCK_API

2019-06-04 Thread Wei Yang
On Mon, May 27, 2019 at 01:11:49PM +0200, David Hildenbrand wrote:
>No longer needed, the callers of arch_add_memory() can handle this
>manually.
>
>Cc: Andrew Morton 
>Cc: David Hildenbrand 
>Cc: Michal Hocko 
>Cc: Oscar Salvador 
>Cc: Pavel Tatashin 
>Cc: Wei Yang 
>Cc: Joonsoo Kim 
>Cc: Qian Cai 
>Cc: Arun KS 
>Cc: Mathieu Malaterre 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 

>---
> include/linux/memory_hotplug.h | 8 
> mm/memory_hotplug.c| 9 +++--
> 2 files changed, 3 insertions(+), 14 deletions(-)
>
>diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>index 2d4de313926d..2f1f87e13baa 100644
>--- a/include/linux/memory_hotplug.h
>+++ b/include/linux/memory_hotplug.h
>@@ -128,14 +128,6 @@ extern void arch_remove_memory(int nid, u64 start, u64 
>size,
> extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
>  unsigned long nr_pages, struct vmem_altmap *altmap);
> 
>-/*
>- * Do we want sysfs memblock files created. This will allow userspace to 
>online
>- * and offline memory explicitly. Lack of this bit means that the caller has 
>to
>- * call move_pfn_range_to_zone to finish the initialization.
>- */
>-
>-#define MHP_MEMBLOCK_API   (1<<0)
>-
> /* reasonably generic interface to expand the physical pages */
> extern int __add_pages(int nid, unsigned long start_pfn, unsigned long 
> nr_pages,
>  struct mhp_restrictions *restrictions);
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index b1fde90bbf19..9a92549ef23b 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -251,7 +251,7 @@ void __init register_page_bootmem_info_node(struct 
>pglist_data *pgdat)
> #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
> 
> static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
>-  struct vmem_altmap *altmap, bool want_memblock)
>+ struct vmem_altmap *altmap)
> {
>   int ret;
> 
>@@ -294,8 +294,7 @@ int __ref __add_pages(int nid, unsigned long 
>phys_start_pfn,
>   }
> 
>   for (i = start_sec; i <= end_sec; i++) {
>-  err = __add_section(nid, section_nr_to_pfn(i), altmap,
>-  restrictions->flags & MHP_MEMBLOCK_API);
>+  err = __add_section(nid, section_nr_to_pfn(i), altmap);
> 
>   /*
>* EEXIST is finally dealt with by ioresource collision
>@@ -1067,9 +1066,7 @@ static int online_memory_block(struct memory_block *mem, 
>void *arg)
>  */
> int __ref add_memory_resource(int nid, struct resource *res)
> {
>-  struct mhp_restrictions restrictions = {
>-  .flags = MHP_MEMBLOCK_API,
>-  };
>+  struct mhp_restrictions restrictions = {};
>   u64 start, size;
>   bool new_node = false;
>   int ret;
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 07/11] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-06-04 Thread Wei Yang
On Mon, May 27, 2019 at 01:11:48PM +0200, David Hildenbrand wrote:
>Only memory to be added to the buddy and to be onlined/offlined by
>user space using /sys/devices/system/memory/... needs (and should have!)
>memory block devices.
>
>Factor out creation of memory block devices. Create all devices after
>arch_add_memory() succeeded. We can later drop the want_memblock parameter,
>because it is now effectively stale.
>
>Only after memory block devices have been added, memory can be onlined
>by user space. This implies that memory is not visible to user space at
>all before arch_add_memory() succeeded.
>
>While at it
>- use WARN_ON_ONCE instead of BUG_ON in moved unregister_memory()
>- introduce find_memory_block_by_id() to search via block id
>- Use find_memory_block_by_id() in init_memory_block() to catch
>  duplicates

Generally looks good to me, apart from two minor comments.

>
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Cc: David Hildenbrand 
>Cc: "mike.tra...@hpe.com" 
>Cc: Andrew Morton 
>Cc: Ingo Molnar 
>Cc: Andrew Banman 
>Cc: Oscar Salvador 
>Cc: Michal Hocko 
>Cc: Pavel Tatashin 
>Cc: Qian Cai 
>Cc: Wei Yang 
>Cc: Arun KS 
>Cc: Mathieu Malaterre 
>Signed-off-by: David Hildenbrand 
>---
> drivers/base/memory.c  | 82 +++---
> include/linux/memory.h |  2 +-
> mm/memory_hotplug.c| 15 
> 3 files changed, 63 insertions(+), 36 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index ac17c95a5f28..5a0370f0c506 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -39,6 +39,11 @@ static inline int base_memory_block_id(int section_nr)
>   return section_nr / sections_per_block;
> }
> 
>+static inline int pfn_to_block_id(unsigned long pfn)
>+{
>+  return base_memory_block_id(pfn_to_section_nr(pfn));
>+}
>+
> static int memory_subsys_online(struct device *dev);
> static int memory_subsys_offline(struct device *dev);
> 
>@@ -582,10 +587,9 @@ int __weak arch_get_memory_phys_device(unsigned long 
>start_pfn)
>  * A reference for the returned object is held and the reference for the
>  * hinted object is released.
>  */
>-struct memory_block *find_memory_block_hinted(struct mem_section *section,
>-struct memory_block *hint)
>+static struct memory_block *find_memory_block_by_id(int block_id,
>+  struct memory_block *hint)
> {
>-  int block_id = base_memory_block_id(__section_nr(section));
>   struct device *hintdev = hint ? &hint->dev : NULL;
>   struct device *dev;
> 
>@@ -597,6 +601,14 @@ struct memory_block *find_memory_block_hinted(struct 
>mem_section *section,
>   return to_memory_block(dev);
> }
> 
>+struct memory_block *find_memory_block_hinted(struct mem_section *section,
>+struct memory_block *hint)
>+{
>+  int block_id = base_memory_block_id(__section_nr(section));
>+
>+  return find_memory_block_by_id(block_id, hint);
>+}
>+
> /*
>  * For now, we have a linear search to go find the appropriate
>  * memory_block corresponding to a particular phys_index. If
>@@ -658,6 +670,11 @@ static int init_memory_block(struct memory_block 
>**memory, int block_id,
>   unsigned long start_pfn;
>   int ret = 0;
> 
>+  mem = find_memory_block_by_id(block_id, NULL);
>+  if (mem) {
>+  put_device(&mem->dev);
>+  return -EEXIST;
>+  }

find_memory_block_by_id() is not closely related to the main idea of this patch.
Would it be better to split this part out?

>   mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>   if (!mem)
>   return -ENOMEM;
>@@ -699,44 +716,53 @@ static int add_memory_block(int base_section_nr)
>   return 0;
> }
> 
>+static void unregister_memory(struct memory_block *memory)
>+{
>+  if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
>+  return;
>+
>+  /* drop the ref. we got via find_memory_block() */
>+  put_device(&memory->dev);
>+  device_unregister(&memory->dev);
>+}
>+
> /*
>- * need an interface for the VM to add new memory regions,
>- * but without onlining it.
>+ * Create memory block devices for the given memory area. Start and size
>+ * have to be aligned to memory block granularity. Memory block devices
>+ * will be initialized as offline.
>  */
>-int hotplug_memory_register(int nid, struct mem_section *section)
>+int create_memory_block_devices(unsigned long start, unsigned long size)
> {
>-  int block_id = base_memory_block_id(__section_nr(section));
>-  int ret

Re: [PATCH v3 06/11] mm/memory_hotplug: Allow arch_remove_pages() without CONFIG_MEMORY_HOTREMOVE

2019-06-04 Thread Wei Yang
On Tue, Jun 04, 2019 at 08:59:43AM +0200, David Hildenbrand wrote:
>On 04.06.19 00:15, Wei Yang wrote:
>> The subject says arch_remove_pages(); should it be arch_remove_memory()?
>
>Looks like I merged __remove_pages() and arch_remove_memory().
>
>@Andrew, can you fix this up to
>
>"mm/memory_hotplug: Allow arch_remove_memory() without
>CONFIG_MEMORY_HOTREMOVE"
>
>? Thanks!
>

Already merged?

>> 
>> Also, has the kernel build been confirmed to succeed on the affected archs?
>
>I compile-tested on s390x and x86. As the patches are in linux-next for
>some time, I think the other builds are also fine.
>

Yep, sounds good~

>Thanks!
>
>-- 
>
>Thanks,
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 06/11] mm/memory_hotplug: Allow arch_remove_pages() without CONFIG_MEMORY_HOTREMOVE

2019-06-03 Thread Wei Yang
The subject says arch_remove_pages(); should it be arch_remove_memory()?

Also, has the kernel build been confirmed to succeed on the affected archs?

On Mon, May 27, 2019 at 01:11:47PM +0200, David Hildenbrand wrote:
>We want to improve error handling while adding memory by allowing
>to use arch_remove_memory() and __remove_pages() even if
>CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like:
>
>   arch_add_memory()
>   rc = do_something();
>   if (rc) {
>   arch_remove_memory();
>   }
>
>We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
>quite some dependencies for memory offlining.
>
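
A slightly fuller user-space sketch of the add-then-undo pattern quoted above.
The toy_* functions and the fault-injection flag are hypothetical; they only
model "add memory, try a follow-up step, undo the add if that step fails".

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_memory_added;
static bool toy_fail_devices;	/* hypothetical fault injection */

static int toy_arch_add_memory(void)
{
	toy_memory_added = true;
	return 0;
}

static void toy_arch_remove_memory(void)
{
	toy_memory_added = false;
}

/* Stand-in for the follow-up step, e.g. creating memory block devices. */
static int toy_create_devices(void)
{
	return toy_fail_devices ? -1 : 0;
}

static int toy_add_memory(void)
{
	int rc = toy_arch_add_memory();

	if (rc)
		return rc;

	rc = toy_create_devices();
	if (rc)
		toy_arch_remove_memory();	/* undo the add on failure */
	return rc;
}
```

The undo call is why the remove path has to be available even when memory
hotremove itself is not configured.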
>Cc: Tony Luck 
>Cc: Fenghua Yu 
>Cc: Benjamin Herrenschmidt 
>Cc: Paul Mackerras 
>Cc: Michael Ellerman 
>Cc: Martin Schwidefsky 
>Cc: Heiko Carstens 
>Cc: Yoshinori Sato 
>Cc: Rich Felker 
>Cc: Dave Hansen 
>Cc: Andy Lutomirski 
>Cc: Peter Zijlstra 
>Cc: Thomas Gleixner 
>Cc: Ingo Molnar 
>Cc: Borislav Petkov 
>Cc: "H. Peter Anvin" 
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Cc: Andrew Morton 
>Cc: Michal Hocko 
>Cc: Mike Rapoport 
>Cc: David Hildenbrand 
>Cc: Oscar Salvador 
>Cc: "Kirill A. Shutemov" 
>Cc: Alex Deucher 
>Cc: "David S. Miller" 
>Cc: Mark Brown 
>Cc: Chris Wilson 
>Cc: Christophe Leroy 
>Cc: Nicholas Piggin 
>Cc: Vasily Gorbik 
>Cc: Rob Herring 
>Cc: Masahiro Yamada 
>Cc: "mike.tra...@hpe.com" 
>Cc: Andrew Banman 
>Cc: Pavel Tatashin 
>Cc: Wei Yang 
>Cc: Arun KS 
>Cc: Qian Cai 
>Cc: Mathieu Malaterre 
>Cc: Baoquan He 
>Cc: Logan Gunthorpe 
>Cc: Anshuman Khandual 
>Signed-off-by: David Hildenbrand 
>---
> arch/arm64/mm/mmu.c| 2 --
> arch/ia64/mm/init.c| 2 --
> arch/powerpc/mm/mem.c  | 2 --
> arch/s390/mm/init.c| 2 --
> arch/sh/mm/init.c  | 2 --
> arch/x86/mm/init_32.c  | 2 --
> arch/x86/mm/init_64.c  | 2 --
> drivers/base/memory.c  | 2 --
> include/linux/memory.h | 2 --
> include/linux/memory_hotplug.h | 2 --
> mm/memory_hotplug.c| 2 --
> mm/sparse.c| 6 --
> 12 files changed, 28 deletions(-)
>
>diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>index e569a543c384..9ccd7539f2d4 100644
>--- a/arch/arm64/mm/mmu.c
>+++ b/arch/arm64/mm/mmu.c
>@@ -1084,7 +1084,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
>  restrictions);
> }
>-#ifdef CONFIG_MEMORY_HOTREMOVE
> void arch_remove_memory(int nid, u64 start, u64 size,
>   struct vmem_altmap *altmap)
> {
>@@ -1103,4 +1102,3 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   __remove_pages(zone, start_pfn, nr_pages, altmap);
> }
> #endif
>-#endif
>diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
>index d28e29103bdb..aae75fd7b810 100644
>--- a/arch/ia64/mm/init.c
>+++ b/arch/ia64/mm/init.c
>@@ -681,7 +681,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   return ret;
> }
> 
>-#ifdef CONFIG_MEMORY_HOTREMOVE
> void arch_remove_memory(int nid, u64 start, u64 size,
>   struct vmem_altmap *altmap)
> {
>@@ -693,4 +692,3 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   __remove_pages(zone, start_pfn, nr_pages, altmap);
> }
> #endif
>-#endif
>diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>index e885fe2aafcc..e4bc2dc3f593 100644
>--- a/arch/powerpc/mm/mem.c
>+++ b/arch/powerpc/mm/mem.c
>@@ -130,7 +130,6 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>   return __add_pages(nid, start_pfn, nr_pages, restrictions);
> }
> 
>-#ifdef CONFIG_MEMORY_HOTREMOVE
> void __ref arch_remove_memory(int nid, u64 start, u64 size,
>struct vmem_altmap *altmap)
> {
>@@ -164,7 +163,6 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
>   pr_warn("Hash collision while resizing HPT\n");
> }
> #endif
>-#endif /* CONFIG_MEMORY_HOTPLUG */
> 
> #ifndef CONFIG_NEED_MULTIPLE_NODES
> void __init mem_topology_setup(void)
>diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>index 14955e0a9fcf..ffb81fe95c77 100644
>--- a/arch/s390/mm/init.c
>+++ b/arch/s390/mm/init.c
>@@ -239,7 +239,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   return rc;
> }
> 
>-#ifdef CONFIG_MEMORY_HOTREMOVE
> void arch_remove_memory(int nid, u64 start, u64 size,
>   struct vmem_altmap *altmap)
> {
>@@ -251,5 +250,4 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  

Re: [PATCH v3 05/11] drivers/base/memory: Pass a block_id to init_memory_block()

2019-06-03 Thread Wei Yang
On Mon, May 27, 2019 at 01:11:46PM +0200, David Hildenbrand wrote:
>We'll rework hotplug_memory_register() shortly, so it no longer consumes
>a section.
>
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Signed-off-by: David Hildenbrand 
>---
> drivers/base/memory.c | 15 +++
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index f180427e48f4..f914fa6fe350 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -651,21 +651,18 @@ int register_memory(struct memory_block *memory)
>   return ret;
> }
> 
>-static int init_memory_block(struct memory_block **memory,
>-   struct mem_section *section, unsigned long state)
>+static int init_memory_block(struct memory_block **memory, int block_id,
>+   unsigned long state)
> {
>   struct memory_block *mem;
>   unsigned long start_pfn;
>-  int scn_nr;
>   int ret = 0;
> 
>   mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>   if (!mem)
>   return -ENOMEM;
> 
>-  scn_nr = __section_nr(section);
>-  mem->start_section_nr =
>-  base_memory_block_id(scn_nr) * sections_per_block;
>+  mem->start_section_nr = block_id * sections_per_block;
>   mem->end_section_nr = mem->start_section_nr + sections_per_block - 1;
>   mem->state = state;
>   start_pfn = section_nr_to_pfn(mem->start_section_nr);
>@@ -694,7 +691,8 @@ static int add_memory_block(int base_section_nr)
> 
>   if (section_count == 0)
>   return 0;
>-  ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);
>+  ret = init_memory_block(&mem, base_memory_block_id(base_section_nr),
>+  MEM_ONLINE);

If my understanding is correct, section_nr could be removed too.
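
To make the arithmetic behind this hunk concrete, here is a minimal userspace
sketch. It assumes base_memory_block_id() is a plain division by
sections_per_block; the value 8 is an arbitrary assumption for illustration,
and the block_*_section_nr() helpers are hypothetical names, not kernel API.

```c
/* Userspace model only; not kernel code. */
static const int sections_per_block = 8;	/* assumed value for illustration */

/* Models the helper used in the patch: section number -> block id. */
static int base_memory_block_id(int section_nr)
{
	return section_nr / sections_per_block;
}

/* Hypothetical helper: what init_memory_block() stores as start_section_nr. */
static int block_start_section_nr(int block_id)
{
	return block_id * sections_per_block;
}

/* Hypothetical helper: what init_memory_block() stores as end_section_nr. */
static int block_end_section_nr(int block_id)
{
	return block_start_section_nr(block_id) + sections_per_block - 1;
}
```

Every section inside one block maps to the same start section, which is why
the caller only needs to hand over the block id.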

>   if (ret)
>   return ret;
>   mem->section_count = section_count;
>@@ -707,6 +705,7 @@ static int add_memory_block(int base_section_nr)
>  */
> int hotplug_memory_register(int nid, struct mem_section *section)
> {
>+  int block_id = base_memory_block_id(__section_nr(section));
>   int ret = 0;
>   struct memory_block *mem;
> 
>@@ -717,7 +716,7 @@ int hotplug_memory_register(int nid, struct mem_section 
>*section)
>   mem->section_count++;
>   put_device(&mem->dev);
>   } else {
>-  ret = init_memory_block(&mem, section, MEM_OFFLINE);
>+  ret = init_memory_block(&mem, block_id, MEM_OFFLINE);
>   if (ret)
>   goto out;
>   mem->section_count++;
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 00/11] mm/memory_hotplug: Factor out memory block devicehandling

2019-06-03 Thread Wei Yang
IMHO, there is a typo in the subject.

s/devicehandling/device handling/

On Mon, May 27, 2019 at 01:11:41PM +0200, David Hildenbrand wrote:
>We only want memory block devices for memory to be onlined/offlined
>(add/remove from the buddy). This is required so user space can
>online/offline memory and kdump gets notified about newly onlined memory.
>
>Let's factor out creation/removal of memory block devices. This helps
>to further cleanup arch_add_memory/arch_remove_memory() and to make
>implementation of new features easier - especially sub-section
>memory hot add from Dan.
>
>Anshuman Khandual is currently working on arch_remove_memory(). I added
>a temporary solution via "arm64/mm: Add temporary arch_remove_memory()
>implementation", that is sufficient as a firsts tep in the context of

s/firsts tep/first step/

>this series. (we don't cleanup page tables in case anything goes
>wrong already)
>
>Did a quick sanity test with DIMM plug/unplug, making sure all devices
>and sysfs links properly get added/removed. Compile tested on s390x and
>x86-64.
>
>Based on next/master.
>
>Next refactoring on my list will be making sure that remove_memory()
>will never deal with zones / access "struct pages". Any kind of zone
>handling will have to be done when offlining system memory / before
>removing device memory. I am thinking about remove_pfn_range_from_zone()",
>du undo everything "move_pfn_range_to_zone()" did.

What is "du undo"? I don't quite get it.

>
>v2 -> v3:
>- Add "s390x/mm: Fail when an altmap is used for arch_add_memory()"
>- Add "arm64/mm: Add temporary arch_remove_memory() implementation"
>- Add "drivers/base/memory: Pass a block_id to init_memory_block()"
>- Various changes to "mm/memory_hotplug: Create memory block devices
>  after arch_add_memory()" and "mm/memory_hotplug: Create memory block
>  devices after arch_add_memory()" due to switching from sections to
>  block_id's.
>
>v1 -> v2:
>- s390x/mm: Implement arch_remove_memory()
>-- remove mapping after "__remove_pages"
>
>David Hildenbrand (11):
>  mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
>  s390x/mm: Fail when an altmap is used for arch_add_memory()
>  s390x/mm: Implement arch_remove_memory()
>  arm64/mm: Add temporary arch_remove_memory() implementation
>  drivers/base/memory: Pass a block_id to init_memory_block()
>  mm/memory_hotplug: Allow arch_remove_pages() without
>CONFIG_MEMORY_HOTREMOVE
>  mm/memory_hotplug: Create memory block devices after arch_add_memory()
>  mm/memory_hotplug: Drop MHP_MEMBLOCK_API
>  mm/memory_hotplug: Remove memory block devices before
>arch_remove_memory()
>  mm/memory_hotplug: Make unregister_memory_block_under_nodes() never
>fail
>  mm/memory_hotplug: Remove "zone" parameter from
>sparse_remove_one_section
>
> arch/arm64/mm/mmu.c|  17 +
> arch/ia64/mm/init.c|   2 -
> arch/powerpc/mm/mem.c  |   2 -
> arch/s390/mm/init.c|  18 +++--
> arch/sh/mm/init.c  |   2 -
> arch/x86/mm/init_32.c  |   2 -
> arch/x86/mm/init_64.c  |   2 -
> drivers/base/memory.c  | 134 +++--
> drivers/base/node.c    |  27 +++
> include/linux/memory.h |   6 +-
> include/linux/memory_hotplug.h |  12 +--
> include/linux/node.h   |   7 +-
> mm/memory_hotplug.c|  44 +--
> mm/sparse.c|  10 +--
> 14 files changed, 140 insertions(+), 145 deletions(-)
>
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v3 04/11] arm64/mm: Add temporary arch_remove_memory() implementation

2019-06-03 Thread Wei Yang
On Mon, May 27, 2019 at 01:11:45PM +0200, David Hildenbrand wrote:
>A proper arch_remove_memory() implementation is on its way, which also
>cleanly removes page tables in arch_add_memory() in case something goes
>wrong.

Would this be easier to understand?

removes page tables created in arch_add_memory

>
>As we want to use arch_remove_memory() in case something goes wrong
>during memory hotplug after arch_add_memory() finished, let's add
>a temporary hack that is sufficient enough until we get a proper
>implementation that cleans up page table entries.
>
>We will remove CONFIG_MEMORY_HOTREMOVE around this code in follow up
>patches.
>
>Cc: Catalin Marinas 
>Cc: Will Deacon 
>Cc: Mark Rutland 
>Cc: Andrew Morton 
>Cc: Ard Biesheuvel 
>Cc: Chintan Pandya 
>Cc: Mike Rapoport 
>Cc: Jun Yao 
>Cc: Yu Zhao 
>Cc: Robin Murphy 
>Cc: Anshuman Khandual 
>Signed-off-by: David Hildenbrand 
>---
> arch/arm64/mm/mmu.c | 19 +++
> 1 file changed, 19 insertions(+)
>
>diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>index a1bfc4413982..e569a543c384 100644
>--- a/arch/arm64/mm/mmu.c
>+++ b/arch/arm64/mm/mmu.c
>@@ -1084,4 +1084,23 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
>  restrictions);
> }
>+#ifdef CONFIG_MEMORY_HOTREMOVE
>+void arch_remove_memory(int nid, u64 start, u64 size,
>+  struct vmem_altmap *altmap)
>+{
>+  unsigned long start_pfn = start >> PAGE_SHIFT;
>+  unsigned long nr_pages = size >> PAGE_SHIFT;
>+  struct zone *zone;
>+
>+  /*
>+   * FIXME: Cleanup page tables (also in arch_add_memory() in case
>+   * adding fails). Until then, this function should only be used
>+   * during memory hotplug (adding memory), not for memory
>+   * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
>+   * unlocked yet.
>+   */
>+  zone = page_zone(pfn_to_page(start_pfn));

Compare this with arch_remove_memory() on x86: if altmap is not NULL, the zone
is retrieved from the page offset by the altmap. Why is this not the same here?
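
A rough userspace sketch of the difference in question. Everything here is a
simplified model: the struct layouts, the zone_for_removal() helper, and the
assumption that the altmap offset is reserve + free are for illustration only
and are not taken from this patch.

```c
#include <stddef.h>

/* Userspace model only; real struct page and struct vmem_altmap differ. */
struct page_model {
	int zone_id;		/* -1 stands for "never initialized" */
};

struct altmap_model {
	unsigned long reserve;	/* pfns reserved at the start of the range */
	unsigned long free;	/* pfns used to hold the memmap itself */
};

/* Assumed: number of pfns to skip before the first usable page. */
static unsigned long altmap_offset_model(const struct altmap_model *altmap)
{
	return altmap->reserve + altmap->free;
}

/* Hypothetical helper: pick the zone the way the x86 variant is described. */
static int zone_for_removal(const struct page_model *first_page,
			    const struct altmap_model *altmap)
{
	const struct page_model *page = first_page;

	/* With an altmap, the first mapped page is offset from the start. */
	if (altmap)
		page += altmap_offset_model(altmap);
	return page->zone_id;
}
```

Without the offset, the lookup would land on a page that was never handed to
a zone in this model.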

>+  __remove_pages(zone, start_pfn, nr_pages, altmap);
>+}
>+#endif
> #endif
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2 4/8] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-05-09 Thread Wei Yang
On Thu, May 09, 2019 at 04:58:56PM +0200, David Hildenbrand wrote:
>On 09.05.19 16:31, Wei Yang wrote:
>> On Tue, May 07, 2019 at 08:38:00PM +0200, David Hildenbrand wrote:
>>> Only memory to be added to the buddy and to be onlined/offlined by
>>> user space using memory block devices needs (and should have!) memory
>>> block devices.
>>>
>>> Factor out creation of memory block devices. Create all devices after
>>> arch_add_memory() succeeded. We can later drop the want_memblock parameter,
>>> because it is now effectively stale.
>>>
>>> Only after memory block devices have been added, memory can be onlined
>>> by user space. This implies, that memory is not visible to user space at
>>> all before arch_add_memory() succeeded.
>>>
>>> Cc: Greg Kroah-Hartman 
>>> Cc: "Rafael J. Wysocki" 
>>> Cc: David Hildenbrand 
>>> Cc: "mike.tra...@hpe.com" 
>>> Cc: Andrew Morton 
>>> Cc: Ingo Molnar 
>>> Cc: Andrew Banman 
>>> Cc: Oscar Salvador 
>>> Cc: Michal Hocko 
>>> Cc: Pavel Tatashin 
>>> Cc: Qian Cai 
>>> Cc: Wei Yang 
>>> Cc: Arun KS 
>>> Cc: Mathieu Malaterre 
>>> Signed-off-by: David Hildenbrand 
>>> ---
>>> drivers/base/memory.c  | 70 ++
>>> include/linux/memory.h |  2 +-
>>> mm/memory_hotplug.c| 15 -
>>> 3 files changed, 53 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index 6e0cb4fda179..862c202a18ca 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr)
>>> return 0;
>>> }
>>>
>>> +static void unregister_memory(struct memory_block *memory)
>>> +{
>>> +   BUG_ON(memory->dev.bus != &memory_subsys);
>>> +
>>> +   /* drop the ref. we got via find_memory_block() */
>>> +   put_device(&memory->dev);
>>> +   device_unregister(&memory->dev);
>>> +}
>>> +
>>> /*
>>> - * need an interface for the VM to add new memory regions,
>>> - * but without onlining it.
>>> + * Create memory block devices for the given memory area. Start and size
>>> + * have to be aligned to memory block granularity. Memory block devices
>>> + * will be initialized as offline.
>>>  */
>>> -int hotplug_memory_register(int nid, struct mem_section *section)
>>> +int hotplug_memory_register(unsigned long start, unsigned long size)
>> 
>> One trivial suggestion about the function name.
>> 
>> For memory_block device, sometimes we use the full name
>> 
>> find_memory_block
>> init_memory_block
>> add_memory_block
>> 
>> But sometimes we use *nick* name
>> 
>> hotplug_memory_register
>> register_memory
>> unregister_memory
>> 
>> This is a little bit confusing.
>> 
>> Can we use one naming convention here?
>
>We can just go for
>
>crate_memory_blocks() and free_memory_blocks(). Or do
>you have better suggestions?

s/crate/create/

Looks good to me.

>
>(I would actually even prefer "memory_block_devices", because memory
>blocks have different meanings)
>

Agreed, this came to my mind some time ago :-)

>> 
>> [...]
>> 
>>> /*
>>> @@ -1106,6 +1100,13 @@ int __ref add_memory_resource(int nid, struct 
>>> resource *res)
>>> if (ret < 0)
>>> goto error;
>>>
>>> +   /* create memory block devices after memory was added */
>>> +   ret = hotplug_memory_register(start, size);
>>> +   if (ret) {
>>> +       arch_remove_memory(nid, start, size, NULL);
>> 
>> Functionally, it works I think.
>> 
>> But arch_remove_memory() would remove pages from zone. At this point, we just
>> allocate section/mmap for pages, the zones are empty and pages are not
>> connected to zone.
>> 
>> zone = page_zone(page) always returns zone #0, since page->flags is 0
>> at this point. This is not accurate.
>> 
>> Should we add a comment to mention this? Or do we need to clean up
>> arch_remove_memory() to take out __remove_zone()?
>
>That is precisely what is on my list next (see cover letter). This is
>already broken when memory that was never onlined is removed again.
>So I am planning to fix that independently.
>

Sounds great :-)

Hope you will cc me on the follow-up series.

>
>-- 
>
>Thanks,
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2 4/8] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-05-09 Thread Wei Yang
On Tue, May 07, 2019 at 08:38:00PM +0200, David Hildenbrand wrote:
>Only memory to be added to the buddy and to be onlined/offlined by
>user space using memory block devices needs (and should have!) memory
>block devices.
>
>Factor out creation of memory block devices. Create all devices after
>arch_add_memory() succeeded. We can later drop the want_memblock parameter,
>because it is now effectively stale.
>
>Only after memory block devices have been added, memory can be onlined
>by user space. This implies, that memory is not visible to user space at
>all before arch_add_memory() succeeded.
>
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Cc: David Hildenbrand 
>Cc: "mike.tra...@hpe.com" 
>Cc: Andrew Morton 
>Cc: Ingo Molnar 
>Cc: Andrew Banman 
>Cc: Oscar Salvador 
>Cc: Michal Hocko 
>Cc: Pavel Tatashin 
>Cc: Qian Cai 
>Cc: Wei Yang 
>Cc: Arun KS 
>Cc: Mathieu Malaterre 
>Signed-off-by: David Hildenbrand 
>---
> drivers/base/memory.c  | 70 ++
> include/linux/memory.h |  2 +-
> mm/memory_hotplug.c| 15 -
> 3 files changed, 53 insertions(+), 34 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index 6e0cb4fda179..862c202a18ca 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr)
>   return 0;
> }
> 
>+static void unregister_memory(struct memory_block *memory)
>+{
>+  BUG_ON(memory->dev.bus != &memory_subsys);
>+
>+  /* drop the ref. we got via find_memory_block() */
>+  put_device(&memory->dev);
>+  device_unregister(&memory->dev);
>+}
>+
> /*
>- * need an interface for the VM to add new memory regions,
>- * but without onlining it.
>+ * Create memory block devices for the given memory area. Start and size
>+ * have to be aligned to memory block granularity. Memory block devices
>+ * will be initialized as offline.
>  */
>-int hotplug_memory_register(int nid, struct mem_section *section)
>+int hotplug_memory_register(unsigned long start, unsigned long size)

One trivial suggestion about the function name.

For memory_block device, sometimes we use the full name

find_memory_block
init_memory_block
add_memory_block

But sometimes we use *nick* name

hotplug_memory_register
register_memory
unregister_memory

This is a little bit confusing.

Can we use one naming convention here?

[...]

> /*
>@@ -1106,6 +1100,13 @@ int __ref add_memory_resource(int nid, struct resource 
>*res)
>   if (ret < 0)
>   goto error;
> 
>+  /* create memory block devices after memory was added */
>+  ret = hotplug_memory_register(start, size);
>+  if (ret) {
>+  arch_remove_memory(nid, start, size, NULL);

Functionally, it works I think.

But arch_remove_memory() would remove pages from zone. At this point, we just
allocate section/mmap for pages, the zones are empty and pages are not
connected to zone.

zone = page_zone(page) always returns zone #0, since page->flags is 0
at this point. This is not accurate.

Should we add a comment to mention this? Or do we need to clean up
arch_remove_memory() to take out __remove_zone()?
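
To illustrate the point, a minimal userspace sketch, assuming the zone index
is encoded as a small bit field inside page->flags. The shift and mask values
and all *_model names are made up for this example.

```c
/* Userspace model only; the shift and mask are assumed values. */
#define MODEL_ZONES_SHIFT	8
#define MODEL_ZONES_MASK	0x3UL

struct page_model {
	unsigned long flags;
};

/* Extract the zone index from the flags word, as page_zone() relies on. */
static unsigned int page_zonenum_model(const struct page_model *page)
{
	return (unsigned int)((page->flags >> MODEL_ZONES_SHIFT) &
			      MODEL_ZONES_MASK);
}
```

A page whose flags are still zero decodes to zone index 0 no matter which
zone the memory would eventually be onlined to.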


>+  goto error;
>+  }
>+
>   if (new_node) {
>   /* If sysfs file of new node can't be created, cpu on the node
>* can't be hot-added. There is no rollback way now.
>-- 
>2.20.1

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2 4/8] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-05-09 Thread Wei Yang
On Tue, May 07, 2019 at 08:38:00PM +0200, David Hildenbrand wrote:
>Only memory to be added to the buddy and to be onlined/offlined by
>user space using memory block devices needs (and should have!) memory
>block devices.
>
>Factor out creation of memory block devices. Create all devices after
>arch_add_memory() succeeded. We can later drop the want_memblock parameter,
>because it is now effectively stale.
>
>Only after memory block devices have been added, memory can be onlined
>by user space. This implies, that memory is not visible to user space at
>all before arch_add_memory() succeeded.
>
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Cc: David Hildenbrand 
>Cc: "mike.tra...@hpe.com" 
>Cc: Andrew Morton 
>Cc: Ingo Molnar 
>Cc: Andrew Banman 
>Cc: Oscar Salvador 
>Cc: Michal Hocko 
>Cc: Pavel Tatashin 
>Cc: Qian Cai 
>Cc: Wei Yang 
>Cc: Arun KS 
>Cc: Mathieu Malaterre 
>Signed-off-by: David Hildenbrand 
>---
> drivers/base/memory.c  | 70 ++
> include/linux/memory.h |  2 +-
> mm/memory_hotplug.c| 15 -
> 3 files changed, 53 insertions(+), 34 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index 6e0cb4fda179..862c202a18ca 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr)
>   return 0;
> }
> 
>+static void unregister_memory(struct memory_block *memory)
>+{
>+  BUG_ON(memory->dev.bus != &memory_subsys);
>+
>+  /* drop the ref. we got via find_memory_block() */
>+  put_device(&memory->dev);
>+  device_unregister(&memory->dev);
>+}
>+
> /*
>- * need an interface for the VM to add new memory regions,
>- * but without onlining it.
>+ * Create memory block devices for the given memory area. Start and size
>+ * have to be aligned to memory block granularity. Memory block devices
>+ * will be initialized as offline.
>  */
>-int hotplug_memory_register(int nid, struct mem_section *section)
>+int hotplug_memory_register(unsigned long start, unsigned long size)
> {
>-  int ret = 0;
>+  unsigned long block_nr_pages = memory_block_size_bytes() >> PAGE_SHIFT;
>+  unsigned long start_pfn = PFN_DOWN(start);
>+  unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);
>+  unsigned long pfn;
>   struct memory_block *mem;
>+  int ret = 0;
> 
>-  mutex_lock(&mem_sysfs_mutex);
>+  BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes()));
>+  BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes()));
> 
>-  mem = find_memory_block(section);
>-  if (mem) {
>-  mem->section_count++;
>-  put_device(&mem->dev);
>-  } else {
>-  ret = init_memory_block(&mem, section, MEM_OFFLINE);
>+  mutex_lock(&mem_sysfs_mutex);
>+  for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
>+  mem = find_memory_block(__pfn_to_section(pfn));
>+  if (mem) {
>+  WARN_ON_ONCE(false);

One question here: what is the purpose of WARN_ON_ONCE(false)? Would we ever
trigger this?
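
For reference, a small userspace model of the semantics being asked about. It
is not the kernel macro: it only assumes that WARN_ON_ONCE(cond) warns at most
once per call site, and only when cond is true. The counter and the two
hit_*() helpers are hypothetical.

```c
#include <stdbool.h>

/* Userspace model only; counts how many warnings would have been printed. */
static int warn_count;

#define WARN_ON_ONCE_MODEL(cond)			\
	do {						\
		static bool warned;			\
		if ((cond) && !warned) {		\
			warned = true;			\
			warn_count++;			\
		}					\
	} while (0)

/* Call site as written in the patch: the condition is constant false. */
static void hit_with_false(void)
{
	WARN_ON_ONCE_MODEL(false);
}

/* Same call site with a true condition, for comparison. */
static void hit_with_true(void)
{
	WARN_ON_ONCE_MODEL(true);
}
```

Under this model the false variant can never warn; if the intent is to flag
an already existing memory block, a true condition would be needed.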

>+  put_device(&mem->dev);
>+  continue;
>+  }
>+  ret = init_memory_block(&mem, __pfn_to_section(pfn),
>+  MEM_OFFLINE);
>   if (ret)
>-  goto out;
>-  mem->section_count++;
>+  break;
>+  mem->section_count = memory_block_size_bytes() /
>+   MIN_MEMORY_BLOCK_SIZE;

Maybe we can leverage the sections_per_block variable.

mem->section_count = sections_per_block;
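
A quick userspace sketch of why the two expressions should agree, assuming
sections_per_block is computed once as the block size divided by
MIN_MEMORY_BLOCK_SIZE. The constants (27 section-size bits, four sections per
block) and all model_* names are assumptions for illustration.

```c
/* Userspace model only; constants are assumptions for illustration. */
#define MODEL_SECTION_SIZE_BITS		27UL
#define MODEL_MIN_MEMORY_BLOCK_SIZE	(1UL << MODEL_SECTION_SIZE_BITS)

static unsigned long model_block_size_bytes = 4UL * MODEL_MIN_MEMORY_BLOCK_SIZE;

/* Models the value cached in sections_per_block at init time. */
static unsigned long model_sections_per_block(void)
{
	return model_block_size_bytes / MODEL_MIN_MEMORY_BLOCK_SIZE;
}

/* Models the expression the patch open-codes for mem->section_count. */
static unsigned long model_section_count_from_patch(void)
{
	return model_block_size_bytes / MODEL_MIN_MEMORY_BLOCK_SIZE;
}
```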

>+  }
>+  if (ret) {
>+  end_pfn = pfn;
>+      for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
>+  mem = find_memory_block(__pfn_to_section(pfn));
>+  if (!mem)
>+  continue;
>+  mem->section_count = 0;
>+  unregister_memory(mem);
>+  }
>   }

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2 4/8] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-05-09 Thread Wei Yang
On Tue, May 07, 2019 at 08:38:00PM +0200, David Hildenbrand wrote:
>Only memory to be added to the buddy and to be onlined/offlined by
>user space using memory block devices needs (and should have!) memory
>block devices.
>
>Factor out creation of memory block devices. Create all devices after
>arch_add_memory() succeeded. We can later drop the want_memblock parameter,
>because it is now effectively stale.
>
>Only after memory block devices have been added, memory can be onlined
>by user space. This implies, that memory is not visible to user space at
>all before arch_add_memory() succeeded.
>
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Cc: David Hildenbrand 
>Cc: "mike.tra...@hpe.com" 
>Cc: Andrew Morton 
>Cc: Ingo Molnar 
>Cc: Andrew Banman 
>Cc: Oscar Salvador 
>Cc: Michal Hocko 
>Cc: Pavel Tatashin 
>Cc: Qian Cai 
>Cc: Wei Yang 
>Cc: Arun KS 
>Cc: Mathieu Malaterre 
>Signed-off-by: David Hildenbrand 
>---
> drivers/base/memory.c  | 70 ++
> include/linux/memory.h |  2 +-
> mm/memory_hotplug.c| 15 -
> 3 files changed, 53 insertions(+), 34 deletions(-)
>
>diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>index 6e0cb4fda179..862c202a18ca 100644
>--- a/drivers/base/memory.c
>+++ b/drivers/base/memory.c
>@@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr)
>   return 0;
> }
> 
>+static void unregister_memory(struct memory_block *memory)
>+{
>+  BUG_ON(memory->dev.bus != &memory_subsys);
>+
>+  /* drop the ref. we got via find_memory_block() */
>+  put_device(&memory->dev);
>+  device_unregister(&memory->dev);
>+}
>+
> /*
>- * need an interface for the VM to add new memory regions,
>- * but without onlining it.
>+ * Create memory block devices for the given memory area. Start and size
>+ * have to be aligned to memory block granularity. Memory block devices
>+ * will be initialized as offline.
>  */
>-int hotplug_memory_register(int nid, struct mem_section *section)
>+int hotplug_memory_register(unsigned long start, unsigned long size)
> {
>-  int ret = 0;
>+  unsigned long block_nr_pages = memory_block_size_bytes() >> PAGE_SHIFT;
>+  unsigned long start_pfn = PFN_DOWN(start);
>+  unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);
>+  unsigned long pfn;
>   struct memory_block *mem;
>+  int ret = 0;
> 
>-  mutex_lock(&mem_sysfs_mutex);
>+  BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes()));
>+  BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes()));

After this change, the call flow looks like this:

add_memory_resource
    check_hotplug_memory_range
    hotplug_memory_register

Since check_hotplug_memory_range() has already checked the boundary, do we
need to check here again?

-- 
Wei Yang
Help you, Help me


Re: [PATCH v2 1/8] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()

2019-05-09 Thread Wei Yang
On Tue, May 07, 2019 at 08:37:57PM +0200, David Hildenbrand wrote:
>By converting start and size to page granularity, we actually ignore
>unaligned parts within a page instead of properly bailing out with an
>error.
>
>Cc: Andrew Morton 
>Cc: Oscar Salvador 
>Cc: Michal Hocko 
>Cc: David Hildenbrand 
>Cc: Pavel Tatashin 
>Cc: Qian Cai 
>Cc: Wei Yang 
>Cc: Arun KS 
>Cc: Mathieu Malaterre 
>Signed-off-by: David Hildenbrand 

Reviewed-by: Wei Yang 
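
For the record, a minimal userspace sketch of the problem the commit message
describes: once start and size are shifted down to page granularity, bits
below the page size no longer take part in the alignment check. The page
shift, block size and function names are assumptions for illustration, not
the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; values are assumed for illustration. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_BLOCK_SIZE	(128ULL << 20)
#define MODEL_IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/* Check in units of pages: sub-page bits are silently dropped. */
static bool range_ok_page_granularity(uint64_t start, uint64_t size)
{
	uint64_t block_nr_pages = MODEL_BLOCK_SIZE >> MODEL_PAGE_SHIFT;
	uint64_t start_pfn = start >> MODEL_PAGE_SHIFT;
	uint64_t nr_pages = size >> MODEL_PAGE_SHIFT;

	return nr_pages && MODEL_IS_ALIGNED(start_pfn, block_nr_pages) &&
	       MODEL_IS_ALIGNED(nr_pages, block_nr_pages);
}

/* Check in units of bytes: any misalignment is rejected. */
static bool range_ok_byte_granularity(uint64_t start, uint64_t size)
{
	return size && MODEL_IS_ALIGNED(start, MODEL_BLOCK_SIZE) &&
	       MODEL_IS_ALIGNED(size, MODEL_BLOCK_SIZE);
}
```

A start address that is one byte past a block boundary passes the first check
but fails the second, which is the case the patch wants to bail out on.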


-- 
Wei Yang
Help you, Help me


Re: [PATCH RFCv2 1/4] mm/memory_hotplug: Introduce memory block types

2018-12-03 Thread Wei Yang
[...]
>>>
>>> +   if (type == MEMORY_BLOCK_NONE)
>>> +   return -EINVAL;
>> 
>> No one will pass in this value. Can we omit this check for now?
>
>I could move it to patch nr 2 I guess, but as I introduce
>MEMORY_BLOCK_NONE here it made sense to keep it in here.
>

Yes, this makes sense to me now.

>(and I think at least for now it makes sense to not squash patch 1 and
>2, to easier discuss the new user interface/concept introduced in this
>patch).
>
>Thanks!
>
>-- 
>
>Thanks,
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me


Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types

2018-12-01 Thread Wei Yang
On Fri, Nov 30, 2018 at 06:59:18PM +0100, David Hildenbrand wrote:
>This is the second approach, introducing more meaningful memory block
>types and not changing online behavior in the kernel. It is based on
>latest linux-next.
>
>As we found out during discussion, user space should always handle onlining
>of memory, in any case. However in order to make smart decisions in user
>space about if and how to online memory, we have to export more information
>about memory blocks. This way, we can formulate rules in user space.
>
>One such information is the type of memory block we are talking about.
>This helps to answer some questions like:
>- Does this memory block belong to a DIMM?
>- Can this DIMM theoretically ever be unplugged again?
>- Was this memory added by a balloon driver that will rely on balloon
>  inflation to remove chunks of that memory again? Which zone is advised?
>- Is this special standby memory on s390x that is usually not automatically
>  onlined?
>
>And in short it helps to answer to some extent (excluding zone imbalances)
>- Should I online this memory block?
>- To which zone should I online this memory block?
>... of course special use cases will result in different answers. But that's
>why user space has control of onlining memory.
>
>More details can be found in Patch 1 and Patch 3.
>Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.
>
>
>Example:
>$ udevadm info -q all -a /sys/devices/system/memory/memory0
>   KERNEL=="memory0"
>   SUBSYSTEM=="memory"
>   DRIVER==""
>   ATTR{online}=="1"
>   ATTR{phys_device}=="0"
>   ATTR{phys_index}==""
>   ATTR{removable}=="0"
>   ATTR{state}=="online"
>   ATTR{type}=="boot"
>   ATTR{valid_zones}=="none"
>$ udevadm info -q all -a /sys/devices/system/memory/memory90
>   KERNEL=="memory90"
>   SUBSYSTEM=="memory"
>   DRIVER==""
>   ATTR{online}=="1"
>   ATTR{phys_device}=="0"
>   ATTR{phys_index}=="005a"
>   ATTR{removable}=="1"
>   ATTR{state}=="online"
>   ATTR{type}=="dimm"
>   ATTR{valid_zones}=="Normal"
>
>
>RFC -> RFCv2:
>- Now also taking care of PPC (somehow missed it :/ )
>- Split the series up to some degree (some ideas on how to split up patch 3
>  would be very welcome)
>- Introduce more memory block types. Turns out abstracting too much was
>  rather confusing and not helpful. Properly document them.
>
>Notes:
>- I wanted to convert the enum of types into a named enum but this
>  provoked all kinds of different errors. For now, I am doing it just like
>  the other types (e.g. online_type) we are using in that context.
>- The "removable" property should never have been named like that. It
>  should have been "offlinable". Can we still rename that? E.g. boot memory
>  is sometimes marked as removable ...
>

This makes sense to me. "Remove" usually describes the physical hotplug phase,
if I am correct.

-- 
Wei Yang
Help you, Help me


Re: [PATCH RFCv2 2/4] mm/memory_hotplug: Replace "bool want_memblock" by "int type"

2018-11-30 Thread Wei Yang
On Fri, Nov 30, 2018 at 06:59:20PM +0100, David Hildenbrand wrote:
>Let's pass a memory block type instead. Pass "MEMORY_BLOCK_NONE" for device
>memory and for now "MEMORY_BLOCK_UNSPECIFIED" for anything else. No
>functional change.

I would suggest describing this in more detail.

"
Function arch_add_memory()'s last parameter *want_memblock* is used to
determine whether it is necessary to create a corresponding memory block
device. After introducing the memory block type, this patch replaces the
bool type *want_memblock* with the memory block type, with the following
rules for now:

  * Pass "MEMORY_BLOCK_NONE" for device memory
  * Pass "MEMORY_BLOCK_UNSPECIFIED" for anything else 

Since this parameter is passed down to __add_section(), all functions on
the way are affected. They are listed below.

  arch_add_memory()
    add_pages()
      __add_pages()
        __add_section()

"

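A toy userspace model of the chain described above, under the assumption that
a type of MEMORY_BLOCK_NONE takes over the role of want_memblock == false.
The model_* functions and the counter are hypothetical; only the enum names
come from the series.

```c
/* Userspace model only; mirrors the enum introduced by patch 1. */
enum {
	MEMORY_BLOCK_NONE = 0,
	MEMORY_BLOCK_UNSPECIFIED,
	MEMORY_BLOCK_BOOT,
};

static int model_blocks_created;

/* Leaf of the chain: create a block device unless the type says "none". */
static int model_add_section(int type)
{
	if (type == MEMORY_BLOCK_NONE)
		return 0;
	model_blocks_created++;
	return 0;
}

/* Middle of the chain: just forwards the type for every section. */
static int model_add_pages(unsigned long nr_sections, int type)
{
	unsigned long i;
	int ret;

	for (i = 0; i < nr_sections; i++) {
		ret = model_add_section(type);
		if (ret)
			return ret;
	}
	return 0;
}

/* Top of the chain: the arch hook passes the type through unchanged. */
static int model_arch_add_memory(unsigned long nr_sections, int type)
{
	return model_add_pages(nr_sections, type);
}
```

Device memory would pass MEMORY_BLOCK_NONE and get no block devices in this
model; everything else gets one per section-sized step.
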
>
>Cc: Tony Luck 
>Cc: Fenghua Yu 
>Cc: Benjamin Herrenschmidt 
>Cc: Paul Mackerras 
>Cc: Michael Ellerman 
>Cc: Martin Schwidefsky 
>Cc: Heiko Carstens 
>Cc: Yoshinori Sato 
>Cc: Rich Felker 
>Cc: Dave Hansen 
>Cc: Andy Lutomirski 
>Cc: Peter Zijlstra 
>Cc: Thomas Gleixner 
>Cc: Ingo Molnar 
>Cc: Borislav Petkov 
>Cc: "H. Peter Anvin" 
>Cc: x...@kernel.org
>Cc: Greg Kroah-Hartman 
>Cc: "Rafael J. Wysocki" 
>Cc: Andrew Morton 
>Cc: Mike Rapoport 
>Cc: Michal Hocko 
>Cc: Dan Williams 
>Cc: "Kirill A. Shutemov" 
>Cc: Oscar Salvador 
>Cc: Nicholas Piggin 
>Cc: Stephen Rothwell 
>Cc: Christophe Leroy 
>Cc: "Jonathan Neuschäfer" 
>Cc: Mauricio Faria de Oliveira 
>Cc: Vasily Gorbik 
>Cc: Arun KS 
>Cc: Rob Herring 
>Cc: Pavel Tatashin 
>Cc: "mike.tra...@hpe.com" 
>Cc: Joonsoo Kim 
>Cc: Wei Yang 
>Cc: Logan Gunthorpe 
>Cc: "Jérôme Glisse" 
>Cc: "Jan H. Schönherr" 
>Cc: Dave Jiang 
>Cc: Matthew Wilcox 
>Cc: Mathieu Malaterre 
>Signed-off-by: David Hildenbrand 
>---
> arch/ia64/mm/init.c|  4 ++--
> arch/powerpc/mm/mem.c  |  4 ++--
> arch/s390/mm/init.c|  4 ++--
> arch/sh/mm/init.c  |  4 ++--
> arch/x86/mm/init_32.c  |  4 ++--
> arch/x86/mm/init_64.c  |  8 
> drivers/base/memory.c  | 11 +++
> include/linux/memory.h |  2 +-
> include/linux/memory_hotplug.h | 12 ++--
> kernel/memremap.c  |  6 --
> mm/memory_hotplug.c| 16 
> 11 files changed, 40 insertions(+), 35 deletions(-)
>
>diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
>index 904fe55e10fc..408635d2902f 100644
>--- a/arch/ia64/mm/init.c
>+++ b/arch/ia64/mm/init.c
>@@ -646,13 +646,13 @@ mem_init (void)
> 
> #ifdef CONFIG_MEMORY_HOTPLUG
> int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>-  bool want_memblock)
>+  int type)
> {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
>   int ret;
> 
>-  ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
>+  ret = __add_pages(nid, start_pfn, nr_pages, altmap, type);
>   if (ret)
>   printk("%s: Problem encountered in __add_pages() as ret=%d\n",
>  __func__,  ret);
>diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>index b3c9ee5c4f78..e394637da270 100644
>--- a/arch/powerpc/mm/mem.c
>+++ b/arch/powerpc/mm/mem.c
>@@ -118,7 +118,7 @@ int __weak remove_section_mapping(unsigned long start, 
>unsigned long end)
> }
> 
> int __meminit arch_add_memory(int nid, u64 start, u64 size, struct 
> vmem_altmap *altmap,
>-  bool want_memblock)
>+int type)
> {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
>@@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 
>size, struct vmem_altmap *
>   }
>   flush_inval_dcache_range(start, start + size);
> 
>-  return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
>+  return __add_pages(nid, start_pfn, nr_pages, altmap, type);
> }
> 
> #ifdef CONFIG_MEMORY_HOTREMOVE
>diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>index 3e82f66d5c61..ba2c56328e6d 100644
>--- a/arch/s390/mm/init.c
>+++ b/arch/s390/mm/init.c
>@@ -225,7 +225,7 @@ device_initcall(s390_cma_mem_init);
> #endif /* CONFIG_CMA */
> 
> int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>-  bool want_memblock)
>+  int type)
>

Re: [PATCH RFCv2 1/4] mm/memory_hotplug: Introduce memory block types

2018-11-30 Thread Wei Yang
of(*mem), GFP_KERNEL);
>   if (!mem)
>   return -ENOMEM;
>@@ -675,6 +704,7 @@ static int init_memory_block(struct memory_block **memory,
>   mem->state = state;
>   start_pfn = section_nr_to_pfn(mem->start_section_nr);
>   mem->phys_device = arch_get_memory_phys_device(start_pfn);
>+  mem->type = type;
> 
>   ret = register_memory(mem);
> 
>@@ -699,7 +729,8 @@ static int add_memory_block(int base_section_nr)
> 
>   if (section_count == 0)
>   return 0;
>-  ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);
>+  ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE,
>+  MEMORY_BLOCK_BOOT);
>   if (ret)
>   return ret;
>   mem->section_count = section_count;
>@@ -722,7 +753,8 @@ int hotplug_memory_register(int nid, struct mem_section 
>*section)
>   mem->section_count++;
>   put_device(&mem->dev);
>   } else {
>-  ret = init_memory_block(&mem, section, MEM_OFFLINE);
>+  ret = init_memory_block(&mem, section, MEM_OFFLINE,
>+  MEMORY_BLOCK_UNSPECIFIED);
>   if (ret)
>   goto out;
>   mem->section_count++;
>diff --git a/include/linux/memory.h b/include/linux/memory.h
>index d75ec88ca09d..06268e96e0da 100644
>--- a/include/linux/memory.h
>+++ b/include/linux/memory.h
>@@ -34,12 +34,39 @@ struct memory_block {
>   int (*phys_callback)(struct memory_block *);
>   struct device dev;
>   int nid;/* NID for this memory block */
>+  int type;   /* type of this memory block */
> };
> 
> int arch_get_memory_phys_device(unsigned long start_pfn);
> unsigned long memory_block_size_bytes(void);
> int set_memory_block_size_order(unsigned int order);
> 
>+/*
>+ * Memory block types allow user space to formulate rules if and how to
>+ * online memory blocks. The types are exposed to user space as text
>+ * strings in sysfs.
>+ *
>+ * MEMORY_BLOCK_NONE:
>+ *  No memory block is to be created (e.g. device memory). Not exposed to
>+ *  user space.
>+ *
>+ * MEMORY_BLOCK_UNSPECIFIED:
>+ *  The type of memory block was not further specified when adding the
>+ *  memory block.
>+ *
>+ * MEMORY_BLOCK_BOOT:
>+ *  This memory block was added during boot by the basic system. No
>+ *  specific device driver takes care of this memory block. This memory
>+ *  block type is onlined automatically by the kernel during boot and might
>+ *  later be managed by a different device driver, in which case the type
>+ *  might change.
>+ */
>+enum {
>+  MEMORY_BLOCK_NONE = 0,
>+  MEMORY_BLOCK_UNSPECIFIED,
>+  MEMORY_BLOCK_BOOT,
>+};
>+
> /* These states are exposed to userspace as text strings in sysfs */
> #define   MEM_ONLINE  (1<<0) /* exposed to userspace */
> #define   MEM_GOING_OFFLINE   (1<<1) /* exposed to userspace */
>-- 
>2.17.2

-- 
Wei Yang
Help you, Help me


Re: [PATCH] Extract initrd free logic from arch-specific code.

2018-03-28 Thread Wei Yang
On Wed, Mar 28, 2018 at 09:55:07AM -0700, Kees Cook wrote:
>On Wed, Mar 28, 2018 at 8:26 AM, Shea Levy <s...@shealevy.com> wrote:
>> Now only those architectures that have custom initrd free requirements
>> need to define free_initrd_mem.
>>
>> Signed-off-by: Shea Levy <s...@shealevy.com>
>
>Yay consolidation! :)
>
>> --- a/usr/Kconfig
>> +++ b/usr/Kconfig
>> @@ -233,3 +233,7 @@ config INITRAMFS_COMPRESSION
>> default ".lzma" if RD_LZMA
>> default ".bz2"  if RD_BZIP2
>> default ""
>> +
>> +config HAVE_ARCH_FREE_INITRD_MEM
>> +   bool
>> +   default n
>
>If you keep the Kconfig, you can leave off "default n", and I'd
>suggest adding a help section just to describe what the per-arch
>responsibilities are when select-ing the config. (See
>HAVE_ARCH_SECCOMP_FILTER for an example.)
>

One question about this change.

The original code would "select" HAVE_ARCH_FREE_INITRD_MEM on those archs.
After this change, do we need to "select" it manually?

>-Kees
>
>-- 
>Kees Cook
>Pixel Security

-- 
Wei Yang
Help you, Help me


Re: [PATCH] powerpc/iommu: use iommu_num_pages() to calculate the number of iommu page

2015-11-11 Thread Wei Yang
Someone willing to take a look?

On Fri, Oct 02, 2015 at 06:51:59AM +0800, Wei Yang wrote:
>On Thu, Oct 01, 2015 at 02:15:45PM +1000, Michael Ellerman wrote:
>>On Thu, 2015-10-01 at 07:50 +0800, Wei Yang wrote:
>>> Hmm... some comments on this one? like it or not?
>>
>>It sounds like it's fixing a bug, but you don't really say. Have you seen this
>>fail in the wild?
>
>Hmm... as described in the commit log, this would be a bug when 
>PAGE_SIZE is much smaller than IOMMU Page Size. This configuration doesn't
>happen on current platform. So I didn't see this failure yet.
>
>>
>>Which commit introduced the breakage?
>
>Hmm...  maybe we could say it is 'commit d084775738b7
>'.
>>From this commit, powerpc iommu supports dynamic iommu page size. 
>
>Before this commit, the size is aligned with PAGE_SIZE, so the value after
>shift would be non-zero. After this commit, when PAGE_SIZE is smaller than
>the IOMMU Page Size, shift right will make the size 0.
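To make the failure mode described above concrete, here is a minimal userspace sketch. The helper names pages_by_shift() and pages_round_up() are hypothetical; the second is modelled on what iommu_num_pages() computes (offset within the IOMMU page plus length, rounded up), and the 4K/64K sizes are only example values.

```c
/* Hypothetical helper: the plain shift, which truncates toward zero. */
static unsigned long pages_by_shift(unsigned long len,
				    unsigned int io_page_shift)
{
	return len >> io_page_shift;
}

/*
 * Hypothetical helper modelled on iommu_num_pages(): add the offset of
 * addr within its IOMMU page to len, then round up to whole IOMMU pages.
 */
static unsigned long pages_round_up(unsigned long addr, unsigned long len,
				    unsigned long io_page_size)
{
	unsigned long size = (addr & (io_page_size - 1)) + len;

	return (size + io_page_size - 1) / io_page_size;
}
```

With a 4K buffer and a 64K IOMMU page, pages_by_shift(4096, 16) yields 0 while pages_round_up(0, 4096, 65536) yields 1.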
>
>>
>>cheers
>>
>
>-- 
>Richard Yang
>Help you, Help me

-- 
Richard Yang
Help you, Help me

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V13 0/9] VF EEH on Power8

2015-11-08 Thread Wei Yang
On Mon, Nov 09, 2015 at 10:53:17AM +1100, Alexey Kardashevskiy wrote:
>On 11/08/2015 10:30 AM, Wei Yang wrote:
>>This patchset enables EEH on SRIOV VFs. The general idea is to create a
>>proper VF edev and VF PE and handle them properly.
>>
>>Unlike a Bus PE, a VF PE contains just one VF. This makes EEH error handling
>>on a VF PE differ in several ways.
>>
>>First, the VF's removal and re-enumeration rely on its PF. A VF is tightly
>>coupled to its PF, so it is not proper to enumerate a VF by the usual scan
>>procedure. That's why virtfn_add/virtfn_remove are exported in this patch
>>set.
>>
>>Second, the reset/restore of a VF is done in kernel space. FW is not aware of
>>the VF, which means the usual reset function done in FW will not work. One of
>>the patches imitates the reset/restore function in kernel space.
>>
>>Third, the VF may be removed during the PF's error_detected function. In this
>>case, the original error_detected->slot_reset->resume sequence is not proper
>>for those removed VFs, since they are re-created by the PF in a fresh state. A
>>flag in eeh_dev is introduced to mark that the eeh_dev is in error state. By
>>doing so, we track whether this device needs to be reset or not.
>>
>>This has been tested both on host and in guest on Power8 with the latest
>>kernel version.
>
>
>This does not apply on top of either Linus's master tree
>(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git sha1
>ce5c2d2) nor Michael's PPC next tree
>(git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git sha1
>8bdf2023). What did you base your work on?
>

I didn't base this on the latest code.

This is based on v4.3, commit "6a13feb Linux 4.3".

>
>-- 
>Alexey

-- 
Richard Yang
Help you, Help me

[PATCH V13 4/9] powerpc/eeh: Cache only BARs, not windows or IOV BARs

2015-11-07 Thread Wei Yang
This restricts the EEH address cache to use only the first 7 BARs. This
makes __eeh_addr_cache_insert_dev() ignore PCI bridge windows and IOV BARs.
As a result of this change, eeh_addr_cache_get_dev() will return the VF,
instead of its parent PF, for an address in the VF's resources.

This removes the extra check for a PCI bridge, as we limit
__eeh_addr_cache_insert_dev() to 7 BARs and this effectively excludes PCI
bridges from being cached.
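A small sketch of why the new loop bound stops short of IOV BARs and bridge windows. The enum below is a local, illustrative copy of the resource index layout (assuming CONFIG_PCI_IOV is enabled); the SK_ names and the two helpers are hypothetical and exist only in this sketch.

```c
/* Illustrative copy of the resource index layout; names are local here. */
enum sketch_resource {
	SK_STD_RESOURCES = 0,			/* BAR0 */
	SK_STD_RESOURCE_END = 5,		/* BAR5 */
	SK_ROM_RESOURCE,			/* expansion ROM, index 6 */
	SK_IOV_RESOURCES,			/* first IOV BAR */
	SK_IOV_RESOURCE_END = SK_IOV_RESOURCES + 5,
	SK_BRIDGE_RESOURCES,			/* first bridge window */
	SK_BRIDGE_RESOURCE_END = SK_BRIDGE_RESOURCES + 3,
	SK_DEVICE_COUNT_RESOURCE
};

/* Count the indexes visited by "for (i = 0; i <= ROM; i++)". */
static int cached_resource_count(void)
{
	int i, n = 0;

	for (i = 0; i <= SK_ROM_RESOURCE; i++)
		n++;
	return n;
}

/* True if the resource index would be poked into the address cache. */
static int is_cached(int idx)
{
	return idx >= 0 && idx <= SK_ROM_RESOURCE;
}
```

Under this layout the loop visits 7 indexes (6 BARs plus the ROM BAR), and both the first IOV BAR and the first bridge window fall outside it.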

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh_cache.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
index a1e86e1..ddbcfab 100644
--- a/arch/powerpc/kernel/eeh_cache.c
+++ b/arch/powerpc/kernel/eeh_cache.c
@@ -195,8 +195,11 @@ static void __eeh_addr_cache_insert_dev(struct pci_dev 
*dev)
return;
}
 
-   /* Walk resources on this device, poke them into the tree */
-   for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
+   /*
+* Walk resources on this device, poke the first 7 (6 normal BAR and 1
+* ROM BAR) into the tree.
+*/
+   for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
resource_size_t start = pci_resource_start(dev,i);
resource_size_t end = pci_resource_end(dev,i);
unsigned long flags = pci_resource_flags(dev,i);
@@ -222,10 +225,6 @@ void eeh_addr_cache_insert_dev(struct pci_dev *dev)
 {
unsigned long flags;
 
-   /* Ignore PCI bridges */
-   if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
-   return;
-
spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
__eeh_addr_cache_insert_dev(dev);
spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
-- 
2.5.0

[PATCH V13 5/9] powerpc/powernv: EEH device for VF

2015-11-07 Thread Wei Yang
VFs and their corresponding pci_dn instances are created and released
dynamically as their PF's SRIOV capability is enabled and disabled.
The patch creates and releases EEH devices for VFs when creating and
releasing their pci_dn instances, which means EEH devices and pci_dn
instances have the same life cycle. Also, a VF's EEH device is identified
by (struct eeh_dev::physfn).

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h |  1 +
 arch/powerpc/kernel/pci_dn.c   | 13 +
 2 files changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index c5eb86f..6c383ad 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -140,6 +140,7 @@ struct eeh_dev {
struct pci_controller *phb; /* Associated PHB   */
struct pci_dn *pdn; /* Associated PCI device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
+   struct pci_dev *physfn; /* Associated PF PORT   */
struct pci_bus *bus;/* PCI bus for partial hotplug  */
 };
 
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index b3b4df9..5091b05 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -178,7 +178,9 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
*parent,
 struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
 {
 #ifdef CONFIG_PCI_IOV
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pci_dn *parent, *pdn;
+   struct eeh_dev *edev;
int i;
 
/* Only support IOV for now */
@@ -204,6 +206,10 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
 __func__, i);
return NULL;
}
+   eeh_dev_init(pdn, hose);
+   edev = pdn_to_eeh_dev(pdn);
+   BUG_ON(!edev);
+   edev->physfn = pdev;
}
 #endif /* CONFIG_PCI_IOV */
 
@@ -252,10 +258,17 @@ void remove_dev_pci_data(struct pci_dev *pdev)
for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
list_for_each_entry_safe(pdn, tmp,
&parent->child_list, list) {
+   struct eeh_dev *edev;
if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
continue;
 
+   edev = pdn_to_eeh_dev(pdn);
+   if (edev) {
+   pdn->edev = NULL;
+   kfree(edev);
+   }
+
if (!list_empty(&pdn->list))
list_del(&pdn->list);
 
-- 
2.5.0

[PATCH V13 8/9] powerpc/powernv: Support PCI config restore for VFs

2015-11-07 Thread Wei Yang
After PE reset, OPAL API opal_pci_reinit() is called on all devices
contained in the PE to reinitialize them. Since skiboot is not aware of
VFs, we have to implement the function in the kernel to reinitialize VFs
after a reset on a PE for VFs.

In this patch, two functions, pnv_pci_fixup_vf_mps() and
pnv_eeh_restore_vf_config(), both manipulate the MPS of the VF, since a VF
goes through one of three cases.

1. Normal creation for a VF
   In this case, pnv_pci_fixup_vf_mps() is called to make the MPS a proper
   value compared with its parent.
2. EEH recovery without VF removed
   In this case, MPS is stored in pci_dn and pnv_eeh_restore_vf_config() is
   called to restore it and reinitialize the other parts.
3. EEH recovery with VF removed
   In this case, the VF will be removed and then re-created. Both functions
   are called. First pnv_pci_fixup_vf_mps() is called to store the proper MPS
   in pci_dn and then pnv_eeh_restore_vf_config() is called to do the
   actual restore.

This patch introduces two functions:
   pnv_pci_fixup_vf_mps() to fix up the PCI device's MPS to make sure it is
   smaller than its parent's and store this value in pci_dn for future use.
   pnv_eeh_restore_vf_config() to re-initialize a VF by restoring MPS,
   disabling completion timeout, enabling SERR, etc.
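The MPS restore in the diff below relies on the expression (ffs(pdn->mps) - 8) << 5. A userspace sketch of that encoding follows, assuming the MPS is stored in bytes as a power of two starting at 128. sketch_ffs() is a hypothetical stand-in for the kernel's ffs(), and SKETCH_DEVCTL_PAYLOAD mirrors the 0x00e0 mask of PCI_EXP_DEVCTL_PAYLOAD (bits 7:5 of Device Control).

```c
#define SKETCH_DEVCTL_PAYLOAD	0x00e0	/* bits 7:5 of Device Control */

/* Stand-in for ffs(): 1-based position of the lowest set bit, 0 if none. */
static int sketch_ffs(unsigned int v)
{
	int pos = 1;

	if (!v)
		return 0;
	while (!(v & 1u)) {
		v >>= 1;
		pos++;
	}
	return pos;
}

/* Encode an MPS in bytes (128, 256, ... 4096) into the DEVCTL field. */
static unsigned int mps_to_devctl_bits(unsigned int mps_bytes)
{
	return (unsigned int)(sketch_ffs(mps_bytes) - 8) << 5;
}

/* Clear the old payload field and install the saved MPS. */
static unsigned int devctl_set_mps(unsigned int devctl, unsigned int mps_bytes)
{
	devctl &= ~SKETCH_DEVCTL_PAYLOAD;
	devctl |= mps_to_devctl_bits(mps_bytes);
	return devctl;
}
```

For example, 128 bytes encodes as 0x00, 256 bytes as 0x20, and 4096 bytes as 0xa0, all of which stay inside the 0x00e0 mask.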

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h|  1 +
 arch/powerpc/platforms/powernv/eeh-powernv.c | 70 +++-
 arch/powerpc/platforms/powernv/pci.c | 18 +++
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 843dd3a2..9b365d6 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -219,6 +219,7 @@ struct pci_dn {
 #define IODA_INVALID_M64(-1)
int (*m64_map)[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
+   int mps;/* Maximum Payload Size */
 #endif
struct list_head child_list;
struct list_head list;
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 4de247a..9019458 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1623,6 +1623,67 @@ static int pnv_eeh_next_error(struct eeh_pe **pe)
return ret;
 }
 
+static int pnv_eeh_restore_vf_config(struct pci_dn *pdn)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   u32 devctl, cmd, cap2, aer_capctl;
+   int old_mps;
+
+   /* Restore MPS */
+   if (edev->pcie_cap) {
+   old_mps = (ffs(pdn->mps) - 8) << 5;
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+2, &devctl);
+   devctl &= ~PCI_EXP_DEVCTL_PAYLOAD;
+   devctl |= old_mps;
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+ 2, devctl);
+   }
+
+   /* Disable Completion Timeout */
+   if (edev->pcie_cap) {
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCAP2,
+4, &cap2);
+   if (cap2 & 0x10) {
+   eeh_ops->read_config(pdn,
+   edev->pcie_cap + PCI_EXP_DEVCTL2,
+   4, &cap2);
+   cap2 |= 0x10;
+   eeh_ops->write_config(pdn,
+   edev->pcie_cap + PCI_EXP_DEVCTL2,
+   4, cap2);
+   }
+   }
+
+   /* Enable SERR and parity checking */
+   eeh_ops->read_config(pdn, PCI_COMMAND, 2, &cmd);
+   cmd |= (PCI_COMMAND_PARITY | PCI_COMMAND_SERR);
+   eeh_ops->write_config(pdn, PCI_COMMAND, 2, cmd);
+
+   /* Enable report various errors */
+   if (edev->pcie_cap) {
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+   2, &devctl);
+   devctl &= ~PCI_EXP_DEVCTL_CERE;
+   devctl |= (PCI_EXP_DEVCTL_NFERE |
+  PCI_EXP_DEVCTL_FERE |
+  PCI_EXP_DEVCTL_URRE);
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+   2, devctl);
+   }
+
+   /* Enable ECRC generation and check */
+   if (edev->pcie_cap && edev->aer_cap) {
+   eeh_ops->read_config(pdn, edev->aer_cap + PCI_ERR_CAP,
+   4, &aer_capctl);
+   aer_capctl |= (PCI_ERR_CAP_ECRC_GENE | PCI_ERR_CAP_ECRC_CHKE);
+   eeh_ops->write_config(pdn, edev->aer_cap + PCI_ERR_CAP,
+   4, aer_capctl);
+   }
+
+   return 0;
+}
+
 static int pnv_eeh_restore_confi

[PATCH V13 9/9] powerpc/eeh: Support error recovery for VF PE

2015-11-07 Thread Wei Yang
PFs are enumerated on the PCI bus, while VFs are created by the PF's driver.

In EEH recovery, there are two cases:
1. Device and driver are EEH aware; error handlers are called.
2. Device and driver are not EEH aware; un-plug the device and plug it again
by enumerating it.

Special handling is needed in the second case. For a PF, we could use the
original pci core to enumerate the bus, while for VFs we need to record the
VFs which are un-plugged and then plug them again.

Also, the patch caches the VF index in pci_dn, which can be used to
calculate the VF's bus, device and function number. That information helps to
locate the VF's PCI device instance when doing hotplug during EEH recovery
if necessary.
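As a rough sketch of how a cached VF index can be turned back into a bus and devfn: the arithmetic below follows the SR-IOV rule that a VF's routing ID is the PF's routing ID plus First VF Offset plus VF Stride times the index, which is what pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() are understood to compute. struct sketch_pf and both helpers are hypothetical names for this illustration only.

```c
/* Hypothetical, minimal view of the PF fields the calculation needs. */
struct sketch_pf {
	unsigned char bus;	/* PF bus number */
	unsigned char devfn;	/* PF device/function number */
	unsigned short offset;	/* First VF Offset */
	unsigned short stride;	/* VF Stride */
};

/* Bus number of VF number vf_index; the routing ID may carry into the bus. */
static unsigned int vf_bus(const struct sketch_pf *pf, int vf_index)
{
	return pf->bus + ((pf->devfn + pf->offset + pf->stride * vf_index) >> 8);
}

/* Device/function number of VF number vf_index. */
static unsigned int vf_devfn(const struct sketch_pf *pf, int vf_index)
{
	return (pf->devfn + pf->offset + pf->stride * vf_index) & 0xff;
}
```

With a PF at bus 1, devfn 0, offset 0x10 and stride 1, VF 0 lands at bus 1, devfn 0x10, while VF 240 carries over to bus 2, devfn 0.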

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h|   2 +
 arch/powerpc/include/asm/pci-bridge.h |   1 +
 arch/powerpc/kernel/eeh.c |   8 +++
 arch/powerpc/kernel/eeh_dev.c |   1 +
 arch/powerpc/kernel/eeh_driver.c  | 132 +++---
 arch/powerpc/kernel/eeh_pe.c  |   3 +-
 arch/powerpc/kernel/pci_dn.c  |   4 +-
 7 files changed, 123 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 331c856..4448433 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -139,9 +139,11 @@ struct eeh_dev {
int af_cap; /* Saved AF capability  */
struct eeh_pe *pe;  /* Associated PE*/
struct list_head list;  /* Form link list in the PE */
+   struct list_head rmv_list;  /* Record the removed edev  */
struct pci_controller *phb; /* Associated PHB   */
struct pci_dn *pdn; /* Associated PCI device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
+   bool   in_error;/* Error flag for eeh_dev   */
struct pci_dev *physfn; /* Associated PF PORT   */
struct pci_bus *bus;/* PCI bus for partial hotplug  */
 };
diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 9b365d6..533e6e9 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -211,6 +211,7 @@ struct pci_dn {
 #define IODA_INVALID_PE(-1)
 #ifdef CONFIG_PPC_POWERNV
int pe_number;
+   int vf_index;   /* VF index in the PF */
 #ifdef CONFIG_PCI_IOV
u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
u16 num_vfs;/* number of VFs enabled*/
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 41a4b30..99c961a 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1245,6 +1245,14 @@ void eeh_remove_device(struct pci_dev *dev)
 * from the parent PE during the BAR resotre.
 */
edev->pdev = NULL;
+
+   /*
+* The flag "in_error" is used to trace EEH devices for VFs
+* in error state or not. It's set in eeh_report_error(). If
+* it's not set, eeh_report_{reset,resume}() won't be called
+* for the VF EEH device.
+*/
+   edev->in_error = false;
dev->dev.archdata.edev = NULL;
if (!(edev->pe->state & EEH_PE_KEEP))
eeh_rmv_from_parent_pe(edev);
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index aabba94..7815095 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -67,6 +67,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
edev->pdn = pdn;
edev->phb = phb;
INIT_LIST_HEAD(&edev->list);
+   INIT_LIST_HEAD(&edev->rmv_list);
 
return NULL;
 }
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 89eb4bc..f25428a 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -34,6 +34,11 @@
 #include 
 #include 
 
+struct eeh_rmv_data {
+   struct list_head edev_list;
+   int removed;
+};
+
 /**
  * eeh_pcid_name - Retrieve name of PCI device driver
  * @pdev: PCI device
@@ -211,6 +216,7 @@ static void *eeh_report_error(void *data, void *userdata)
if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
if (*res == PCI_ERS_RESULT_NONE) *res = rc;
 
+   edev->in_error = true;
eeh_pcid_put(dev);
return NULL;
 }
@@ -282,7 +288,8 @@ static void *eeh_report_reset(void *data, void *userdata)
 
if (!driver->err_handler ||
!driver->err_handler->slot_reset ||
-   (edev->mode & EEH_DEV_NO_HANDLER)) {
+   (edev->mode & EEH_DEV_NO_HANDLER) ||
+   (!edev->in_error)) {
eeh_pcid_put(dev);
return NULL;
}
@@ -326,6 +333,7 @@ static v

[PATCH V13 3/9] powerpc/pci: Remove VFs prior to PF

2015-11-07 Thread Wei Yang
As commit ac205b7bb72f ("PCI: make sriov work with hotplug remove")
indicates, VFs which are on the same PCI bus as their PF should be removed
before the PF. Otherwise, hot unplugging the PCI devices on that PCI bus
would cause a kernel crash.

The patch applies the above pattern to PowerPC PCI hotplug path.
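A toy model of the ordering argument, assuming VFs are added to the bus list after their PF and so sit behind it. struct sketch_dev and vfs_removed_before_pf() are hypothetical; the function only checks whether a forward or reverse walk removes every VF ahead of the PF.

```c
#include <stddef.h>

/* Hypothetical device entry; devs[] is kept in the order devices were added. */
struct sketch_dev {
	const char *name;
	int is_vf;
};

/*
 * Walk the array head-first (reverse == 0) or tail-first (reverse != 0)
 * and return 1 if no VF is removed after the PF has already gone.
 */
static int vfs_removed_before_pf(const struct sketch_dev *devs, size_t n,
				 int reverse)
{
	int pf_removed = 0;
	size_t k;

	for (k = 0; k < n; k++) {
		const struct sketch_dev *d = reverse ? &devs[n - 1 - k] : &devs[k];

		if (!d->is_vf)
			pf_removed = 1;
		else if (pf_removed)
			return 0;	/* a VF outlived its PF */
	}
	return 1;
}
```

For a list of one PF followed by two VFs, the tail-first walk satisfies the rule and the head-first walk does not.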

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-hotplug.c 
b/arch/powerpc/kernel/pci-hotplug.c
index 7f9ed0c..59c4361 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -55,7 +55,7 @@ void pcibios_remove_pci_devices(struct pci_bus *bus)
 
pr_debug("PCI: Removing devices on bus %04x:%02x\n",
 pci_domain_nr(bus),  bus->number);
-   list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list) {
+   list_for_each_entry_safe_reverse(dev, tmp, &bus->devices, bus_list) {
pr_debug("   Removing %s...\n", pci_name(dev));
pci_stop_and_remove_bus_device(dev);
}
-- 
2.5.0

[PATCH V13 7/9] powerpc/powernv: Support EEH reset for VF PE

2015-11-07 Thread Wei Yang
PEs for VFs don't have a primary bus. So they have to have their own reset
backend, which is used during EEH recovery. The patch implements the reset
backend for a VF's PE by issuing FLR or AF FLR to the VFs contained in the
PE.
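The pnv_eeh_wait_for_pending() helper in the diff below polls the pending bit up to four times, sleeping (1 << i) * 100 ms after each poll that still sees it set. The sketch below only models that timing; flr_wait_budget_ms() and its polls_until_clear parameter are hypothetical and stand in for the config-space reads.

```c
/*
 * Model of the backoff loop: polls_until_clear is the number of polls that
 * still see the pending bit set. Returns the total time slept in ms, or -1
 * if the bit is still set after all four polls (the pr_warn() case).
 */
static int flr_wait_budget_ms(int polls_until_clear)
{
	int i, slept = 0;

	for (i = 0; i < 4; i++) {
		if (i >= polls_until_clear)
			return slept;		/* bit cleared on this poll */
		slept += (1 << i) * 100;	/* 100, 200, 400, 800 ms */
	}
	return -1;
}
```

A device that clears the bit immediately costs no sleep, one that needs three failed polls costs 700 ms, and the warning path is reached after 1500 ms of sleeping.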

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |   1 +
 arch/powerpc/kernel/eeh.c|   9 +-
 arch/powerpc/platforms/powernv/eeh-powernv.c | 133 ++-
 3 files changed, 139 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index ec21f8f..331c856 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -136,6 +136,7 @@ struct eeh_dev {
int pcix_cap;   /* Saved PCIx capability*/
int pcie_cap;   /* Saved PCIe capability*/
int aer_cap;/* Saved AER capability */
+   int af_cap; /* Saved AF capability  */
struct eeh_pe *pe;  /* Associated PE*/
struct list_head list;  /* Form link list in the PE */
struct pci_controller *phb; /* Associated PHB   */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index e968533..41a4b30 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -760,7 +760,8 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum 
pcie_reset_state stat
case pcie_deassert_reset:
eeh_ops->reset(pe, EEH_RESET_DEACTIVATE);
eeh_unfreeze_pe(pe, false);
-   eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
break;
@@ -768,14 +769,16 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, 
enum pcie_reset_state stat
eeh_pe_state_mark_with_cfg(pe, EEH_PE_ISOLATED);
eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
-   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops->reset(pe, EEH_RESET_HOT);
break;
case pcie_warm_reset:
eeh_pe_state_mark_with_cfg(pe, EEH_PE_ISOLATED);
eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
-   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops->reset(pe, EEH_RESET_FUNDAMENTAL);
break;
default:
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 131c7d0..4de247a 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -404,6 +404,7 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
edev->pcix_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_PCIX);
edev->pcie_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_EXP);
edev->aer_cap  = pnv_eeh_find_ecap(pdn, PCI_EXT_CAP_ID_ERR);
+   edev->af_cap   = pnv_eeh_find_cap(pdn, PCI_CAP_ID_AF);
if ((edev->class_code >> 8) == PCI_CLASS_BRIDGE_PCI) {
edev->mode |= EEH_DEV_BRIDGE;
if (edev->pcie_cap) {
@@ -893,6 +894,126 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int 
option)
return 0;
 }
 
+static void pnv_eeh_wait_for_pending(struct pci_dn *pdn, int pos,
+u16 mask, const char *reset_type)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   int i, status = 0;
+
+   /* Wait for Transaction Pending bit to be cleared */
+   for (i = 0; i < 4; i++) {
+   eeh_ops->read_config(pdn, pos, 2, &status);
+   if (!(status & mask))
+   return;
+
+   msleep((1 << i) * 100);
+   }
+
+   pr_warn("%s: Pending transaction while issuing %s FLR to 
%04x:%02x:%02x.%01x\n",
+   __func__, reset_type,
+   edev->phb->global_number, pdn->busno,
+   PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
+}
+
+static int pnv_eeh_do_flr(struct pci_dn *pdn, int option)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   u32 reg = 0;
+
+   if (WARN_ON(!edev->pcie_cap))
+   return -ENOTTY;
+
+   eeh_ops->read_config

[PATCH V13 2/9] PCI: Add pcibios_bus_add_device() weak function

2015-11-07 Thread Wei Yang
Add a weak function pcibios_bus_add_device() so that arch-dependent code can
do proper setup. For example, powerpc could set up EEH related resources.
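For readers unfamiliar with the pattern, a minimal userspace sketch of a weak hook follows. It assumes a GCC or Clang toolchain on an ELF target, where __attribute__((weak)) plays the role of the kernel's __weak. All names here are hypothetical; a strong definition of sketch_bus_add_device_hook() in another object file would replace the empty default at link time.

```c
static int setup_calls;

/* Weak default: a no-op unless another object provides a strong definition. */
__attribute__((weak)) void sketch_bus_add_device_hook(int devfn)
{
	(void)devfn;
}

/* Generic path: give the arch hook a chance first, then do common setup. */
static void sketch_bus_add_device(int devfn)
{
	sketch_bus_add_device_hook(devfn);
	setup_calls++;		/* stand-in for the generic work */
}
```

The generic caller never needs an #ifdef: it calls the hook unconditionally and gets the no-op unless an architecture overrides it.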

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/bus.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index d3346d2..2b8b756 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -269,6 +269,7 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx)
 
 void __weak pcibios_resource_survey_bus(struct pci_bus *bus) { }
 
+void __weak pcibios_bus_add_device(struct pci_dev *dev) { }
 /**
  * pci_bus_add_device - start driver for a single device
  * @dev: device to add
@@ -279,6 +280,8 @@ void pci_bus_add_device(struct pci_dev *dev)
 {
int retval;
 
+   pcibios_bus_add_device(dev);
+
/*
 * Can not put in pci_device_add yet because resources
 * are not assigned yet for some devices.
-- 
2.5.0

[PATCH V13 1/9] PCI/IOV: Rename and export virtfn_add/virtfn_remove

2015-11-07 Thread Wei Yang
During EEH recovery, hotplug is applied to the devices which don't
have drivers or whose drivers don't support EEH. However, the hotplug,
which was implemented based on PCI bus, can't be applied to VFs directly.

Rename virtfn_{add,remove}() and export them so they can be used in PCI
hotplug during EEH recovery.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/iov.c   | 10 +-
 include/linux/pci.h |  8 
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..cc941dd 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -108,7 +108,7 @@ resource_size_t pci_iov_resource_size(struct pci_dev *dev, 
int resno)
return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
 }
 
-static int virtfn_add(struct pci_dev *dev, int id, int reset)
+int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset)
 {
int i;
int rc = -ENOMEM;
@@ -183,7 +183,7 @@ failed:
return rc;
 }
 
-static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int reset)
 {
char buf[VIRTFN_ID_LEN];
struct pci_dev *virtfn;
@@ -320,7 +320,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
}
 
for (i = 0; i < initial; i++) {
-   rc = virtfn_add(dev, i, 0);
+   rc = pci_iov_virtfn_add(dev, i, 0);
if (rc)
goto failed;
}
@@ -332,7 +332,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 
 failed:
for (j = 0; j < i; j++)
-   virtfn_remove(dev, j, 0);
+   pci_iov_virtfn_remove(dev, j, 0);
 
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
@@ -361,7 +361,7 @@ static void sriov_disable(struct pci_dev *dev)
return;
 
for (i = 0; i < iov->num_VFs; i++)
-   virtfn_remove(dev, i, 0);
+   pci_iov_virtfn_remove(dev, i, 0);
 
pcibios_sriov_disable(dev);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e90eb22..3628a09 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1719,6 +1719,8 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
 
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
+int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset);
+void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int reset);
 int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
@@ -1736,6 +1738,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev 
*dev, int id)
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
+static inline int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+   return -ENOSYS;
+}
+static inline void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int 
reset)
+{ }
 static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
 static inline int pci_vfs_assigned(struct pci_dev *dev)
 { return 0; }
-- 
2.5.0

[PATCH V13 0/9] VF EEH on Power8

2015-11-07 Thread Wei Yang
This patchset enables EEH on SRIOV VFs. The general idea is to create a
proper VF edev and VF PE and handle them properly.

Unlike a Bus PE, a VF PE contains just one VF. This makes EEH error handling
on a VF PE differ in several ways.

First, the VF's removal and re-enumeration rely on its PF. A VF is tightly
coupled to its PF, so it is not proper to enumerate a VF by the usual scan
procedure. That's why virtfn_add/virtfn_remove are exported in this patch
set.

Second, the reset/restore of a VF is done in kernel space. FW is not aware of
the VF, which means the usual reset function done in FW will not work. One of
the patches imitates the reset/restore function in kernel space.

Third, the VF may be removed during the PF's error_detected function. In this
case, the original error_detected->slot_reset->resume sequence is not proper
for those removed VFs, since they are re-created by the PF in a fresh state. A
flag in eeh_dev is introduced to mark that the eeh_dev is in error state. By
doing so, we track whether this device needs to be reset or not.
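The third point can be sketched as a small state model. struct sketch_edev and the three helpers are hypothetical; they only mirror the idea that the flag is set when the error is reported, cleared when the device is removed, and checked before the reset callback is invoked.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of an EEH device for this illustration. */
struct sketch_edev {
	bool in_error;		/* set once error_detected was reported */
	bool has_handler;	/* driver provides a slot_reset handler */
};

/* Error reported to the driver: remember that this device saw the error. */
static void report_error(struct sketch_edev *e)
{
	e->in_error = true;
}

/* Device removed (e.g. VF torn down by its PF): it comes back fresh. */
static void remove_device(struct sketch_edev *e)
{
	e->in_error = false;
}

/* Only devices that went through error_detected get slot_reset. */
static bool should_call_slot_reset(const struct sketch_edev *e)
{
	return e->has_handler && e->in_error;
}
```

A VF that was removed and re-created between the error report and the reset step therefore skips the slot_reset and resume callbacks.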

This has been tested both on host and in guest on Power8 with the latest
kernel version.

v13:
   * move eeh_rmv_data{} to eeh_driver.c
v12:
   * Rephrase some commit log to make it more clear and specific
   * move vf_index assignment in CONFIG_PPC_POWERNV
   * merge "Cache VF index in pci_dn" with "Support error recovery for VF PE"
   * check the return value after eeh_dev_init() for VF
   * initialize the parameter before pass to read_config()
   * make pnv_pci_fixup_vf_mps() a dedicated patch, which fixup and store mps
 value in pci_dn
v11:
   * move vf_index assignment in macro CONFIG_PPC_POWERNV
   * merge Patch "Cache VF index in pci_dn" into Patch "Support error recovery
 for VF PE" 
v10:
   * rebased on v4.2
   * delete the last patch "powerpc/powernv: compound PE for VFs" since after
 redesign of SRIOV, there is no compound PE for VFs now.
   * add two patches which fix problems found during tests
 powerpc/eeh: Support error recovery for VF PE
 powerpc/eeh: Handle hot removed VF when PF is EEH aware
v9:
   * split pcibios_bus_add_device() into a separate patch
   * Bjorn acked the PCI part and agreed this patch set to be merged from ppc
 tree
   * rebased on mpe/linux.git next branch
v8:
   * fix on checking the return value of pnv_eeh_do_flr()
   * introduced a weak function pcibios_bus_add_device() to create PE for VFs
v7:
   * fix compile error when PCI_IOV is not set
v6:
   * code / commit log refactor by Gavin
v5:
   * remove the compound field, iterate on Master VF PE instead
   * some code refine on PCI config restore and reset on VF
 the wait time for assert and deassert
 PCI device address format
 check on edev->pcie_cap and edev->aer_cap before access them
v4:
   * refine the change logs, comment and code style
   * change pnv_pci_fixup_vf_eeh() to pnv_eeh_vf_final_fixup() and remove the
 CONFIG_PCI_IOV macro
   * reorder patch 5/6 to make the logic more reasonable
   * remove remove_dev_pci_data()
   * remove the EEH_DEV_VF flag, use edev->physfn to identify a VF EEH DEV and
 remove related CONFIG_PCI_IOV macro
   * add the option for VF reset
   * fix the pnv_eeh_cfg_blocked() logic
   * replace pnv_pci_cfg_{read,write} with eeh_ops->{read,write}_config in
 pnv_eeh_vf_restore_config()
   * rename pnv_eeh_vf_restore_config() to pnv_eeh_restore_vf_config()
   * rename pnv_pci_fixup_vf_caps() to pnv_pci_vf_header_fixup() and move it
 to arch/powerpc/platforms/powernv/pci.c
   * add a field compound in pnv_ioda_pe to link compound PEs
   * handle compound PE for VF PEs
v3:
   * add back vf_index in pci_dn to track the VF's index
   * rename ppdev in eeh_dev to physfn for consistency
   * move edev->physfn assignment before dev->dev.archdata.edev is set
   * move pnv_pci_fixup_vf_eeh() and pnv_pci_fixup_vf_caps() to eeh-powernv.c
   * more clear and detail in commit log and comment in code
   * merge eeh_rmv_virt_device() with eeh_rmv_device()
   * move the cfg_blocked check logic from pnv_eeh_read/write_config() to
 pnv_eeh_cfg_blocked()
   * move the vf reset/restore logic into its own patch, two patches are
 created.
 powerpc/powernv: Support PCI config restore for VFs
 powerpc/powernv: Support EEH reset for VFs
   * simplify the vf reset logic
v2:
   * add prefix pci_iov_ to virtfn_add/virtfn_remove
   * use EEH_DEV_VF as a flag for a VF's eeh_dev
   * use eeh_dev instead of edev in change log
   * remove vf_index in eeh_dev, calculate it from pdn->busno and devfn
   * do eeh_add_device_late() and eeh_sysfs_add_device() both after pci_dev is
 well initialized
   * do FLR to reset a VF PE
   * imitate the restore function in FW for VF
   * remove the reverse order patch, since it is still under discussion


W

[PATCH V13 6/9] powerpc/eeh: Create PE for VFs

2015-11-07 Thread Wei Yang
The patch creates PEs for VFs in the weak function
pcibios_bus_add_device(). Those PEs are identified by the newly
introduced flag EEH_PE_VF so that they can be handled differently during
EEH recovery.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |  1 +
 arch/powerpc/kernel/eeh_pe.c | 10 --
 arch/powerpc/platforms/powernv/eeh-powernv.c | 16 
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 6c383ad..ec21f8f 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -72,6 +72,7 @@ struct pci_dn;
 #define EEH_PE_PHB (1 << 1)/* PHB PE*/
 #define EEH_PE_DEVICE  (1 << 2)/* Device PE */
 #define EEH_PE_BUS (1 << 3)/* Bus PE*/
+#define EEH_PE_VF  (1 << 4)/* VF PE */
 
 #define EEH_PE_ISOLATED (1 << 0)/* Isolated PE */
 #define EEH_PE_RECOVERING  (1 << 1)/* Recovering PE*/
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 8654cb1..29240ad 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -299,7 +299,10 @@ static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev *edev)
 * EEH device already having associated PE, but
 * the direct parent EEH device doesn't have yet.
 */
-   pdn = pdn ? pdn->parent : NULL;
+   if (edev->physfn)
+   pdn = pci_get_pdn(edev->physfn);
+   else
+   pdn = pdn ? pdn->parent : NULL;
while (pdn) {
/* We're poking out of PCI territory */
parent = pdn_to_eeh_dev(pdn);
@@ -382,7 +385,10 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
}
 
/* Create a new EEH PE */
-   pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
+   if (edev->physfn)
+   pe = eeh_pe_alloc(edev->phb, EEH_PE_VF);
+   else
+   pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
if (!pe) {
pr_err("%s: out of memory!\n", __func__);
return -ENOMEM;
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 3bb6acb..131c7d0 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1532,6 +1532,22 @@ static struct eeh_ops pnv_eeh_ops = {
.restore_config = pnv_eeh_restore_config
 };
 
+void pcibios_bus_add_device(struct pci_dev *pdev)
+{
+   struct pci_dn *pdn = pci_get_pdn(pdev);
+
+   if (!pdev->is_virtfn)
+   return;
+
+   /*
+* The following operations will fail if VF's sysfs files
+* aren't created or its resources aren't finalized.
+*/
+   eeh_add_device_early(pdn);
+   eeh_add_device_late(pdev);
+   eeh_sysfs_add_device(pdev);
+}
+
 /**
  * eeh_powernv_init - Register platform dependent EEH operations
  *
-- 
2.5.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

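For readers following the series, the PE-type choice made in
eeh_add_to_parent_pe() in the patch above can be sketched in userspace C.
This is only an illustration: struct sketch_edev and sketch_pe_type() are
hypothetical names, not kernel symbols; the two flag values are copied from
the asm/eeh.h hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as they appear in the patch (asm/eeh.h). */
#define EEH_PE_DEVICE	(1 << 2)	/* Device PE */
#define EEH_PE_VF	(1 << 4)	/* VF PE */

/* Hypothetical stand-in for struct eeh_dev; only physfn matters here. */
struct sketch_edev {
	const void *physfn;	/* non-NULL only for a VF's eeh_dev */
};

/* Pick the type handed to eeh_pe_alloc(): a VF is recognised by physfn. */
static int sketch_pe_type(const struct sketch_edev *edev)
{
	assert(edev != NULL);
	return edev->physfn ? EEH_PE_VF : EEH_PE_DEVICE;
}
```

Because the VF flag is a separate bit, later code can test it with a plain
mask such as (pe->type & EEH_PE_VF), which is what patch 7/9 does.
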
[PATCH V12 1/9] PCI/IOV: Rename and export virtfn_add/virtfn_remove

2015-11-03 Thread Wei Yang
During EEH recovery, hotplug is applied to devices that have no driver
or whose driver doesn't support EEH. However, that hotplug path is
implemented on top of the PCI bus, so it can't be applied to VFs directly.

Rename virtfn_{add,remove}() to pci_iov_virtfn_{add,remove}() and export
them so they can be used in PCI hotplug during EEH recovery.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/iov.c   | 10 +-
 include/linux/pci.h |  8 
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..cc941dd 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -108,7 +108,7 @@ resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
 }
 
-static int virtfn_add(struct pci_dev *dev, int id, int reset)
+int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset)
 {
int i;
int rc = -ENOMEM;
@@ -183,7 +183,7 @@ failed:
return rc;
 }
 
-static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int reset)
 {
char buf[VIRTFN_ID_LEN];
struct pci_dev *virtfn;
@@ -320,7 +320,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
}
 
for (i = 0; i < initial; i++) {
-   rc = virtfn_add(dev, i, 0);
+   rc = pci_iov_virtfn_add(dev, i, 0);
if (rc)
goto failed;
}
@@ -332,7 +332,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 
 failed:
for (j = 0; j < i; j++)
-   virtfn_remove(dev, j, 0);
+   pci_iov_virtfn_remove(dev, j, 0);
 
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
@@ -361,7 +361,7 @@ static void sriov_disable(struct pci_dev *dev)
return;
 
for (i = 0; i < iov->num_VFs; i++)
-   virtfn_remove(dev, i, 0);
+   pci_iov_virtfn_remove(dev, i, 0);
 
pcibios_sriov_disable(dev);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e90eb22..3628a09 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1719,6 +1719,8 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
 
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
+int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset);
+void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int reset);
 int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
@@ -1736,6 +1738,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
+static inline int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+   return -ENOSYS;
+}
+static inline void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int reset)
+{ }
 static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
 static inline int pci_vfs_assigned(struct pci_dev *dev)
 { return 0; }
-- 
2.5.0


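The include/linux/pci.h hunk in the patch above follows the usual kernel
pattern of pairing a real prototype with a static inline stub for builds
where the feature is compiled out. A minimal userspace sketch of that
pattern, assuming a made-up SKETCH_PCI_IOV switch in place of
CONFIG_PCI_IOV and made-up sketch_* names:

```c
#include <assert.h>
#include <errno.h>

/* SKETCH_PCI_IOV stands in for CONFIG_PCI_IOV; it is not a real macro. */
#ifdef SKETCH_PCI_IOV
int sketch_virtfn_add(int id);		/* real implementation elsewhere */
#else
/* Feature compiled out: callers still build, but get -ENOSYS back. */
static inline int sketch_virtfn_add(int id)
{
	(void)id;
	return -ENOSYS;
}
#endif

/* A caller needs no #ifdef of its own; it only checks the return code. */
static int sketch_readd_vf(int id)
{
	assert(id >= 0);
	return sketch_virtfn_add(id);
}
```

With the switch undefined, sketch_readd_vf() simply propagates -ENOSYS,
which mirrors what an EEH caller would see on a kernel without SR-IOV.
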
[PATCH V12 0/9] VF EEH on Power8

2015-11-03 Thread Wei Yang
This patchset enables EEH on SRIOV VFs. The general idea is to create a
proper VF edev and VF PE and handle them properly.

Unlike a Bus PE, a VF PE contains just one VF. This makes EEH error
handling on a VF PE differ in several ways.

First, a VF's removal and re-enumeration rely on its PF. A VF is tightly
coupled to its PF, so it is not proper to enumerate a VF with the usual
scan procedure. That's why virtfn_add/virtfn_remove are exported in this
patch set.

Second, the reset/restore of a VF is done in kernel space. FW is not aware
of the VF, which means the usual reset function done in FW will not work.
One of the patches imitates the reset/restore function in kernel space.

Third, the VF may be removed during the PF's error_detected function. In
this case, the original error_detected->slot_reset->resume sequence is not
proper for those removed VFs, since they are re-created by the PF in a
fresh state. A flag is introduced in eeh_dev to mark that the eeh_dev is
in error state. By doing so, we track whether this device needs to be
reset or not.

This has been tested both on the host and in a guest on Power8 with the
latest kernel version.

v12:
   * rebased on v4.3
   * Rephrase some commit log to make it more clear and specific
   * move vf_index assignment in CONFIG_PPC_POWERNV
   * merge "Cache VF index in pci_dn" with "Support error recovery for VF PE"
   * check the return value after eeh_dev_init() for VF
   * initialize the parameter before pass to read_config()
   * make pnv_pci_fixup_vf_mps() a dedicated patch, which fixup and store mps
 value in pci_dn
v11:
   * move vf_index assignment in macro CONFIG_PPC_POWERNV
   * merge Patch "Cache VF index in pci_dn" into Patch "Support error recovery
 for VF PE" 
v10:
   * rebased on v4.2
   * delete the last patch "powerpc/powernv: compound PE for VFs" since after
 redesign of SRIOV, there is no compound PE for VFs now.
   * add two patches which fix problems found during tests
 powerpc/eeh: Support error recovery for VF PE
 powerpc/eeh: Handle hot removed VF when PF is EEH aware
v9:
   * split pcibios_bus_add_device() into a separate patch
   * Bjorn acked the PCI part and agreed this patch set to be merged from ppc
 tree
   * rebased on mpe/linux.git next branch
v8:
   * fix on checking the return value of pnv_eeh_do_flr()
   * introduced a weak function pcibios_bus_add_device() to create PE for VFs
v7:
   * fix compile error when PCI_IOV is not set
v6:
   * code / commit log refactor by Gavin
v5:
   * remove the compound field, iterate on Master VF PE instead
   * some code refine on PCI config restore and reset on VF
 the wait time for assert and deassert
 PCI device address format
 check on edev->pcie_cap and edev->aer_cap before access them
v4:
   * refine the change logs, comment and code style
   * change pnv_pci_fixup_vf_eeh() to pnv_eeh_vf_final_fixup() and remove the
 CONFIG_PCI_IOV macro
   * reorder patch 5/6 to make the logic more reasonable
   * remove remove_dev_pci_data()
   * remove the EEH_DEV_VF flag, use edev->physfn to identify a VF EEH DEV and
 remove related CONFIG_PCI_IOV macro
   * add the option for VF reset
   * fix the pnv_eeh_cfg_blocked() logic
   * replace pnv_pci_cfg_{read,write} with eeh_ops->{read,write}_config in
 pnv_eeh_vf_restore_config()
   * rename pnv_eeh_vf_restore_config() to pnv_eeh_restore_vf_config()
   * rename pnv_pci_fixup_vf_caps() to pnv_pci_vf_header_fixup() and move it
 to arch/powerpc/platforms/powernv/pci.c
   * add a field compound in pnv_ioda_pe to link compound PEs
   * handle compound PE for VF PEs
v3:
   * add back vf_index in pci_dn to track the VF's index
   * rename ppdev in eeh_dev to physfn for consistency
   * move edev->physfn assignment before dev->dev.archdata.edev is set
   * move pnv_pci_fixup_vf_eeh() and pnv_pci_fixup_vf_caps() to eeh-powernv.c
   * more clear and detail in commit log and comment in code
   * merge eeh_rmv_virt_device() with eeh_rmv_device()
   * move the cfg_blocked check logic from pnv_eeh_read/write_config() to
 pnv_eeh_cfg_blocked()
   * move the vf reset/restore logic into its own patch, two patches are
 created.
 powerpc/powernv: Support PCI config restore for VFs
 powerpc/powernv: Support EEH reset for VFs
   * simplify the vf reset logic
v2:
   * add prefix pci_iov_ to virtfn_add/virtfn_remove
   * use EEH_DEV_VF as a flag for a VF's eeh_dev
   * use eeh_dev instead of edev in change log
   * remove vf_index in eeh_dev, calculate it from pdn->busno and devfn
   * do eeh_add_device_late() and eeh_sysfs_add_device() both after pci_dev is
 well initialized
   * do FLR to reset a VF PE
   * imitate the restore function in FW for VF
   * remove the reverse order patch, since it is still under discussion

Wei Yang (9):
  PCI/IOV: Rename and expo

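The third point of the cover letter, the per-device error flag, can be
pictured with a small userspace model. It is a sketch under assumptions:
struct sketch_edev and the sketch_* helpers are hypothetical, and the model
only captures the rule stated in the series (flag set when the error is
reported, cleared when the device is removed, reset/resume skipped when the
flag is clear).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct eeh_dev; only the new flag is modelled. */
struct sketch_edev {
	bool in_error;
};

/* error_detected stage: the device is now known to be in error state. */
static void sketch_report_error(struct sketch_edev *edev)
{
	edev->in_error = true;
}

/* Device removal (e.g. a VF torn down by its PF) clears the flag. */
static void sketch_remove_device(struct sketch_edev *edev)
{
	edev->in_error = false;
}

/* slot_reset/resume are only called for devices still marked in error. */
static bool sketch_needs_reset(const struct sketch_edev *edev)
{
	return edev->in_error;
}
```

A VF that was removed and re-created by its PF therefore comes back with
the flag clear and is left alone by the later recovery stages.
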
[PATCH V12 9/9] powerpc/eeh: Support error recovery for VF PE

2015-11-03 Thread Wei Yang
PFs are enumerated on the PCI bus, while VFs are created by the PF's driver.

EEH recovery has two cases:
1. The device and driver are EEH aware: error handlers are called.
2. The device and driver are not EEH aware: un-plug the device and plug it
again by enumerating it.

The second case needs special handling. For a PF, we can use the original
PCI core to enumerate the bus, while for VFs we need to record the VFs
which are un-plugged and then plug them again.

The patch also caches the VF index in pci_dn, which can be used to
calculate the VF's bus, device and function number. That information helps
to locate the VF's PCI device instance when doing hotplug during EEH
recovery if necessary.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h|   7 ++
 arch/powerpc/include/asm/pci-bridge.h |   1 +
 arch/powerpc/kernel/eeh.c |   8 +++
 arch/powerpc/kernel/eeh_dev.c |   1 +
 arch/powerpc/kernel/eeh_driver.c  | 127 +++---
 arch/powerpc/kernel/eeh_pe.c  |   3 +-
 arch/powerpc/kernel/pci_dn.c  |   4 +-
 7 files changed, 123 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 331c856..1f68190 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -127,6 +127,11 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
 #define EEH_DEV_SYSFS  (1 << 9)/* Sysfs created*/
 #define EEH_DEV_REMOVED (1 << 10)   /* Removed permanently */
 
+struct eeh_rmv_data {
+   struct list_head edev_list;
+   int removed;
+};
+
 struct eeh_dev {
int mode;   /* EEH mode */
int class_code; /* Class code of the device */
@@ -139,9 +144,11 @@ struct eeh_dev {
int af_cap; /* Saved AF capability  */
struct eeh_pe *pe;  /* Associated PE*/
struct list_head list;  /* Form link list in the PE */
+   struct list_head rmv_list;  /* Record the removed edev  */
struct pci_controller *phb; /* Associated PHB   */
struct pci_dn *pdn; /* Associated PCI device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
+   bool   in_error;/* Error flag for eeh_dev   */
struct pci_dev *physfn; /* Associated PF PORT   */
struct pci_bus *bus;/* PCI bus for partial hotplug  */
 };
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 9b365d6..533e6e9 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -211,6 +211,7 @@ struct pci_dn {
 #define IODA_INVALID_PE(-1)
 #ifdef CONFIG_PPC_POWERNV
int pe_number;
+   int vf_index;   /* VF index in the PF */
 #ifdef CONFIG_PCI_IOV
u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
u16 num_vfs;/* number of VFs enabled*/
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 41a4b30..0f36750 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1245,6 +1245,14 @@ void eeh_remove_device(struct pci_dev *dev)
 * from the parent PE during the BAR resotre.
 */
edev->pdev = NULL;
+
+   /*
+* The flag "in_error" is used to trace EEH devices for VFs
+* in error state or not. It's set in eeh_report_error(). If
+* it's not set, eeh_report_{reset,resume}() won't be called
+* for the VF EEH device.
+*/
+   edev->in_error = 0;
dev->dev.archdata.edev = NULL;
if (!(edev->pe->state & EEH_PE_KEEP))
eeh_rmv_from_parent_pe(edev);
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index aabba94..7815095 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -67,6 +67,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
edev->pdn = pdn;
edev->phb = phb;
INIT_LIST_HEAD(&edev->list);
+   INIT_LIST_HEAD(&edev->rmv_list);
 
return NULL;
 }
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 89eb4bc..06d20d6 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -211,6 +211,7 @@ static void *eeh_report_error(void *data, void *userdata)
if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
if (*res == PCI_ERS_RESULT_NONE) *res = rc;
 
+   edev->in_error = true;
eeh_pcid_put(dev);
return NULL;
 }
@@ -282,7 +283,8 @@ static void *eeh_report_reset(void *data, void *userdata)
 
if (!driver->err_handler ||
  

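The commit message above says the cached VF index is enough to compute the
VF's bus, device and function number. A sketch of that arithmetic,
following the SR-IOV routing-ID rule (PF devfn plus First VF Offset plus
VF Stride times the index, with the carry above 8 bits added to the bus
number). The struct and function names are hypothetical; the offset and
stride values would come from the PF's SR-IOV capability.

```c
#include <assert.h>

/* Hypothetical summary of what is needed from the PF. */
struct sketch_pf {
	unsigned int bus;	/* PF bus number */
	unsigned int devfn;	/* PF device/function byte */
	unsigned int offset;	/* First VF Offset */
	unsigned int stride;	/* VF Stride */
};

/* Routing ID of VF 'vf_index', relative to the PF's bus. */
static unsigned int sketch_vf_rid(const struct sketch_pf *pf,
				  unsigned int vf_index)
{
	return pf->devfn + pf->offset + pf->stride * vf_index;
}

/* Anything above the low 8 bits spills over into the bus number. */
static unsigned int sketch_vf_bus(const struct sketch_pf *pf,
				  unsigned int vf_index)
{
	return pf->bus + (sketch_vf_rid(pf, vf_index) >> 8);
}

static unsigned int sketch_vf_devfn(const struct sketch_pf *pf,
				    unsigned int vf_index)
{
	return sketch_vf_rid(pf, vf_index) & 0xff;
}

/* Same split as the kernel's PCI_SLOT()/PCI_FUNC() macros. */
static unsigned int sketch_slot(unsigned int devfn)
{
	return (devfn >> 3) & 0x1f;
}

static unsigned int sketch_func(unsigned int devfn)
{
	return devfn & 0x07;
}
```

For a PF at bus 1, devfn 0 with offset 16 and stride 2, VF 3 lands on bus 1
at slot 2, function 6, while VF 120 overflows onto bus 2.
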
[PATCH V12 4/9] powerpc/eeh: Cache only BARs, not windows or IOV BARs

2015-11-03 Thread Wei Yang
This restricts the EEH address cache to use only the first 7 BARs, which
makes __eeh_addr_cache_insert_dev() ignore PCI bridge windows and IOV BARs.
As a result of this change, eeh_addr_cache_get_dev() will return VFs from
VF's resource addresses instead of parent PFs.

This also removes the extra check for a PCI bridge: limiting
__eeh_addr_cache_insert_dev() to 7 BARs effectively excludes PCI bridges
from being cached.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh_cache.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
index a1e86e1..ddbcfab 100644
--- a/arch/powerpc/kernel/eeh_cache.c
+++ b/arch/powerpc/kernel/eeh_cache.c
@@ -195,8 +195,11 @@ static void __eeh_addr_cache_insert_dev(struct pci_dev *dev)
return;
}
 
-   /* Walk resources on this device, poke them into the tree */
-   for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
+   /*
+* Walk resources on this device, poke the first 7 (6 normal BAR and 1
+* ROM BAR) into the tree.
+*/
+   for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
resource_size_t start = pci_resource_start(dev,i);
resource_size_t end = pci_resource_end(dev,i);
unsigned long flags = pci_resource_flags(dev,i);
@@ -222,10 +225,6 @@ void eeh_addr_cache_insert_dev(struct pci_dev *dev)
 {
unsigned long flags;
 
-   /* Ignore PCI bridges */
-   if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
-   return;
-
spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
__eeh_addr_cache_insert_dev(dev);
spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
-- 
2.5.0


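The effect of the new loop bound in the patch above can be illustrated with
a small model of a device's resource slots. This is a sketch: the SKETCH_*
names are made up, and the slot counts after the ROM BAR (six IOV BARs,
four bridge windows) are an assumed layout for illustration only.

```c
#include <assert.h>

/*
 * Assumed resource layout: six standard BARs, one ROM BAR, then IOV BARs
 * and bridge windows. Only the first seven slots are what the patch caches.
 */
enum {
	SKETCH_STD_FIRST	= 0,
	SKETCH_STD_LAST		= 5,
	SKETCH_ROM		= 6,	/* plays the role of PCI_ROM_RESOURCE */
	SKETCH_IOV_FIRST	= 7,
	SKETCH_IOV_LAST		= 12,
	SKETCH_BRIDGE_FIRST	= 13,
	SKETCH_BRIDGE_LAST	= 16,
	SKETCH_COUNT			/* plays the role of DEVICE_COUNT_RESOURCE */
};

/* New bound: i <= ROM slot, so IOV BARs and bridge windows are skipped. */
static int sketch_is_cached(int idx)
{
	return idx >= SKETCH_STD_FIRST && idx <= SKETCH_ROM;
}

/* Count how many of all the slots the cache would still look at. */
static int sketch_count_cached(void)
{
	int i, n = 0;

	for (i = 0; i < SKETCH_COUNT; i++)
		n += sketch_is_cached(i);
	return n;
}
```

Since a PF's IOV BARs are never inserted, an address inside a VF's BAR can
only match the VF's own entry, which is the behaviour the commit message
describes for eeh_addr_cache_get_dev().
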
[PATCH V12 8/9] powerpc/powernv: Support PCI config restore for VFs

2015-11-03 Thread Wei Yang
After PE reset, OPAL API opal_pci_reinit() is called on all devices
contained in the PE to reinitialize them. Since skiboot is not aware of
VFs, we have to implement the function in the kernel to reinitialize VFs
after a reset on a VF PE.

In this patch, two functions, pnv_pci_fixup_vf_mps() and
pnv_eeh_restore_vf_config(), both manipulate the MPS of the VF, since a VF
has three cases.

1. Normal creation of a VF
   In this case, pnv_pci_fixup_vf_mps() is called to set the MPS to a
   proper value relative to its parent.
2. EEH recovery without the VF being removed
   In this case, the MPS is stored in pci_dn and
   pnv_eeh_restore_vf_config() is called to restore it and reinitialize
   the other parts.
3. EEH recovery with the VF removed
   In this case, the VF is removed and then re-created. Both functions are
   called. First pnv_pci_fixup_vf_mps() is called to store the proper MPS
   in pci_dn and then pnv_eeh_restore_vf_config() is called to perform the
   restore.

This patch introduces two functions:
   pnv_pci_fixup_vf_mps() to fix up the PCI device's MPS to make sure it is
   smaller than the parent's and store this value in pci_dn for future use.
   pnv_eeh_restore_vf_config() to re-initialize a VF by restoring the MPS,
   disabling completion timeout, enabling SERR, etc.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h|  1 +
 arch/powerpc/platforms/powernv/eeh-powernv.c | 70 +++-
 arch/powerpc/platforms/powernv/pci.c | 18 +++
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 843dd3a2..9b365d6 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -219,6 +219,7 @@ struct pci_dn {
 #define IODA_INVALID_M64(-1)
int (*m64_map)[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
+   int mps;/* Maximum Payload Size */
 #endif
struct list_head child_list;
struct list_head list;
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 4de247a..9019458 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1623,6 +1623,67 @@ static int pnv_eeh_next_error(struct eeh_pe **pe)
return ret;
 }
 
+static int pnv_eeh_restore_vf_config(struct pci_dn *pdn)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   u32 devctl, cmd, cap2, aer_capctl;
+   int old_mps;
+
+   /* Restore MPS */
+   if (edev->pcie_cap) {
+   old_mps = (ffs(pdn->mps) - 8) << 5;
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+2, &devctl);
+   devctl &= ~PCI_EXP_DEVCTL_PAYLOAD;
+   devctl |= old_mps;
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+ 2, devctl);
+   }
+
+   /* Disable Completion Timeout */
+   if (edev->pcie_cap) {
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCAP2,
+4, &cap2);
+   if (cap2 & 0x10) {
+   eeh_ops->read_config(pdn,
+   edev->pcie_cap + PCI_EXP_DEVCTL2,
+   4, &cap2);
+   cap2 |= 0x10;
+   eeh_ops->write_config(pdn,
+   edev->pcie_cap + PCI_EXP_DEVCTL2,
+   4, cap2);
+   }
+   }
+
+   /* Enable SERR and parity checking */
+   eeh_ops->read_config(pdn, PCI_COMMAND, 2, &cmd);
+   cmd |= (PCI_COMMAND_PARITY | PCI_COMMAND_SERR);
+   eeh_ops->write_config(pdn, PCI_COMMAND, 2, cmd);
+
+   /* Enable report various errors */
+   if (edev->pcie_cap) {
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+   2, &devctl);
+   devctl &= ~PCI_EXP_DEVCTL_CERE;
+   devctl |= (PCI_EXP_DEVCTL_NFERE |
+  PCI_EXP_DEVCTL_FERE |
+  PCI_EXP_DEVCTL_URRE);
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+   2, devctl);
+   }
+
+   /* Enable ECRC generation and check */
+   if (edev->pcie_cap && edev->aer_cap) {
+   eeh_ops->read_config(pdn, edev->aer_cap + PCI_ERR_CAP,
+   4, &aer_capctl);
+   aer_capctl |= (PCI_ERR_CAP_ECRC_GENE | PCI_ERR_CAP_ECRC_CHKE);
+   eeh_ops->write_config(pdn, edev->aer_cap + PCI_ERR_CAP,
+   4, aer_capctl);
+   }
+
+   return 0;
+}
+
 static int pnv_eeh_restore_confi

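The expression (ffs(pdn->mps) - 8) << 5 in pnv_eeh_restore_vf_config()
converts an MPS given in bytes into the Max_Payload_Size field of the PCIe
Device Control register, where a field value of n means 128 << n bytes and
the field occupies bits 7:5. A worked sketch of that conversion, with
hypothetical helper names and the field mask written out by hand:

```c
#include <assert.h>
#include <strings.h>	/* ffs() */

/* Bits 7:5 of Device Control; same value as PCI_EXP_DEVCTL_PAYLOAD. */
#define SKETCH_DEVCTL_PAYLOAD	0x00e0

/* Convert an MPS in bytes (128, 256, ..., 4096) to the register field. */
static unsigned int sketch_mps_to_field(int mps_bytes)
{
	/* Must be a power of two and at least 128: ffs(128) == 8. */
	assert(mps_bytes >= 128 && (mps_bytes & (mps_bytes - 1)) == 0);
	return (unsigned int)(ffs(mps_bytes) - 8) << 5;
}

/* Read-modify-write step: clear the old field, then set the new one. */
static unsigned int sketch_devctl_set_mps(unsigned int devctl, int mps_bytes)
{
	devctl &= ~SKETCH_DEVCTL_PAYLOAD;
	devctl |= sketch_mps_to_field(mps_bytes);
	return devctl;
}
```

For example, 128 bytes encodes as 0x00, 256 bytes as 0x20 and 4096 bytes as
0xa0, and the other Device Control bits pass through unchanged.
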
[PATCH V12 6/9] powerpc/eeh: Create PE for VFs

2015-11-03 Thread Wei Yang
The patch creates PEs for VFs in the weak function
pcibios_bus_add_device(). Those PEs are identified by the newly
introduced flag EEH_PE_VF so that they can be handled differently during
EEH recovery.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |  1 +
 arch/powerpc/kernel/eeh_pe.c | 10 --
 arch/powerpc/platforms/powernv/eeh-powernv.c | 16 
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 6c383ad..ec21f8f 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -72,6 +72,7 @@ struct pci_dn;
 #define EEH_PE_PHB (1 << 1)/* PHB PE*/
 #define EEH_PE_DEVICE  (1 << 2)/* Device PE */
 #define EEH_PE_BUS (1 << 3)/* Bus PE*/
+#define EEH_PE_VF  (1 << 4)/* VF PE */
 
 #define EEH_PE_ISOLATED (1 << 0)/* Isolated PE */
 #define EEH_PE_RECOVERING  (1 << 1)/* Recovering PE*/
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 8654cb1..29240ad 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -299,7 +299,10 @@ static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev *edev)
 * EEH device already having associated PE, but
 * the direct parent EEH device doesn't have yet.
 */
-   pdn = pdn ? pdn->parent : NULL;
+   if (edev->physfn)
+   pdn = pci_get_pdn(edev->physfn);
+   else
+   pdn = pdn ? pdn->parent : NULL;
while (pdn) {
/* We're poking out of PCI territory */
parent = pdn_to_eeh_dev(pdn);
@@ -382,7 +385,10 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
}
 
/* Create a new EEH PE */
-   pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
+   if (edev->physfn)
+   pe = eeh_pe_alloc(edev->phb, EEH_PE_VF);
+   else
+   pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
if (!pe) {
pr_err("%s: out of memory!\n", __func__);
return -ENOMEM;
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 3bb6acb..131c7d0 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1532,6 +1532,22 @@ static struct eeh_ops pnv_eeh_ops = {
.restore_config = pnv_eeh_restore_config
 };
 
+void pcibios_bus_add_device(struct pci_dev *pdev)
+{
+   struct pci_dn *pdn = pci_get_pdn(pdev);
+
+   if (!pdev->is_virtfn)
+   return;
+
+   /*
+* The following operations will fail if VF's sysfs files
+* aren't created or its resources aren't finalized.
+*/
+   eeh_add_device_early(pdn);
+   eeh_add_device_late(pdev);
+   eeh_sysfs_add_device(pdev);
+}
+
 /**
  * eeh_powernv_init - Register platform dependent EEH operations
  *
-- 
2.5.0


[PATCH V12 5/9] powerpc/powernv: EEH device for VF

2015-11-03 Thread Wei Yang
VFs and their corresponding pci_dn instances are created and released
dynamically as their PF's SRIOV capability is enabled and disabled.
The patch creates and releases EEH devices for VFs when creating and
releasing their pci_dn instances, which means EEH devices and pci_dn
instances have the same life cycle. Also, a VF's EEH device is identified
by (struct eeh_dev::physfn).

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h |  1 +
 arch/powerpc/kernel/pci_dn.c   | 13 +
 2 files changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index c5eb86f..6c383ad 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -140,6 +140,7 @@ struct eeh_dev {
struct pci_controller *phb; /* Associated PHB   */
struct pci_dn *pdn; /* Associated PCI device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
+   struct pci_dev *physfn; /* Associated PF PORT   */
struct pci_bus *bus;/* PCI bus for partial hotplug  */
 };
 
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index b3b4df9..5091b05 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -178,7 +178,9 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
 struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
 {
 #ifdef CONFIG_PCI_IOV
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pci_dn *parent, *pdn;
+   struct eeh_dev *edev;
int i;
 
/* Only support IOV for now */
@@ -204,6 +206,10 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
 __func__, i);
return NULL;
}
+   eeh_dev_init(pdn, hose);
+   edev = pdn_to_eeh_dev(pdn);
+   BUG_ON(!edev);
+   edev->physfn = pdev;
}
 #endif /* CONFIG_PCI_IOV */
 
@@ -252,10 +258,17 @@ void remove_dev_pci_data(struct pci_dev *pdev)
for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
list_for_each_entry_safe(pdn, tmp,
&parent->child_list, list) {
+   struct eeh_dev *edev;
if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
continue;
 
+   edev = pdn_to_eeh_dev(pdn);
+   if (edev) {
+   pdn->edev = NULL;
+   kfree(edev);
+   }
+
if (!list_empty(&pdn->list))
list_del(&pdn->list);
 
-- 
2.5.0


[PATCH V12 7/9] powerpc/powernv: Support EEH reset for VF PE

2015-11-03 Thread Wei Yang
PEs for VFs don't have a primary bus, so they need their own reset
backend, which is used during EEH recovery. The patch implements the reset
backend for a VF's PE by issuing FLR or AF FLR to the VFs contained in
the PE.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |   1 +
 arch/powerpc/kernel/eeh.c|   9 +-
 arch/powerpc/platforms/powernv/eeh-powernv.c | 133 ++-
 3 files changed, 139 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index ec21f8f..331c856 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -136,6 +136,7 @@ struct eeh_dev {
int pcix_cap;   /* Saved PCIx capability*/
int pcie_cap;   /* Saved PCIe capability*/
int aer_cap;/* Saved AER capability */
+   int af_cap; /* Saved AF capability  */
struct eeh_pe *pe;  /* Associated PE*/
struct list_head list;  /* Form link list in the PE */
struct pci_controller *phb; /* Associated PHB   */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index e968533..41a4b30 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -760,7 +760,8 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
case pcie_deassert_reset:
eeh_ops->reset(pe, EEH_RESET_DEACTIVATE);
eeh_unfreeze_pe(pe, false);
-   eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
break;
@@ -768,14 +769,16 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
eeh_pe_state_mark_with_cfg(pe, EEH_PE_ISOLATED);
eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
-   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops->reset(pe, EEH_RESET_HOT);
break;
case pcie_warm_reset:
eeh_pe_state_mark_with_cfg(pe, EEH_PE_ISOLATED);
eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
-   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops->reset(pe, EEH_RESET_FUNDAMENTAL);
break;
default:
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 131c7d0..4de247a 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -404,6 +404,7 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
edev->pcix_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_PCIX);
edev->pcie_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_EXP);
edev->aer_cap  = pnv_eeh_find_ecap(pdn, PCI_EXT_CAP_ID_ERR);
+   edev->af_cap   = pnv_eeh_find_cap(pdn, PCI_CAP_ID_AF);
if ((edev->class_code >> 8) == PCI_CLASS_BRIDGE_PCI) {
edev->mode |= EEH_DEV_BRIDGE;
if (edev->pcie_cap) {
@@ -893,6 +894,126 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
return 0;
 }
 
+static void pnv_eeh_wait_for_pending(struct pci_dn *pdn, int pos,
+u16 mask, const char *reset_type)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   int i, status = 0;
+
+   /* Wait for Transaction Pending bit to be cleared */
+   for (i = 0; i < 4; i++) {
+   eeh_ops->read_config(pdn, pos, 2, &status);
+   if (!(status & mask))
+   return;
+
+   msleep((1 << i) * 100);
+   }
+
+   pr_warn("%s: Pending transaction while issuing %s FLR to %04x:%02x:%02x.%01x\n",
+   __func__, reset_type,
+   edev->phb->global_number, pdn->busno,
+   PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
+}
+
+static int pnv_eeh_do_flr(struct pci_dn *pdn, int option)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   u32 reg = 0;
+
+   if (WARN_ON(!edev->pcie_cap))
+   return -ENOTTY;
+
+   eeh_ops->read_config

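The retry loop in pnv_eeh_wait_for_pending() above sleeps for
(1 << i) * 100 ms on each pass, so the waits double from 100 ms up to
800 ms. A small sketch of that schedule; the helper names and the
SKETCH_FLR_RETRIES macro are hypothetical, only the loop bound of 4 and the
delay formula come from the patch.

```c
#include <assert.h>

#define SKETCH_FLR_RETRIES	4	/* loop bound used in the patch */

/* Delay after check number 'attempt' (0-based): 100, 200, 400, 800 ms. */
static int sketch_flr_delay_ms(int attempt)
{
	assert(attempt >= 0 && attempt < SKETCH_FLR_RETRIES);
	return (1 << attempt) * 100;
}

/* Worst-case time spent sleeping if the pending bit never clears. */
static int sketch_flr_total_wait_ms(void)
{
	int i, total = 0;

	for (i = 0; i < SKETCH_FLR_RETRIES; i++)
		total += sketch_flr_delay_ms(i);
	return total;
}
```

In the worst case the function therefore sleeps 1500 ms in total before it
prints the "Pending transaction" warning.
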
[PATCH V12 2/9] PCI: Add pcibios_bus_add_device() weak function

2015-11-03 Thread Wei Yang
Add a weak function pcibios_bus_add_device() so that arch-dependent code
can do proper setup. For example, powerpc can set up EEH-related resources.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/bus.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index d3346d2..2b8b756 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -269,6 +269,7 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx)
 
 void __weak pcibios_resource_survey_bus(struct pci_bus *bus) { }
 
+void __weak pcibios_bus_add_device(struct pci_dev *dev) { }
 /**
  * pci_bus_add_device - start driver for a single device
  * @dev: device to add
@@ -279,6 +280,8 @@ void pci_bus_add_device(struct pci_dev *dev)
 {
int retval;
 
+   pcibios_bus_add_device(dev);
+
/*
 * Can not put in pci_device_add yet because resources
 * are not assigned yet for some devices.
-- 
2.5.0


[PATCH V12 3/9] powerpc/pci: Remove VFs prior to PF

2015-11-03 Thread Wei Yang
As commit ac205b7bb72f ("PCI: make sriov work with hotplug remove")
indicates, VFs that are on the same PCI bus as their PF should be removed
before the PF. Otherwise, hot unplugging the PCI devices on that bus
would cause a kernel crash.

The patch applies the above pattern to PowerPC PCI hotplug path.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 7f9ed0c..59c4361 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -55,7 +55,7 @@ void pcibios_remove_pci_devices(struct pci_bus *bus)
 
pr_debug("PCI: Removing devices on bus %04x:%02x\n",
 pci_domain_nr(bus),  bus->number);
-   list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list) {
+   list_for_each_entry_safe_reverse(dev, tmp, &bus->devices, bus_list) {
pr_debug("   Removing %s...\n", pci_name(dev));
pci_stop_and_remove_bus_device(dev);
}
-- 
2.5.0
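
Why walking the list in reverse removes VFs before their PF can be seen in a
small standalone sketch. It assumes that VFs are appended to the bus list
after the PF that created them; struct demo_bus, struct demo_dev and
demo_remove_all_reverse() are hypothetical names, and a plain array stands in
for the kernel's linked list.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DEVS 8

/* Hypothetical device record; is_vf marks a virtual function. */
struct demo_dev {
	int id;
	int is_vf;
};

/* Devices in the order they were added to the bus: PF first, then its VFs. */
struct demo_bus {
	struct demo_dev devs[DEMO_MAX_DEVS];
	size_t count;
};

/*
 * Remove every device, walking from the tail so that entries added later
 * (the VFs) go before the entry that created them (the PF). The removal
 * order is written to order[], which must hold at least bus->count ids,
 * and the number of removed devices is returned.
 */
size_t demo_remove_all_reverse(struct demo_bus *bus, int *order)
{
	size_t n = 0;

	assert(bus != NULL && order != NULL);
	while (bus->count > 0) {
		bus->count--;
		order[n++] = bus->devs[bus->count].id;
	}
	return n;
}
```

For a bus holding a PF with id 10 followed by VFs 11 and 12, the removal order
is 12, 11, 10, so the PF is still present while each of its VFs is torn down.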

Re: [PATCH V12 0/9] VF EEH on Power8

2015-11-03 Thread Wei Yang
On Wed, Nov 04, 2015 at 04:05:37PM +1100, Alexey Kardashevskiy wrote:
>On 11/04/2015 02:28 PM, Wei Yang wrote:
>>This patchset enables EEH on SRIOV VFs. The general idea is to create proper
>>VF edev and VF PE and handle them properly.
>>
>>Unlike a Bus PE, a VF PE contains just one VF. This changes EEH error
>>handling on a VF PE in several ways.
>>
>>First, a VF's removal and re-enumeration rely on its PF, since a VF is
>>tightly bound to its PF. It is not proper to enumerate a VF by the usual
>>scan procedure. That's why virtfn_add/virtfn_remove are exported in this patch
>>set.
>>
>>Second, the reset/restore of a VF is done in kernel space. FW is not aware of
>>the VF, which means the usual reset function done in FW will not work. One of
>>the patches imitates the reset/restore function in kernel space.
>>
>>Third, the VF may be removed during the PF's error_detected function. In this
>>case, the original error_detected->slot_reset->resume sequence is not proper
>>for those removed VFs, since they are re-created by the PF in a fresh state.
>>A flag in eeh_dev is introduced to mark that the eeh_dev is in error state.
>>By doing so, we track whether this device needs to be reset or not.
>>
>>This has been tested both on host and in guest on Power8 with latest kernel
>>version.
>
>With the small issues in 9/9 fixed,
>
>Reviewed-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>

Thanks for your comment. I appreciate your time and efforts.

Have a good night.

>
>>
>>v12:
>>* rebased on v4.3
>>* Rephrase some commit log to make it more clear and specific
>>* move vf_index assignment in CONFIG_PPC_POWERNV
>>* merge "Cache VF index in pci_dn" with "Support error recovery for VF PE"
>>* check the return value after eeh_dev_init() for VF
>>* initialize the parameter before pass to read_config()
>>* make pnv_pci_fixup_vf_mps() a dedicated patch, which fixup and store mps
>>  value in pci_dn
>>v11:
>>* move vf_index assignment in macro CONFIG_PPC_POWERNV
>>* merge Patch "Cache VF index in pci_dn" into Patch "Support error recovery
>>  for VF PE"
>>v10:
>>* rebased on v4.2
>>* delete the last patch "powerpc/powernv: compound PE for VFs" since after
>>  redesign of SRIOV, there is no compound PE for VFs now.
>>* add two patches which fix problems found during tests
>>  powerpc/eeh: Support error recovery for VF PE
>>  powerpc/eeh: Handle hot removed VF when PF is EEH aware
>>v9:
>>* split pcibios_bus_add_device() into a separate patch
>>* Bjorn acked the PCI part and agreed this patch set to be merged from ppc
>>  tree
>>* rebased on mpe/linux.git next branch
>>v8:
>>* fix on checking the return value of pnv_eeh_do_flr()
>>* introduced a weak function pcibios_bus_add_device() to create PE for VFs
>>v7:
>>* fix compile error when PCI_IOV is not set
>>v6:
>>* code / commit log refactor by Gavin
>>v5:
>>* remove the compound field, iterate on Master VF PE instead
>>* some code refine on PCI config restore and reset on VF
>>  the wait time for assert and deassert
>>  PCI device address format
>>  check on edev->pcie_cap and edev->aer_cap before access them
>>v4:
>>* refine the change logs, comment and code style
>>* change pnv_pci_fixup_vf_eeh() to pnv_eeh_vf_final_fixup() and remove the
>>  CONFIG_PCI_IOV macro
>>* reorder patch 5/6 to make the logic more reasonable
>>* remove remove_dev_pci_data()
>>* remove the EEH_DEV_VF flag, use edev->physfn to identify a VF EEH DEV and
>>  remove related CONFIG_PCI_IOV macro
>>* add the option for VF reset
>>* fix the pnv_eeh_cfg_blocked() logic
>>* replace pnv_pci_cfg_{read,write} with eeh_ops->{read,write}_config in
>>  pnv_eeh_vf_restore_config()
>>* rename pnv_eeh_vf_restore_config() to pnv_eeh_restore_vf_config()
>>* rename pnv_pci_fixup_vf_caps() to pnv_pci_vf_header_fixup() and move it
>>  to arch/powerpc/platforms/powernv/pci.c
>>* add a field compound in pnv_ioda_pe to link compound PEs
>>* handle compound PE for VF PEs
>>v3:
>>* add back vf_index in pci_dn to track the VF's index
>>* rename ppdev in eeh_dev to physfn for consistency
>>* move edev->physfn assignment before dev->dev.archdata.edev is set
>>* mov

Re: [PATCH V12 9/9] powerpc/eeh: Support error recovery for VF PE

2015-11-03 Thread Wei Yang
On Wed, Nov 04, 2015 at 04:01:50PM +1100, Alexey Kardashevskiy wrote:
>On 11/04/2015 02:28 PM, Wei Yang wrote:
>>PFs are enumerated on PCI bus, while VFs are created by PF's driver.
>>
>>EEH recovery has two cases:
>>1. Device and driver are EEH aware: error handlers are called.
>>2. Device and driver are not EEH aware: un-plug the device and plug it again
>>by enumerating it.
>>
>>The special thing happens in the second case. For a PF, we could use the
>>original pci core to enumerate the bus, while for VFs we need to record the
>>VFs which are un-plugged and then plug them again.
>>
>>Also, the patch caches the VF index in pci_dn, which can be used to
>>calculate the VF's bus, device and function number. That information helps to
>>locate the VF's PCI device instance when doing hotplug during EEH recovery
>>if necessary.
>>
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h|   7 ++
>>  arch/powerpc/include/asm/pci-bridge.h |   1 +
>>  arch/powerpc/kernel/eeh.c |   8 +++
>>  arch/powerpc/kernel/eeh_dev.c |   1 +
>>  arch/powerpc/kernel/eeh_driver.c  | 127 
>> +++---
>>  arch/powerpc/kernel/eeh_pe.c  |   3 +-
>>  arch/powerpc/kernel/pci_dn.c  |   4 +-
>>  7 files changed, 123 insertions(+), 28 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index 331c856..1f68190 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -127,6 +127,11 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
>>  #define EEH_DEV_SYSFS   (1 << 9)    /* Sysfs created        */
>>  #define EEH_DEV_REMOVED (1 << 10)   /* Removed permanently  */
>>
>>+struct eeh_rmv_data {
>>+ struct list_head edev_list;
>>+ int removed;
>>+};
>
>
>This struct is only used in arch/powerpc/kernel/eeh_driver.c so move it there.
>

Will move this in next version.

>
>>+
>>  struct eeh_dev {
>>  int mode;   /* EEH mode */
>>  int class_code; /* Class code of the device */
>>@@ -139,9 +144,11 @@ struct eeh_dev {
>>  int af_cap; /* Saved AF capability  */
>>  struct eeh_pe *pe;  /* Associated PE*/
>>  struct list_head list;  /* Form link list in the PE */
>>+ struct list_head rmv_list;  /* Record the removed edev  */
>>  struct pci_controller *phb; /* Associated PHB   */
>>  struct pci_dn *pdn; /* Associated PCI device node   */
>>  struct pci_dev *pdev;   /* Associated PCI device*/
>>+ bool   in_error;/* Error flag for eeh_dev   */
>>  struct pci_dev *physfn; /* Associated PF PORT   */
>>  struct pci_bus *bus;/* PCI bus for partial hotplug  */
>>  };
>>diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>index 9b365d6..533e6e9 100644
>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>@@ -211,6 +211,7 @@ struct pci_dn {
>>  #define IODA_INVALID_PE (-1)
>>  #ifdef CONFIG_PPC_POWERNV
>>  int pe_number;
>>+ int vf_index;   /* VF index in the PF */
>>  #ifdef CONFIG_PCI_IOV
>>  u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
>>  u16 num_vfs;/* number of VFs enabled*/
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index 41a4b30..0f36750 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -1245,6 +1245,14 @@ void eeh_remove_device(struct pci_dev *dev)
>>   * from the parent PE during the BAR resotre.
>>   */
>>  edev->pdev = NULL;
>>+
>>+ /*
>>+  * The flag "in_error" is used to trace EEH devices for VFs
>>+  * in error state or not. It's set in eeh_report_error(). If
>>+  * it's not set, eeh_report_{reset,resume}() won't be called
>>+  * for the VF EEH device.
>>+  */
>>+ edev->in_error = 0;
>
>
>It is a bool, so "= false".
>

Correct.
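
The life cycle of the "in_error" flag described in the quoted comment can be
sketched as below. This is a simplified model, not the kernel code: struct
demo_edev and the demo_*() helpers are hypothetical, and the reset step is
reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced version of an EEH device. */
struct demo_edev {
	bool in_error;		/* set when the error was reported to it */
	int resets_reported;	/* how many times the reset step ran */
};

/* Error reported to the device: remember that it saw the error. */
void demo_report_error(struct demo_edev *edev)
{
	assert(edev != NULL);
	edev->in_error = true;
}

/* Device removed: a re-created device starts in a fresh state. */
void demo_remove_device(struct demo_edev *edev)
{
	assert(edev != NULL);
	edev->in_error = false;
}

/* Reset step: skipped for devices that never saw the error. */
void demo_report_reset(struct demo_edev *edev)
{
	assert(edev != NULL);
	if (!edev->in_error)
		return;
	edev->resets_reported++;
}
```

A VF that was removed and re-created between the error and the reset step is
skipped, because its flag was cleared on removal.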

>
>>  dev->dev.archdata.edev = NULL;
>>  if (!(edev->pe->state & EEH_PE_KEEP))
>>  eeh_rmv_from_parent_pe(edev);

Re: [PATCH V10 10/12] powerpc/eeh: Support error recovery for VF PE

2015-11-02 Thread Wei Yang
On Mon, Nov 02, 2015 at 10:40:36AM +1100, Alexey Kardashevskiy wrote:
>On 11/01/2015 12:53 PM, Wei Yang wrote:
>>On Fri, Oct 30, 2015 at 04:20:48PM +1100, Alexey Kardashevskiy wrote:
>>>On 10/26/2015 02:16 PM, Wei Yang wrote:
>>>>Different from PCI bus dependent PE, PE for VFs doesn't have the
>>>
>>>s/Different from/Unlike/
>>>
>>
>>Will change in next version.
>>
>>>
>>>>primary bus, on which the PCI hotplug is implemented. The patch
>>>>supports error recovery, especially the PCI hotplug for VF's PE.
>>>
>>>The patch adds support for error recovery of what exactly?
>>>What is "especially" about?
>>>
>>
>>PFs are enumerated on PCI bus, while VFs are created by PF's driver.
>>
>>EEH recovery has two cases:
>>1. Device and driver are EEH aware: error handlers are called.
>>2. Device and driver are not EEH aware: un-plug the device and plug it again by
>>enumerating it.
>>
>>The special thing happens in the second case. For a PF, we could use the
>>original pci core to enumerate the bus, while for VFs, we need to record the
>>VFs which are un-plugged and then plug them again.
>
>
>Right. This should have been the actual commit log.
>
>
>>>
>>>>The hotplug on VF's PE is implemented based on VFs, instead of
>>>>PCI bus any more.
>>>
>>>Needs rephrase.
>>>
>>>Is this patch about EEH error recovery, i.e. unplug VF, re-plug VF? Why does
>>>the commit log talk about PE hotplug? I thought we do VF (i.e. PCI device)
>>>hotplug, not PE.
>>>
>>
>>Hmm... unlike the Bus PE for PFs, VF PE is dynamically created and released
>>when VFs are created and released.
>
>
>Sure. PEs are created/released, not plugged/unplugged (VFs are), that was my
>point.
>

Thanks for the suggestion, will change it in next version.

>
>>
>>>
>>>>
>>>>[gwshan: changelog and code refactoring]
>>>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>>>---
>>>>  arch/powerpc/include/asm/eeh.h   |   1 +
>>>>  arch/powerpc/kernel/eeh.c|   8 
>>>>  arch/powerpc/kernel/eeh_driver.c | 100 
>>>> +++
>>>>  arch/powerpc/kernel/eeh_pe.c |   3 +-
>>>>  4 files changed, 90 insertions(+), 22 deletions(-)
>>>>
>>>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>>>index 331c856..ea1f13c4 100644
>>>>--- a/arch/powerpc/include/asm/eeh.h
>>>>+++ b/arch/powerpc/include/asm/eeh.h
>>>>@@ -142,6 +142,7 @@ struct eeh_dev {
>>>>struct pci_controller *phb; /* Associated PHB   */
>>>>struct pci_dn *pdn; /* Associated PCI device node   */
>>>>struct pci_dev *pdev;   /* Associated PCI device*/
>>>>+   int in_error;               /* Error flag for eeh_dev   */
>>>
>>>Make it "bool".
>>>
>>
>>Will change it in next version.
>>
>>>
>>>>struct pci_dev *physfn; /* Associated PF PORT   */
>>>>struct pci_bus *bus;/* PCI bus for partial hotplug  */
>>>>  };
>>>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>>>index af9b597..28e4d73 100644
>>>>--- a/arch/powerpc/kernel/eeh.c
>>>>+++ b/arch/powerpc/kernel/eeh.c
>>>>@@ -1227,6 +1227,14 @@ void eeh_remove_device(struct pci_dev *dev)
>>>> * from the parent PE during the BAR resotre.
>>>> */
>>>>edev->pdev = NULL;
>>>>+
>>>>+   /*
>>>>+* The flag "in_error" is used to trace EEH devices for VFs
>>>>+* in error state or not. It's set in eeh_report_error(). If
>>>>+* it's not set, eeh_report_{reset,resume}() won't be called
>>>>+* for the VF EEH device.
>>>>+*/
>>>>+   edev->in_error = 0;
>>>>dev->dev.archdata.edev = NULL;
>>>>if (!(edev->pe->state & EEH_PE_KEEP))
>>>>eeh_rmv_from_parent_pe(edev);
>>>>diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>>>>index 89eb4bc..99868e2 100644
>>>>--- a/arch/powerpc/kernel/eeh_driver.c
>>>>+++ 

Re: [PATCH V10 08/12] powerpc/powernv: Support EEH reset for VF PE

2015-11-02 Thread Wei Yang
On Fri, Oct 30, 2015 at 07:05:05PM +1100, Alexey Kardashevskiy wrote:
>On 10/30/2015 06:18 PM, Wei Yang wrote:
>>On Fri, Oct 30, 2015 at 03:11:20PM +1100, Alexey Kardashevskiy wrote:
>>>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>>>PEs for VFs don't have primary bus. So they have to have their own reset
>>>>backend, which is used during EEH recovery. The patch implements the reset
>>>>backend for VF's PE by issuing FLR or AF FLR to the VFs, which are contained
>>>>in the PE.
>>>>
>>>>[gwshan: changelog and code refactoring]
>>>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>>>---
>>>>  arch/powerpc/include/asm/eeh.h   |   1 +
>>>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 134 
>>>> ++-
>>>>  2 files changed, 134 insertions(+), 1 deletion(-)
>>>>
>>>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>>>index ec21f8f..331c856 100644
>>>>--- a/arch/powerpc/include/asm/eeh.h
>>>>+++ b/arch/powerpc/include/asm/eeh.h
>>>>@@ -136,6 +136,7 @@ struct eeh_dev {
>>>>int pcix_cap;   /* Saved PCIx capability*/
>>>>int pcie_cap;   /* Saved PCIe capability*/
>>>>int aer_cap;/* Saved AER capability */
>>>>+   int af_cap; /* Saved AF capability  */
>>>>struct eeh_pe *pe;  /* Associated PE*/
>>>>struct list_head list;  /* Form link list in the PE */
>>>>struct pci_controller *phb; /* Associated PHB   */
>>>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>>>index cfd55dd..017cd72 100644
>>>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>>>@@ -404,6 +404,7 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
>>>>edev->pcix_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_PCIX);
>>>>edev->pcie_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_EXP);
>>>>edev->aer_cap  = pnv_eeh_find_ecap(pdn, PCI_EXT_CAP_ID_ERR);
>>>>+   edev->af_cap   = pnv_eeh_find_cap(pdn, PCI_CAP_ID_AF);
>>>>if ((edev->class_code >> 8) == PCI_CLASS_BRIDGE_PCI) {
>>>>edev->mode |= EEH_DEV_BRIDGE;
>>>>if (edev->pcie_cap) {
>>>>@@ -893,6 +894,127 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>>>return 0;
>>>>  }
>>>>
>>>>+static void pnv_eeh_wait_for_pending(struct pci_dn *pdn, int pos,
>>>>+u16 mask, bool af_flr_rst)
>
>Missed this - @af_flr_rst is only used for warnings so better do:
>s/bool af_flr_rst/const char *reset_type/
>to make it explicit.
>

Looks good, will change in next version.

>
>>>>+{
>>>>+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>>>+   int status, i;
>>>>+
>>>>+   /* Wait for Transaction Pending bit to be cleared */
>>>>+   for (i = 0; i < 4; i++) {
>>>>+   eeh_ops->read_config(pdn, pos, 2, &status);
>>>
>>>
>>>gcc should have complained on using uninitialized @status here.
>>>
>>
>>I removed the object file and recompiled the file, but did not see the warning.
>
>Hm. Does not warn me either.
>
>>I also took a look at other places where read_config() is called. The last
>>parameter is not initialized before the call there either.
>
>So? It does not make it right.
>
>>You see the error during build?
>
>Why does it matter? We have an undefined behavior here which we should not.
>You could test the return values from read_config() but you do not so at
>least initialize local variables.
>

I believe your concern is reasonable.

I suggest a separate patch that fixes the read_config() calls by initializing
the last parameter.
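
A minimal sketch of what such a fix could look like for the pending-bit wait
loop, assuming a reader that reports failure through its return value. All
names here (demo_read_fn, demo_sleep_fn, demo_wait_for_pending, struct
demo_fake, DEMO_PENDING) are hypothetical; the delay goes through a callback
so the sketch does not depend on msleep().

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define DEMO_PENDING 0x20u	/* illustrative "transaction pending" bit */

/* Hypothetical config-space reader: returns 0 on success and fills *val. */
typedef int (*demo_read_fn)(void *ctx, int pos, uint32_t *val);
/* Hypothetical delay hook standing in for msleep(). */
typedef void (*demo_sleep_fn)(void *ctx, unsigned int ms);

/*
 * Poll until none of the bits in mask are set, backing off 100, 200, 400
 * and 800 ms after each busy read. Returns 0 once the bits clear, -1 if
 * they are still set after four attempts or if a read fails.
 */
int demo_wait_for_pending(void *ctx, int pos, uint32_t mask,
			  demo_read_fn read_cfg, demo_sleep_fn sleep_ms)
{
	int i;

	assert(read_cfg != NULL && sleep_ms != NULL);
	for (i = 0; i < 4; i++) {
		uint32_t status = 0;	/* defined even if the read writes nothing */

		if (read_cfg(ctx, pos, &status) != 0)
			return -1;
		if (!(status & mask))
			return 0;
		sleep_ms(ctx, (1u << i) * 100);
	}
	return -1;
}

/* Test double: reports the pending bit for the first busy_reads reads. */
struct demo_fake {
	int busy_reads;
	unsigned int slept_ms;
};

int demo_fake_read(void *ctx, int pos, uint32_t *val)
{
	struct demo_fake *f = ctx;

	(void)pos;
	*val = (f->busy_reads > 0) ? DEMO_PENDING : 0;
	if (f->busy_reads > 0)
		f->busy_reads--;
	return 0;
}

void demo_fake_sleep(void *ctx, unsigned int ms)
{
	struct demo_fake *f = ctx;

	f->slept_ms += ms;	/* record the delay instead of sleeping */
}
```

Initializing status and checking the return value addresses both halves of
the concern: the loop never tests an indeterminate value, and a failed read is
not mistaken for a cleared pending bit.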

>
>>
>>>
>>>>+   if (!(status & mask))
>>>>+   return;
>>>>+
>>>>+   msleep((1 << i) * 100);
>>>>+   }
>>>>+
>>>>+   pr_warn("%s: Pending transaction while issuing %s FLR to "
>>>>+   "%04x:%02x:%02x.%01x\n",
>>>
>>

Re: [PATCH V10 10/12] powerpc/eeh: Support error recovery for VF PE

2015-10-31 Thread Wei Yang
On Fri, Oct 30, 2015 at 04:20:48PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:16 PM, Wei Yang wrote:
>>Different from PCI bus dependent PE, PE for VFs doesn't have the
>
>s/Different from/Unlike/
>

Will change in next version.

>
>>primary bus, on which the PCI hotplug is implemented. The patch
>>supports error recovery, especially the PCI hotplug for VF's PE.
>
>The patch adds support for error recovery of what exactly?
>What is "especially" about?
>

PFs are enumerated on the PCI bus, while VFs are created by the PF's driver.

EEH recovery has two cases:
1. Device and driver are EEH aware: error handlers are called.
2. Device and driver are not EEH aware: un-plug the device and plug it again by
   enumerating it.

The special thing happens in the second case. For a PF, we could use the
original pci core to enumerate the bus, while for VFs, we need to record the
VFs which are un-plugged and then plug them again.
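
The "record, then plug again" idea for VFs can be sketched as below. This is
a simplified model under assumptions, not the code in the patch: struct
demo_pf, struct demo_rmv_data and the demo_*() helpers are hypothetical, and a
VF is reduced to a present/absent flag indexed by its VF index.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_VFS 8

/* Hypothetical record of the VFs removed during recovery, kept by VF index. */
struct demo_rmv_data {
	int vf_index[DEMO_MAX_VFS];
	size_t removed;
};

/* Hypothetical per-PF state: which VF indexes are currently present. */
struct demo_pf {
	int vf_present[DEMO_MAX_VFS];
};

/* Unplug one VF and remember its index so it can be re-created later. */
void demo_unplug_vf(struct demo_pf *pf, struct demo_rmv_data *rmv, int idx)
{
	assert(pf != NULL && rmv != NULL);
	assert(idx >= 0 && idx < DEMO_MAX_VFS);
	if (!pf->vf_present[idx] || rmv->removed >= DEMO_MAX_VFS)
		return;
	pf->vf_present[idx] = 0;
	rmv->vf_index[rmv->removed++] = idx;
}

/* Re-create exactly the VFs that were recorded; returns how many came back. */
size_t demo_replug_vfs(struct demo_pf *pf, struct demo_rmv_data *rmv)
{
	size_t i, n;

	assert(pf != NULL && rmv != NULL);
	n = rmv->removed;
	for (i = 0; i < n; i++)
		pf->vf_present[rmv->vf_index[i]] = 1;
	rmv->removed = 0;
	return n;
}
```

A bus rescan cannot find the VFs, so the recorded indexes are the only way to
know which ones to bring back after the PF has recovered.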

>
>>The hotplug on VF's PE is implemented based on VFs, instead of
>>PCI bus any more.
>
>Needs rephrase.
>
>Is this patch about EEH error recovery, i.e. unplug VF, re-plug VF? Why does
>the commit log talk about PE hotplug? I thought we do VF (i.e. PCI device)
>hotplug, not PE.
>

Hmm... unlike the Bus PE for PFs, VF PE is dynamically created and released
when VFs are created and released.

>
>>
>>[gwshan: changelog and code refactoring]
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h   |   1 +
>>  arch/powerpc/kernel/eeh.c|   8 
>>  arch/powerpc/kernel/eeh_driver.c | 100 
>> +++
>>  arch/powerpc/kernel/eeh_pe.c |   3 +-
>>  4 files changed, 90 insertions(+), 22 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index 331c856..ea1f13c4 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -142,6 +142,7 @@ struct eeh_dev {
>>  struct pci_controller *phb; /* Associated PHB   */
>>  struct pci_dn *pdn; /* Associated PCI device node   */
>>  struct pci_dev *pdev;   /* Associated PCI device*/
>>+ int in_error;               /* Error flag for eeh_dev   */
>
>Make it "bool".
>

Will change it in next version.

>
>>  struct pci_dev *physfn; /* Associated PF PORT   */
>>  struct pci_bus *bus;/* PCI bus for partial hotplug  */
>>  };
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index af9b597..28e4d73 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -1227,6 +1227,14 @@ void eeh_remove_device(struct pci_dev *dev)
>>   * from the parent PE during the BAR resotre.
>>   */
>>  edev->pdev = NULL;
>>+
>>+ /*
>>+  * The flag "in_error" is used to trace EEH devices for VFs
>>+  * in error state or not. It's set in eeh_report_error(). If
>>+  * it's not set, eeh_report_{reset,resume}() won't be called
>>+  * for the VF EEH device.
>>+  */
>>+ edev->in_error = 0;
>>  dev->dev.archdata.edev = NULL;
>>  if (!(edev->pe->state & EEH_PE_KEEP))
>>  eeh_rmv_from_parent_pe(edev);
>>diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>>index 89eb4bc..99868e2 100644
>>--- a/arch/powerpc/kernel/eeh_driver.c
>>+++ b/arch/powerpc/kernel/eeh_driver.c
>>@@ -211,6 +211,7 @@ static void *eeh_report_error(void *data, void *userdata)
>>  if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
>>  if (*res == PCI_ERS_RESULT_NONE) *res = rc;
>>
>>+ edev->in_error = 1;
>>  eeh_pcid_put(dev);
>>  return NULL;
>>  }
>>@@ -282,7 +283,8 @@ static void *eeh_report_reset(void *data, void *userdata)
>>
>>  if (!driver->err_handler ||
>>  !driver->err_handler->slot_reset ||
>>- (edev->mode & EEH_DEV_NO_HANDLER)) {
>>+ (edev->mode & EEH_DEV_NO_HANDLER) ||
>>+ (!edev->in_error)) {
>>  eeh_pcid_put(dev);
>>  return NULL;
>>  }
>>@@ -339,14 +341,16 @@ static void *eeh_report_resume(void *data, void *userdata)
>>
>
>bool was_in_error = edev->in_error;
>edev->in_error = false;
>
>then use was_in_error below and there is no need to replace return with goto,
>etc -> slight

Re: [PATCH V10 11/12] powerpc/eeh: Don't block PCI config on resetting VF PE

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 04:42:07PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:16 PM, Wei Yang wrote:
>>From: Gavin Shan <gws...@linux.vnet.ibm.com>
>>
>>When passing through SRIOV VF from host to guest via VFIO PCI
>>infrastructure, the VF is resetted by EEH specific backend
>>(pcibios_set_pcie_reset_state()). We can't block the PCI config,
>>otherwise, the reset (FLR or AF FLR), to be completed by PCI
>>config access to the VF, can't be done. Then the VF can't be
>>put into initial state when passing it to the guest and returning
>>back to the host.
>>
>>The patch just doesn't block the VF's PCI config space when doing
>>the reset. It fixes EEH error caused by DMA traffic to bogus DMA
>>address on restarting guest after killing the QEMU process, which
>>includes Mellanox VF passed through from host.
>
>The patch as it is makes sense as a bugfix for our internal tree where the
>EEH VF feature was present at the time when this patch was posted but in this
>patchset it makes more sense to merge it into:
>
>[PATCH V10 08/12] powerpc/powernv: Support EEH reset for VF PE
>
>as it is quite weird within one patchset to introduce a problem  and then fix
>it in a following patch.
>

Sure, got it.

>
>>Reported-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>Tested-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>
>Remove "sob: aik@..." please.
>
>
>>---
>>  arch/powerpc/kernel/eeh.c | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index 28e4d73..e1846f5 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -745,7 +745,8 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
>>  case pcie_deassert_reset:
>>  eeh_ops->reset(pe, EEH_RESET_DEACTIVATE);
>>  eeh_unfreeze_pe(pe, false);
>>- eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
>>+ if (!(pe->type & EEH_PE_VF))
>>+ eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
>>  eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
>>  eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
>>  break;
>>@@ -753,14 +754,16 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
>>  eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
>>  eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
>>  eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
>>- eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
>>+ if (!(pe->type & EEH_PE_VF))
>>+ eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
>>  eeh_ops->reset(pe, EEH_RESET_HOT);
>>  break;
>>  case pcie_warm_reset:
>>  eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
>>  eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
>>  eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
>>- eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
>>+ if (!(pe->type & EEH_PE_VF))
>>+ eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
>>  eeh_ops->reset(pe, EEH_RESET_FUNDAMENTAL);
>>  break;
>>  default:
>>
>
>
>-- 
>Alexey

-- 
Richard Yang
Help you, Help me

Re: [PATCH V10 08/12] powerpc/powernv: Support EEH reset for VF PE

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 03:11:20PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>PEs for VFs don't have primary bus. So they have to have their own reset
>>backend, which is used during EEH recovery. The patch implements the reset
>>backend for VF's PE by issuing FLR or AF FLR to the VFs, which are contained
>>in the PE.
>>
>>[gwshan: changelog and code refactoring]
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h   |   1 +
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 134 
>> ++-
>>  2 files changed, 134 insertions(+), 1 deletion(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index ec21f8f..331c856 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -136,6 +136,7 @@ struct eeh_dev {
>>  int pcix_cap;   /* Saved PCIx capability*/
>>  int pcie_cap;   /* Saved PCIe capability*/
>>  int aer_cap;/* Saved AER capability */
>>+ int af_cap; /* Saved AF capability  */
>>  struct eeh_pe *pe;  /* Associated PE*/
>>  struct list_head list;  /* Form link list in the PE */
>>  struct pci_controller *phb; /* Associated PHB   */
>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>index cfd55dd..017cd72 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>@@ -404,6 +404,7 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
>>  edev->pcix_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_PCIX);
>>  edev->pcie_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_EXP);
>>  edev->aer_cap  = pnv_eeh_find_ecap(pdn, PCI_EXT_CAP_ID_ERR);
>>+ edev->af_cap   = pnv_eeh_find_cap(pdn, PCI_CAP_ID_AF);
>>  if ((edev->class_code >> 8) == PCI_CLASS_BRIDGE_PCI) {
>>  edev->mode |= EEH_DEV_BRIDGE;
>>  if (edev->pcie_cap) {
>>@@ -893,6 +894,127 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>  return 0;
>>  }
>>
>>+static void pnv_eeh_wait_for_pending(struct pci_dn *pdn, int pos,
>>+  u16 mask, bool af_flr_rst)
>>+{
>>+ struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>+ int status, i;
>>+
>>+ /* Wait for Transaction Pending bit to be cleared */
>>+ for (i = 0; i < 4; i++) {
>>+ eeh_ops->read_config(pdn, pos, 2, &status);
>
>
>gcc should have complained on using uninitialized @status here.
>

I removed the object file and recompiled the file, but did not see the warning.
I also took a look at other places where read_config() is called. The last
parameter is not initialized before the call there either.

Do you see the error during the build?

>
>>+ if (!(status & mask))
>>+ return;
>>+
>>+ msleep((1 << i) * 100);
>>+ }
>>+
>>+ pr_warn("%s: Pending transaction while issuing %s FLR to "
>>+ "%04x:%02x:%02x.%01x\n",
>
>Do not wrap user-visible strings.
>

Will change this.

>
>>+ __func__, af_flr_rst ? "AF" : "",
>>+ edev->phb->global_number, pdn->busno,
>>+ PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
>>+}
>>+
>>+static int pnv_eeh_do_flr(struct pci_dn *pdn, int option)
>>+{
>>+ struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>+ u32 reg;
>>+
>>+ if (!edev->pcie_cap)
>>+ return -ENOTTY;
>
>
>Can pnv_eeh_do_flr() be really called on a non PCIe device, can we get that
>far? WARN_ON_ONCE() may be?
>

So you suggest adding a WARN_ON_ONCE() for this condition, right?

>
>>+
>>+ eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCAP, 4, &reg);
>
>
>... and here about uninitialized @reg.
>
>
>>+ if (!(reg & PCI_EXP_DEVCAP_FLR))
>>+ return -ENOTTY;
>>+
>>+ switch (option) {
>>+ case EEH_RESET_HOT:
>>+ case EEH_RESET_FUNDAMENTAL:
>>+ pnv_eeh_wait_for_pending(pdn, edev->pcie_cap + PCI_EXP_DEVSTA,
>>+  PCI_EXP_DEVSTA_TRPND, false);
>>+ eeh_ops->read_

Re: [PATCH V10 06/12] powerpc/powernv: EEH device for VF

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 02:33:49PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>VFs and their corresponding pci_dn instances are created and released
>>dynamically as their PF's SRIOV capability is enabled and disabled.
>>The patch creates and releases EEH devices for VFs when creating and
>>releasing their pci_dn instances, which means EEH devices and pci_dn
>>instances have same life cycle. Also, VF's EEH device is identified
>>by (struct eeh_dev::physfn).
>
>
>The add_dev_pci_data() helper (the one you hack) does not create pci_dn
>instances. The add_one_dev_pci_data() helper does.
>

Yes, you are right. The patch here creates the edev after the pci_dn is created.

So which part of the log do you think is not accurate?

>
>>
>>[gwshan: changelog and removed CONFIG_PCI_IOV]
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h |  1 +
>>  arch/powerpc/kernel/pci_dn.c   | 12 
>>  2 files changed, 13 insertions(+)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index c5eb86f..6c383ad 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -140,6 +140,7 @@ struct eeh_dev {
>>  struct pci_controller *phb; /* Associated PHB   */
>>  struct pci_dn *pdn; /* Associated PCI device node   */
>>  struct pci_dev *pdev;   /* Associated PCI device*/
>>+ struct pci_dev *physfn; /* Associated PF PORT   */
>>  struct pci_bus *bus;/* PCI bus for partial hotplug  */
>>  };
>>
>>diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
>>index f771130..f0ddde7 100644
>>--- a/arch/powerpc/kernel/pci_dn.c
>>+++ b/arch/powerpc/kernel/pci_dn.c
>>@@ -180,7 +180,9 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
>>  struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
>>  {
>>  #ifdef CONFIG_PCI_IOV
>>+ struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>  struct pci_dn *parent, *pdn;
>>+ struct eeh_dev *edev;
>>  int i;
>>
>>  /* Only support IOV for now */
>>@@ -206,6 +208,9 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
>>   __func__, i);
>>  return NULL;
>>  }
>>+ eeh_dev_init(pdn, hose);
>>+ edev = pdn_to_eeh_dev(pdn);
>
>
>In theory, pdn_to_eeh_dev() can return NULL. In this patch, it is not clear
>if pdn->edev gets initialized before or after add_dev_pci_data().
>

Yep, the return value should be checked.

pdn->edev is initialized in eeh_dev_init(), which is called in
add_dev_pci_data(). Is the order not clear?

>
>
>>+ edev->physfn = pdev;
>>  }
>>  #endif /* CONFIG_PCI_IOV */
>>
>>@@ -254,10 +259,17 @@ void remove_dev_pci_data(struct pci_dev *pdev)
>>  for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>>  list_for_each_entry_safe(pdn, tmp,
>>  &parent->child_list, list) {
>>+ struct eeh_dev *edev;
>>  if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
>>  pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
>>  continue;
>>
>>+ edev = pdn_to_eeh_dev(pdn);
>>+ if (edev) {
>>+ pdn->edev = NULL;
>>+ kfree(edev);
>>+ }
>>+
>>  if (!list_empty(&pdn->list))
>>  list_del(&pdn->list);
>>
>>
>
>
>-- 
>Alexey
>--
>To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me

Re: [PATCH V10 12/12] powerpc/eeh: Handle hot removed VF when PF is EEH aware

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 04:35:54PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:16 PM, Wei Yang wrote:
>>When the PF is EEH aware while its VFs are not, the VFs will be removed
>>during EEH recovery. This is not supported in the current code, which leads
>>to the VFs being lost.
>>
>>This patch fixes this by adding the VFs back. VFs should be added back after
>>the PF has recovered properly.
>>
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>
>btw please remove my "sob" from this patchset (here and 11/12 at least) as I
>did not "sob" the upstream versions of these and I did not post them and
>there is no public tree of mine with these patches. When I agree that the
>patches are good to go, it will be "reviewed-by" or "acked-by".
>

Sure, I will follow this rule in the future.

>
>>---
>>  arch/powerpc/include/asm/eeh.h   |  6 ++
>>  arch/powerpc/kernel/eeh_dev.c|  1 +
>>  arch/powerpc/kernel/eeh_driver.c | 30 +++---
>>  3 files changed, 30 insertions(+), 7 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index ea1f13c4..c529a23 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -127,6 +127,11 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
>>  #define EEH_DEV_SYSFS   (1 << 9)/* Sysfs created
>> */
>>  #define EEH_DEV_REMOVED (1 << 10)   /* Removed permanently  
>> */
>>
>>+struct eeh_rmv_data {
>>+ struct list_head edev_list;
>>+ int removed;
>>+};
>>+
>>  struct eeh_dev {
>>  int mode;   /* EEH mode */
>>  int class_code; /* Class code of the device */
>>@@ -139,6 +144,7 @@ struct eeh_dev {
>>  int af_cap; /* Saved AF capability  */
>>  struct eeh_pe *pe;  /* Associated PE*/
>>  struct list_head list;  /* Form link list in the PE */
>>+ struct list_head rmv_list;  /* Record the removed edev  */
>>  struct pci_controller *phb; /* Associated PHB   */
>>  struct pci_dn *pdn; /* Associated PCI device node   */
>>  struct pci_dev *pdev;   /* Associated PCI device*/
>>diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
>>index aabba94..7815095 100644
>>--- a/arch/powerpc/kernel/eeh_dev.c
>>+++ b/arch/powerpc/kernel/eeh_dev.c
>>@@ -67,6 +67,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>  edev->pdn = pdn;
>>  edev->phb = phb;
>>  INIT_LIST_HEAD(&edev->list);
>>+ INIT_LIST_HEAD(&edev->rmv_list);
>>
>>  return NULL;
>>  }
>>diff --git a/arch/powerpc/kernel/eeh_driver.c 
>>b/arch/powerpc/kernel/eeh_driver.c
>>index 99868e2..f2406b4 100644
>>--- a/arch/powerpc/kernel/eeh_driver.c
>>+++ b/arch/powerpc/kernel/eeh_driver.c
>>@@ -420,7 +420,8 @@ static void *eeh_rmv_device(void *data, void *userdata)
>>  struct pci_driver *driver;
>>  struct eeh_dev *edev = (struct eeh_dev *)data;
>>  struct pci_dev *dev = eeh_dev_to_pci_dev(edev);
>>- int *removed = (int *)userdata;
>>+ struct eeh_rmv_data *rmv_data = (struct eeh_rmv_data *)userdata;
>>+ int *removed = rmv_data ? &rmv_data->removed : NULL;
>
>
>You just touched @userdata/@removed in [10/12] and now you are touching it
>again.
>
>It feels like this patch is better to be merged into [10/12], this will
>reduce the noise about the userdata pointer changes passed into
>eeh_pe_dev_traverse() and make more sense as "powerpc/eeh: Support error
>recovery for VF PE" without adding VFs back is pretty useless.
>

Agree, will merge them.

>
>
>
>>  struct pci_dn *pdn = eeh_dev_to_pdn(edev);
>>
>>  /*
>>@@ -467,6 +468,9 @@ static void *eeh_rmv_device(void *data, void *userdata)
>>   * required to plug the VF successfully.
>>   */
>>  pdn->pe_number = IODA_INVALID_PE;
>>+
>>+ if (rmv_data)
>>+ list_add(&edev->rmv_list, &rmv_data->edev_list);
>>  } else {
>>  pci_lock_rescan_remove();
>>  pci_stop_and_remove_bus_device(dev);
>>@@ -585,11 +589,12 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>>   * During the reset, udev might be invoked because those affected
>>

Re: [PATCH V10 04/12] powerpc/pci: Remove VFs prior to PF

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 02:04:12PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>As commit ac205b7bb72f ("PCI: make sriov work with hotplug remove") indicates,
>>VFs, which might be hooked to same PCI bus as their PF should be removed
>
>A comma is missing before "should be" (or you did not need a comma after
>"VFs" may be :) ).
>

I think you are right.

>
>>before the PF. Otherwise, the PCI hot unplugging on the PCI bus would
>
>s/on/of/? "Unplugging on" does not make much sense to me in this context at
>least.
>

Sounds like I need to improve my English :-)

By "on" I meant that those PCI devices are attached to the PCI bus. So is "of"
the correct word?

Would changing "unplugging" to "removing" be better?

>
>>cause kernel crash.
>>
>>The patch applies the above pattern to PowerPC PCI hotplug path.
>>
>>[gwshan: changelog]
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/kernel/pci-hotplug.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>>diff --git a/arch/powerpc/kernel/pci-hotplug.c 
>>b/arch/powerpc/kernel/pci-hotplug.c
>>index 7f9ed0c..59c4361 100644
>>--- a/arch/powerpc/kernel/pci-hotplug.c
>>+++ b/arch/powerpc/kernel/pci-hotplug.c
>>@@ -55,7 +55,7 @@ void pcibios_remove_pci_devices(struct pci_bus *bus)
>>
>>  pr_debug("PCI: Removing devices on bus %04x:%02x\n",
>>   pci_domain_nr(bus),  bus->number);
>>- list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list) {
>>+ list_for_each_entry_safe_reverse(dev, tmp, &bus->devices, bus_list) {
>>  pr_debug("   Removing %s...\n", pci_name(dev));
>>  pci_stop_and_remove_bus_device(dev);
>>  }
>>
>
>
>-- 
>Alexey

-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 06/12] powerpc/powernv: EEH device for VF

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 06:36:01PM +1100, Alexey Kardashevskiy wrote:
>On 10/30/2015 05:52 PM, Wei Yang wrote:
>>On Fri, Oct 30, 2015 at 02:33:49PM +1100, Alexey Kardashevskiy wrote:
>>>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>>>VFs and their corresponding pci_dn instances are created and released
>>>>dynamically as their PF's SRIOV capability is enabled and disabled.
>>>>The patch creates and releases EEH devices for VFs when creating and
>>>>releasing their pci_dn instances, which means EEH devices and pci_dn
>>>>instances have same life cycle. Also, VF's EEH device is identified
>>>>by (struct eeh_dev::physfn).
>>>
>>>
>>>The add_dev_pci_data() helper (the one you hack) does not create pci_dn
>>>instances. The add_one_dev_pci_data() helper does.
>>>
>>
>>Yes, you are right. The patch here creates the edev after the pci_dn is created.
>>
>>So which part of the log do you think is not accurate?
>
>
>The commit log is ok, I was just thinking out loud that eeh_dev_init() could go
>into add_one_dev_pci_data() to make things clearer.
>

I think this is a good suggestion.

My thought is that, when we don't have VFs, pci_dn and edev are two different
things. We create the pci_dn first and then initialize the edev, so mixing their
initialization together is not that clear.

Not sure whether you agree.

>
>
>>>
>>>>
>>>>[gwshan: changelog and removed CONFIG_PCI_IOV]
>>>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>>>---
>>>>  arch/powerpc/include/asm/eeh.h |  1 +
>>>>  arch/powerpc/kernel/pci_dn.c   | 12 
>>>>  2 files changed, 13 insertions(+)
>>>>
>>>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>>>index c5eb86f..6c383ad 100644
>>>>--- a/arch/powerpc/include/asm/eeh.h
>>>>+++ b/arch/powerpc/include/asm/eeh.h
>>>>@@ -140,6 +140,7 @@ struct eeh_dev {
>>>>struct pci_controller *phb; /* Associated PHB   */
>>>>struct pci_dn *pdn; /* Associated PCI device node   */
>>>>struct pci_dev *pdev;   /* Associated PCI device*/
>>>>+   struct pci_dev *physfn; /* Associated PF PORT   */
>>>>struct pci_bus *bus;/* PCI bus for partial hotplug  */
>>>>  };
>>>>
>>>>diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
>>>>index f771130..f0ddde7 100644
>>>>--- a/arch/powerpc/kernel/pci_dn.c
>>>>+++ b/arch/powerpc/kernel/pci_dn.c
>>>>@@ -180,7 +180,9 @@ static struct pci_dn *add_one_dev_pci_data(struct 
>>>>pci_dn *parent,
>>>>  struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
>>>>  {
>>>>  #ifdef CONFIG_PCI_IOV
>>>>+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>>>struct pci_dn *parent, *pdn;
>>>>+   struct eeh_dev *edev;
>>>>int i;
>>>>
>>>>/* Only support IOV for now */
>>>>@@ -206,6 +208,9 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
>>>> __func__, i);
>>>>return NULL;
>>>>}
>>>>+   eeh_dev_init(pdn, hose);
>>>>+   edev = pdn_to_eeh_dev(pdn);
>>>
>>>
>>>In theory, pdn_to_eeh_dev() can return NULL. In this patch, it is not clear
>>>if pdn->edev gets initialized before or after add_dev_pci_data().
>>>
>>
>>Yep, the return value should be checked.
>
>Maybe BUG_ON will be enough, up to you.
>

Yep, thanks.
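
For illustration, the check discussed above could look roughly like the sketch
below. It is a userspace mock, not the kernel code: the struct layouts and the
mock_* helper names are stand-ins invented for this example, and the kernel
version could use BUG_ON(!edev) instead of returning an error.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types: simplified mocks, not the kernel's definitions. */
struct pci_dev { int id; };
struct eeh_dev { struct pci_dev *physfn; };
struct pci_dn  { struct eeh_dev *edev; };

/* Plays the role of pdn_to_eeh_dev(): may legitimately return NULL. */
static struct eeh_dev *mock_pdn_to_eeh_dev(struct pci_dn *pdn)
{
	return pdn ? pdn->edev : NULL;
}

/*
 * Hypothetical helper: record the PF in the VF's edev.
 * Returns 0 on success, -1 if no edev is attached to the pdn.
 */
static int mock_set_physfn(struct pci_dn *pdn, struct pci_dev *pf)
{
	struct eeh_dev *edev = mock_pdn_to_eeh_dev(pdn);

	if (!edev)
		return -1;	/* kernel code could use BUG_ON(!edev) here */

	edev->physfn = pf;
	return 0;
}
```

The point is only that edev is tested before edev->physfn is written.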

>
>>
>>pdn->edev is initialized in eeh_dev_init(), which is called in
>>add_dev_pci_data(). Is the order not clear?
>>
>>>
>>>
>>>>+   edev->physfn = pdev;
>>>>}
>>>>  #endif /* CONFIG_PCI_IOV */
>>>>
>>>>@@ -254,10 +259,17 @@ void remove_dev_pci_data(struct pci_dev *pdev)
>>>>for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>>>>list_for_each_entry_safe(pdn, tmp,
>>>>&parent->child_list, list) {
>>>>+   struct eeh_dev *edev;
>>>>if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
>>>>pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
>>>>continue;
>>>>
>>>>+   edev = pdn_to_eeh_dev(pdn);
>>>>+   if (edev) {
>>>>+   pdn->edev = NULL;
>>>>+   kfree(edev);
>>>>+   }
>>>>+
>>>>if (!list_empty(&pdn->list))
>>>>list_del(&pdn->list);
>>>>
>>>>
>>>
>
>
>-- 
>Alexey

-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 07/12] powerpc/eeh: Create PE for VFs

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 02:46:35PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>Current EEH recovery code works with the assumption: the PE has primary
>>bus. Unfortunately, that's not true for VF PEs, which generally contains
>>one or multiple VFs (for VF group case).
>
>What is that "VF group case"? Is not it a "compound PE" thingy which you were
>removing in "SRIOV redesign patchset"?
>

I think you are right.

The commit log is not correct, especially after the SRIOV redesign.
I will remove this part.

>The patch might be ok but the commit log above does not explain why the
>existing way of PEs allocation would not work - somehow it works for a
>primary bus now, why would not it work on other buses?
>
>
>>The patch creates PEs for VFs in the weak function
>>pcibios_bus_add_device(). Those PEs for VFs are identified with the newly
>>introduced flag EEH_PE_VF so that we handle them differently during EEH
>>recovery.
>>
>>[gwshan: changelog and code refactoring]
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h   |  1 +
>>  arch/powerpc/kernel/eeh_pe.c | 10 --
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 16 
>>  3 files changed, 25 insertions(+), 2 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index 6c383ad..ec21f8f 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -72,6 +72,7 @@ struct pci_dn;
>>  #define EEH_PE_PHB  (1 << 1)/* PHB PE*/
>>  #define EEH_PE_DEVICE   (1 << 2)/* Device PE */
>>  #define EEH_PE_BUS  (1 << 3)/* Bus PE*/
>>+#define EEH_PE_VF(1 << 4)/* VF PE */
>>
>>  #define EEH_PE_ISOLATED (1 << 0)/* Isolated PE  
>> */
>>  #define EEH_PE_RECOVERING   (1 << 1)/* Recovering PE*/
>>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
>>index 35f0b62..260a701 100644
>>--- a/arch/powerpc/kernel/eeh_pe.c
>>+++ b/arch/powerpc/kernel/eeh_pe.c
>>@@ -299,7 +299,10 @@ static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev 
>>*edev)
>>   * EEH device already having associated PE, but
>>   * the direct parent EEH device doesn't have yet.
>>   */
>>- pdn = pdn ? pdn->parent : NULL;
>>+ if (edev->physfn)
>>+ pdn = pci_get_pdn(edev->physfn);
>>+ else
>>+ pdn = pdn ? pdn->parent : NULL;
>>  while (pdn) {
>>  /* We're poking out of PCI territory */
>>  parent = pdn_to_eeh_dev(pdn);
>>@@ -382,7 +385,10 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
>>  }
>>
>>  /* Create a new EEH PE */
>>- pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
>>+ if (edev->physfn)
>>+ pe = eeh_pe_alloc(edev->phb, EEH_PE_VF);
>>+ else
>>+ pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
>>  if (!pe) {
>>  pr_err("%s: out of memory!\n", __func__);
>>  return -ENOMEM;
>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
>>b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>index 7cf0df8..cfd55dd 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>@@ -1524,6 +1524,22 @@ static struct eeh_ops pnv_eeh_ops = {
>>  .restore_config = pnv_eeh_restore_config
>>  };
>>
>>+void pcibios_bus_add_device(struct pci_dev *pdev)
>>+{
>>+ struct pci_dn *pdn = pci_get_pdn(pdev);
>>+
>>+ if (!pdev->is_virtfn)
>>+ return;
>>+
>>+ /*
>>+  * The following operations will fail if VF's sysfs files
>>+  * aren't created or its resources aren't finalized.
>>+  */
>>+ eeh_add_device_early(pdn);
>>+ eeh_add_device_late(pdev);
>>+ eeh_sysfs_add_device(pdev);
>>+}
>>+
>>  /**
>>   * eeh_powernv_init - Register platform dependent EEH operations
>>   *
>>
>
>
>-- 
>Alexey

-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 09/12] powerpc/powernv: Support PCI config restore for VFs

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 03:56:12PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>After PE reset, OPAL API opal_pci_reinit() is called on all devices
>>contained in the PE to reinitialize them. However, VFs can't be seen
>>from skiboot firmware. We have to implement functions, similar to
>>those in skiboot firmware, to reinitialize VFs after a reset on the PE
>>for VFs.
>>
>>[gwshan: changelog and code refactoring]
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/pci-bridge.h|  1 +
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 70 
>> +++-
>>  arch/powerpc/platforms/powernv/pci.c | 18 +++
>>  3 files changed, 88 insertions(+), 1 deletion(-)
>>
>>diff --git a/arch/powerpc/include/asm/pci-bridge.h 
>>b/arch/powerpc/include/asm/pci-bridge.h
>>index 3d7e537..e499d93 100644
>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>@@ -219,6 +219,7 @@ struct pci_dn {
>>  #define IODA_INVALID_M64(-1)
>>  int (*m64_map)[PCI_SRIOV_NUM_BARS];
>>  #endif /* CONFIG_PCI_IOV */
>>+ int mps;
>
>int mps; /* maximum payload size */
>?

You are right. I will add this comment in the code.
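
As a side note on what the mps field is used for: the patch turns the payload
size in bytes into the 3-bit Max_Payload_Size field of Device Control with
(ffs(mps) - 8) << 5. The sketch below works through that arithmetic in
userspace. The helper names and the local ffs replacement are made up for this
example; 0x00e0 is the value of PCI_EXP_DEVCTL_PAYLOAD.

```c
#include <assert.h>

#define DEVCTL_PAYLOAD_MASK 0x00e0u	/* same value as PCI_EXP_DEVCTL_PAYLOAD */

/* Portable stand-in for ffs(): 1-based position of the lowest set bit. */
static int lowest_set_bit(unsigned int x)
{
	int pos = 1;

	if (!x)
		return 0;
	while (!(x & 1u)) {
		x >>= 1;
		pos++;
	}
	return pos;
}

/* Encode an MPS given in bytes into Device Control bits 7:5. */
static unsigned int mps_to_devctl_bits(unsigned int mps)
{
	/* MPS must be a power of two between 128 and 4096 bytes. */
	assert(mps >= 128 && mps <= 4096 && (mps & (mps - 1)) == 0);
	return (unsigned int)(lowest_set_bit(mps) - 8) << 5;
}

/* Replace only the payload field, leaving the other bits untouched. */
static unsigned int devctl_set_mps(unsigned int devctl, unsigned int mps)
{
	devctl &= ~DEVCTL_PAYLOAD_MASK;
	devctl |= mps_to_devctl_bits(mps);
	return devctl;
}
```

So 128 bytes encodes as 0, 256 bytes as 0x20, and so on in steps of 0x20.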

>
>
>>  #endif
>>  struct list_head child_list;
>>  struct list_head list;
>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
>>b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>index 017cd72..3cc3e76 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>@@ -1616,6 +1616,67 @@ static int pnv_eeh_next_error(struct eeh_pe **pe)
>>  return ret;
>>  }
>>
>>+static int pnv_eeh_restore_vf_config(struct pci_dn *pdn)
>
>It does not exactly restore it, it just tweaks few bytes.
>
>
>>+{
>>+ struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>+ u32 devctl, cmd, cap2, aer_capctl;
>>+ int old_mps;
>>+
>>+ /* Restore MPS */
>>+ if (edev->pcie_cap) {
>>+ old_mps = (ffs(pdn->mps) - 8) << 5;
>>+ eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
>>+  2, &devctl);
>>+ devctl &= ~PCI_EXP_DEVCTL_PAYLOAD;
>>+ devctl |= old_mps;
>>+ eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
>>+   2, devctl);
>>+ }
>>+
>>+ /* Disable Completion Timeout */
>>+ if (edev->pcie_cap) {
>>+ eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCAP2,
>>+  4, &cap2);
>>+ if (cap2 & 0x10) {
>>+ eeh_ops->read_config(pdn,
>>+ edev->pcie_cap + PCI_EXP_DEVCTL2,
>>+ 4, &cap2);
>>+ cap2 |= 0x10;
>>+ eeh_ops->write_config(pdn,
>>+ edev->pcie_cap + PCI_EXP_DEVCTL2,
>>+ 4, cap2);
>>+ }
>>+ }
>>+
>>+ /* Enable SERR and parity checking */
>>+ eeh_ops->read_config(pdn, PCI_COMMAND, 2, &cmd);
>
>
>No complains from gcc about uninitialized @cmd and others? Cl...
>

No...

>
>>+ cmd |= (PCI_COMMAND_PARITY | PCI_COMMAND_SERR);
>>+ eeh_ops->write_config(pdn, PCI_COMMAND, 2, cmd);
>>+
>>+ /* Enable report various errors */
>>+ if (edev->pcie_cap) {
>>+ eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
>>+ 2, &devctl);
>>+ devctl &= ~PCI_EXP_DEVCTL_CERE;
>>+ devctl |= (PCI_EXP_DEVCTL_NFERE |
>>+PCI_EXP_DEVCTL_FERE |
>>+PCI_EXP_DEVCTL_URRE);
>>+ eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
>>+ 2, devctl);
>>+ }
>>+
>>+ /* Enable ECRC generation and check */
>>+ if (edev->pcie_cap && edev->aer_cap) {
>>+ eeh_ops->read_config(pdn, edev->aer_cap + PCI_ERR_CAP,
>>+ 4, &aer_capctl);
>>+ aer_capctl |= (PCI_ERR_CAP_ECRC_GENE | PCI_ERR_CAP_ECRC_CHKE);
>>+ eeh_ops->write_config(pdn, edev->aer_cap + PCI_ERR_CAP,
>>

Re: [PATCH V10 05/12] powerpc/eeh: Cache only BARs, not windows or IOV BARs

2015-10-30 Thread Wei Yang
On Fri, Oct 30, 2015 at 02:22:43PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>EEH address cache, which helps to locate the PCI device according to
>>the given (physical) MMIO address, didn't cover PCI bridges. Also, it
>>shouldn't return PF
>
>"it shouldn't return" is about the cache, right? eeh_addr_cache_get_dev() -
>this guy can "return", the cache cannot.
>

What I mean is that if we cache the PF's IOV BARs, eeh_addr_cache_get_dev()
would return the PF when the address belongs to a VF.

>>with address in PF's IOV BARs. Instead, the VFs
>>should be returned.
>>
>>Also, by doing so, it removes the type check in
>>eeh_addr_cache_insert_dev(), since bridge's window would not be cached.
>>
>>The patch restricts the address cache to cover first 7 BARs for the
>>above purposes.
>
>
>I'd better understand something like this :)
>
>This restricts the EEH address cache to use only first 7 BARs. This makes
>__eeh_addr_cache_insert_dev() ignore PCI bridge windows and IOV BARs. As the
>result of this change, eeh_addr_cache_get_dev() will return VFs from VF's
>resource addresses instead of parent PFs.
>
>This removes extra check for a PCI bridge as we limit
>__eeh_addr_cache_insert_dev() to 7 BARs and this effectively excludes PCI
>bridges from being cached.
>

Yep, I think this one is clearer. I will use it.

>
>>
>>[gwshan: changelog]
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/kernel/eeh_cache.c | 6 +-
>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>
>>diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
>>index a1e86e1..e6887f0 100644
>>--- a/arch/powerpc/kernel/eeh_cache.c
>>+++ b/arch/powerpc/kernel/eeh_cache.c
>>@@ -196,7 +196,7 @@ static void __eeh_addr_cache_insert_dev(struct pci_dev 
>>*dev)
>>  }
>>
>>  /* Walk resources on this device, poke them into the tree */
>>- for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
>>+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
>>  resource_size_t start = pci_resource_start(dev,i);
>>  resource_size_t end = pci_resource_end(dev,i);
>>  unsigned long flags = pci_resource_flags(dev,i);
>>@@ -222,10 +222,6 @@ void eeh_addr_cache_insert_dev(struct pci_dev *dev)
>>  {
>>  unsigned long flags;
>>
>>- /* Ignore PCI bridges */
>>- if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
>>- return;
>>-
>>  spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
>>  __eeh_addr_cache_insert_dev(dev);
>>  spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
>>
>
>
>-- 
>Alexey

-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 05/12] powerpc/eeh: Cache only BARs, not windows or IOV BARs

2015-10-29 Thread Wei Yang
On Thu, Oct 29, 2015 at 02:29:19PM +1100, Daniel Axtens wrote:
>Wei Yang <weiy...@linux.vnet.ibm.com> writes:
>
>> EEH address cache, which helps to locate the PCI device according to
>> the given (physical) MMIO address, didn't cover PCI bridges. Also, it
>> shouldn't return PF with address in PF's IOV BARs. Instead, the VFs
>> should be returned.
>>
>> Also, by doing so, it removes the type check in
>> eeh_addr_cache_insert_dev(), since bridge's window would not be cached.
>>
>> The patch restricts the address cache to cover first 7 BARs for the
>> above purposes.
>If I've understood the patch correctly, I think you want to swap the
>last two paragraphs in the commit message:
>
>"Restrict the address cache to cover the first 7 BARs...
>
>Since the window of a bridge will now not be cached, remove the type
>check..."
>

Hmm... my intent was for the last paragraph to state what the patch does and
for the second one to mention another change in the log.

Either order is fine with me.

>With regards to the actual patch, I have now got access to the PCI and
>SR-IOV specs, but I'm still getting to grips with it all so let me know
>if something I say doesn't make sense.
>
>Here, you restrict the enumeration of resources to the standard and
>extension ROM resources (the first 7), which excludes enumeration of
>VF resources. That much I understand.
>
>I'm having more trouble convincing myself that it's safe or desirable to
>drop the test for bridges. I think I understand that the change to the
>for loop means it _should_ be safe, but is there any motivation for the
>change other than making the code more straightforward?
>

The motivation is just to make the code more straightforward.

For a bridge device, the first 7 resources are not used and the remaining ones
are no longer cached. This is why I removed the check in the patch.
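
To make the index ranges concrete, here is a small userspace sketch of which
resource slots the new loop bound covers. The enum is a simplified mirror of
the kernel's resource index layout, written from memory and assuming
CONFIG_PCI_IOV is enabled; the names without the PCI_ prefix and both helpers
are made up for this example.

```c
#include <assert.h>

/* Simplified, assumed mirror of the per-device resource index layout. */
enum {
	STD_RESOURCES = 0,			/* BAR0 */
	STD_RESOURCE_END = 5,			/* BAR5 */
	ROM_RESOURCE,				/* 6: expansion ROM */
	IOV_RESOURCES,				/* 7: first IOV BAR */
	IOV_RESOURCE_END = IOV_RESOURCES + 6 - 1,
	BRIDGE_RESOURCES,			/* 13: first bridge window */
	BRIDGE_RESOURCE_END = BRIDGE_RESOURCES + 4 - 1,
	RESOURCE_COUNT				/* plays the role of DEVICE_COUNT_RESOURCE */
};

/* Same bound as the patched loop: for (i = 0; i <= PCI_ROM_RESOURCE; i++) */
static int is_cached(int i)
{
	return i >= 0 && i <= ROM_RESOURCE;
}

/* Count how many slots of the full range the patched loop still visits. */
static int count_cached(void)
{
	int i, n = 0;

	for (i = 0; i < RESOURCE_COUNT; i++)
		if (is_cached(i))
			n++;
	return n;
}
```

With this layout the loop visits 7 slots (six BARs plus the ROM) and never
reaches the IOV BARs or the bridge windows.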

>>  /* Walk resources on this device, poke them into the tree */
>This comment probably needs to be made more descriptive given the change.

Right, will change it.

>> -for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
>> +for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
>>  resource_size_t start = pci_resource_start(dev,i);
>>  resource_size_t end = pci_resource_end(dev,i);
>>  unsigned long flags = pci_resource_flags(dev,i);
>> @@ -222,10 +222,6 @@ void eeh_addr_cache_insert_dev(struct pci_dev *dev)
>>  {
>
>Regards,
>Daniel
>
>>  unsigned long flags;
>>  
>> -/* Ignore PCI bridges */
>> -if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
>> -return;
>> -
>>  spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
>>  __eeh_addr_cache_insert_dev(dev);
>>  spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
>> -- 
>> 2.5.0
>>



-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 03/12] powerpc/pci: Cache VF index in pci_dn

2015-10-29 Thread Wei Yang
On Fri, Oct 30, 2015 at 01:05:43PM +1100, Alexey Kardashevskiy wrote:
>On 10/26/2015 02:15 PM, Wei Yang wrote:
>>The patch caches the VF index in pci_dn, which can be used to calculate
>>VF's bus, device and function number. Those information helps to locate
>>the VF's PCI device instance when doing hotplug during EEH recovery if
>>necessary.
>
>
>The patch itself does not make much sense and quite small, I'd merge it into
>the one which makes use of this new vf_index.
>

Well, reasonable, will merge it.
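
For reference, this is roughly how a VF index maps to a bus and devfn under the
SR-IOV routing ID rule (VF RID = PF RID + First VF Offset + index * VF Stride).
It is a userspace sketch: struct mock_pf and the vf_* helpers are hypothetical
stand-ins, and the kernel's own helpers for this are pci_iov_virtfn_bus() and
pci_iov_virtfn_devfn().

```c
#include <assert.h>

/* Hypothetical stand-in for the PF fields that the calculation needs. */
struct mock_pf {
	unsigned int bus;	/* PF bus number */
	unsigned int devfn;	/* PF device/function byte */
	unsigned int vf_offset;	/* First VF Offset from the SR-IOV capability */
	unsigned int vf_stride;	/* VF Stride from the SR-IOV capability */
};

/* Routing ID of VF number vf_index, relative to the PF's bus. */
static unsigned int vf_rid(const struct mock_pf *pf, unsigned int vf_index)
{
	return pf->devfn + pf->vf_offset + pf->vf_stride * vf_index;
}

/* Bus number: a carry out of the low 8 bits moves the VF to the next bus. */
static unsigned int vf_bus(const struct mock_pf *pf, unsigned int vf_index)
{
	return pf->bus + (vf_rid(pf, vf_index) >> 8);
}

/* Device/function byte: the low 8 bits of the routing ID. */
static unsigned int vf_devfn(const struct mock_pf *pf, unsigned int vf_index)
{
	return vf_rid(pf, vf_index) & 0xff;
}
```

This is why caching the index in pci_dn is enough to find the VF again later.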

>>
>>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>>Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/pci-bridge.h | 1 +
>>  arch/powerpc/kernel/pci_dn.c  | 4 +++-
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>>diff --git a/arch/powerpc/include/asm/pci-bridge.h 
>>b/arch/powerpc/include/asm/pci-bridge.h
>>index b3a226b..3d7e537 100644
>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>@@ -210,6 +210,7 @@ struct pci_dn {
>>  #define IODA_INVALID_PE (-1)
>>  #ifdef CONFIG_PPC_POWERNV
>>  int pe_number;
>>+ int vf_index;   /* VF index in the PF */
>>  #ifdef CONFIG_PCI_IOV
>>  u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
>>  u16 num_vfs;/* number of VFs enabled*/
>>diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
>>index b3b4df9..f771130 100644
>>--- a/arch/powerpc/kernel/pci_dn.c
>>+++ b/arch/powerpc/kernel/pci_dn.c
>>@@ -139,6 +139,7 @@ struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
>>  #ifdef CONFIG_PCI_IOV
>>  static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
>> struct pci_dev *pdev,
>>+int vf_index,
>> int busno, int devfn)
>>  {
>>  struct pci_dn *pdn;
>>@@ -157,6 +158,7 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
>>*parent,
>>  pdn->parent = parent;
>>  pdn->busno = busno;
>>  pdn->devfn = devfn;
>>+ pdn->vf_index = vf_index;
>>  #ifdef CONFIG_PPC_POWERNV
>>  pdn->pe_number = IODA_INVALID_PE;
>>  #endif
>>@@ -196,7 +198,7 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
>>  return NULL;
>>
>>  for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>>- pdn = add_one_dev_pci_data(parent, NULL,
>>+ pdn = add_one_dev_pci_data(parent, NULL, i,
>> pci_iov_virtfn_bus(pdev, i),
>> pci_iov_virtfn_devfn(pdev, i));
>>  if (!pdn) {
>>
>
>
>-- 
>Alexey

-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 03/12] powerpc/pci: Cache VF index in pci_dn

2015-10-27 Thread Wei Yang
On Wed, Oct 28, 2015 at 09:04:34AM +1100, Daniel Axtens wrote:
>Hi,
>
>>
>> diff --git a/arch/powerpc/include/asm/pci-bridge.h 
>> b/arch/powerpc/include/asm/pci-bridge.h
>> index b3a226b..3d7e537 100644
>> --- a/arch/powerpc/include/asm/pci-bridge.h
>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>> @@ -210,6 +210,7 @@ struct pci_dn {
>>  #define IODA_INVALID_PE (-1)
>>  #ifdef CONFIG_PPC_POWERNV
>>  int pe_number;
>> +int vf_index;   /* VF index in the PF */
>
>Here, vf_index is inside CONFIG_PPC_POWERNV...
>
>>  #ifdef CONFIG_PCI_IOV
>>  u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
>>  u16 num_vfs;/* number of VFs enabled*/
>> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
>> index b3b4df9..f771130 100644
>> --- a/arch/powerpc/kernel/pci_dn.c
>> +++ b/arch/powerpc/kernel/pci_dn.c
>> @@ -139,6 +139,7 @@ struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
>>  #ifdef CONFIG_PCI_IOV
>>  static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
>> struct pci_dev *pdev,
>> +   int vf_index,
>> int busno, int devfn)
>>  {
>>  struct pci_dn *pdn;
>> @@ -157,6 +158,7 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
>> *parent,
>>  pdn->parent = parent;
>>  pdn->busno = busno;
>>  pdn->devfn = devfn;
>> +pdn->vf_index = vf_index;
>>  #ifdef CONFIG_PPC_POWERNV
>>  pdn->pe_number = IODA_INVALID_PE;
>... but here, vf_index is outside CONFIG_PPC_POWERNV.
>

Hey, Daniel

Glad to see your comment. You are right; to be consistent, this should be put
inside CONFIG_PPC_POWERNV. I will change it in the next version.

>Otherwise, the patch looks fine to me.
>
>I'm still trying to get my head around SR-IOV generally - once I do I
>will add any more comments I have or add a reviewed-by.
>
>Regards,
>Daniel



-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 01/12] PCI/IOV: Rename and export virtfn_add/virtfn_remove

2015-10-27 Thread Wei Yang
On Tue, Oct 27, 2015 at 06:06:54PM -0500, Bjorn Helgaas wrote:
>On Mon, Oct 26, 2015 at 11:15:51AM +0800, Wei Yang wrote:
>> During EEH recovery, hotplug is applied to the devices which don't
>> have drivers or their drivers don't support EEH. However, the hotplug,
>> which was implemented based on PCI bus, can't be applied to VF directly.
>> 
>> The patch renames virtn_{add,remove}() and exports them so that they
>> can be used in PCI hotplug during EEH recovery.
>
>Trivial, but write this as an imperative sentence, e.g.,
>
>  Rename virtn_{add,remove}() and export them so they
>  can be used in PCI hotplug during EEH recovery.
>
>"The patch" doesn't add any useful information; it's obvious that the
>changelog applied to this patch.

Yep, thanks, will change it in the next version.

>
>This comment also applies to at least the next patch.
>
>Bjorn

-- 
Richard Yang
Help you, Help me


Re: [PATCH V10 00/12] VF EEH on Power8

2015-10-27 Thread Wei Yang
On Tue, Oct 27, 2015 at 06:11:13PM -0500, Bjorn Helgaas wrote:
>On Mon, Oct 26, 2015 at 11:15:50AM +0800, Wei Yang wrote:
>> This patchset enables EEH on SRIOV VFs. The general idea is to create proper
>> VF edev and VF PE and handle them properly.
>> ...
>
>> Gavin Shan (1):
>>   powerpc/eeh: Don't block PCI config on resetting VF PE
>> 
>> Wei Yang (11):
>>   PCI/IOV: Rename and export virtfn_add/virtfn_remove
>>   PCI: Add pcibios_bus_add_device() weak function
>>   powerpc/pci: Cache VF index in pci_dn
>>   powerpc/pci: Remove VFs prior to PF
>>   powerpc/eeh: Cache only BARs, not windows or IOV BARs
>>   powerpc/powernv: EEH device for VF
>>   powerpc/eeh: Create PE for VFs
>>   powerpc/powernv: Support EEH reset for VF PE
>>   powerpc/powernv: Support PCI config restore for VFs
>>   powerpc/eeh: Support error recovery for VF PE
>>   powerpc/eeh: Handle hot removed VF when PF is EEH aware
>> 
>>  arch/powerpc/include/asm/eeh.h   |  10 ++
>>  arch/powerpc/include/asm/pci-bridge.h|   2 +
>>  arch/powerpc/kernel/eeh.c|  17 ++-
>>  arch/powerpc/kernel/eeh_cache.c  |   6 +-
>>  arch/powerpc/kernel/eeh_dev.c|   1 +
>>  arch/powerpc/kernel/eeh_driver.c | 130 
>>  arch/powerpc/kernel/eeh_pe.c |  13 +-
>>  arch/powerpc/kernel/pci-hotplug.c|   2 +-
>>  arch/powerpc/kernel/pci_dn.c |  16 +-
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 220 
>> ++-
>>  arch/powerpc/platforms/powernv/pci.c |  18 +++
>>  drivers/pci/bus.c|   3 +
>>  drivers/pci/iov.c|  10 +-
>>  include/linux/pci.h  |   8 +
>>  14 files changed, 408 insertions(+), 48 deletions(-)
>
>This really only affects powerpc, so I assume this series will go through
>the powerpc tree.  Let me know if you want me to do anything else.
>

Yep, as we discussed, this will be merged through the powerpc tree.

Have a good day :-)

>Bjorn

-- 
Richard Yang
Help you, Help me


[PATCH V10 09/12] powerpc/powernv: Support PCI config restore for VFs

2015-10-25 Thread Wei Yang
After PE reset, OPAL API opal_pci_reinit() is called on all devices
contained in the PE to reinitialize them. However, VFs can't be seen
from skiboot firmware. We have to implement functions, similar to
those in skiboot firmware, to reinitialize VFs after a reset on the PE
for VFs.

[gwshan: changelog and code refactoring]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h|  1 +
 arch/powerpc/platforms/powernv/eeh-powernv.c | 70 +++-
 arch/powerpc/platforms/powernv/pci.c | 18 +++
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 3d7e537..e499d93 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -219,6 +219,7 @@ struct pci_dn {
 #define IODA_INVALID_M64(-1)
int (*m64_map)[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
+   int mps;
 #endif
struct list_head child_list;
struct list_head list;
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 017cd72..3cc3e76 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1616,6 +1616,67 @@ static int pnv_eeh_next_error(struct eeh_pe **pe)
return ret;
 }
 
+static int pnv_eeh_restore_vf_config(struct pci_dn *pdn)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   u32 devctl, cmd, cap2, aer_capctl;
+   int old_mps;
+
+   /* Restore MPS */
+   if (edev->pcie_cap) {
+   old_mps = (ffs(pdn->mps) - 8) << 5;
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+2, &devctl);
+   devctl &= ~PCI_EXP_DEVCTL_PAYLOAD;
+   devctl |= old_mps;
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+ 2, devctl);
+   }
+
+   /* Disable Completion Timeout */
+   if (edev->pcie_cap) {
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCAP2,
+4, &cap2);
+   if (cap2 & 0x10) {
+   eeh_ops->read_config(pdn,
+   edev->pcie_cap + PCI_EXP_DEVCTL2,
+   4, &cap2);
+   cap2 |= 0x10;
+   eeh_ops->write_config(pdn,
+   edev->pcie_cap + PCI_EXP_DEVCTL2,
+   4, cap2);
+   }
+   }
+
+   /* Enable SERR and parity checking */
+   eeh_ops->read_config(pdn, PCI_COMMAND, 2, &cmd);
+   cmd |= (PCI_COMMAND_PARITY | PCI_COMMAND_SERR);
+   eeh_ops->write_config(pdn, PCI_COMMAND, 2, cmd);
+
+   /* Enable report various errors */
+   if (edev->pcie_cap) {
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+   2, &devctl);
+   devctl &= ~PCI_EXP_DEVCTL_CERE;
+   devctl |= (PCI_EXP_DEVCTL_NFERE |
+  PCI_EXP_DEVCTL_FERE |
+  PCI_EXP_DEVCTL_URRE);
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+   2, devctl);
+   }
+
+   /* Enable ECRC generation and check */
+   if (edev->pcie_cap && edev->aer_cap) {
+   eeh_ops->read_config(pdn, edev->aer_cap + PCI_ERR_CAP,
+   4, &aer_capctl);
+   aer_capctl |= (PCI_ERR_CAP_ECRC_GENE | PCI_ERR_CAP_ECRC_CHKE);
+   eeh_ops->write_config(pdn, edev->aer_cap + PCI_ERR_CAP,
+   4, aer_capctl);
+   }
+
+   return 0;
+}
+
 static int pnv_eeh_restore_config(struct pci_dn *pdn)
 {
struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
@@ -1626,7 +1687,14 @@ static int pnv_eeh_restore_config(struct pci_dn *pdn)
return -EEXIST;
 
phb = edev->phb->private_data;
-   ret = opal_pci_reinit(phb->opal_id,
+   /*
+* We have to restore the PCI config space after reset since the
+* firmware can't see SRIOV VFs.
+*/
+   if (edev->physfn)
+   ret = pnv_eeh_restore_vf_config(pdn);
+   else
+   ret = opal_pci_reinit(phb->opal_id,
  OPAL_REINIT_PCI_DEV, edev->config_addr);
if (ret) {
pr_warn("%s: Can't reinit PCI dev 0x%x (%lld)\n",
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index 765d8ed..0e4f42e 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@

[PATCH V10 10/12] powerpc/eeh: Support error recovery for VF PE

2015-10-25 Thread Wei Yang
Unlike a PCI-bus-dependent PE, a PE for VFs doesn't have a primary
bus, on which PCI hotplug is implemented. The patch supports error
recovery, especially PCI hotplug, for a VF's PE. The hotplug on a
VF's PE is implemented based on VFs instead of the PCI bus.

[gwshan: changelog and code refactoring]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |   1 +
 arch/powerpc/kernel/eeh.c|   8 
 arch/powerpc/kernel/eeh_driver.c | 100 +++
 arch/powerpc/kernel/eeh_pe.c |   3 +-
 4 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 331c856..ea1f13c4 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -142,6 +142,7 @@ struct eeh_dev {
struct pci_controller *phb; /* Associated PHB   */
struct pci_dn *pdn; /* Associated PCI device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
+   int in_error; /* Error flag for eeh_dev   */
struct pci_dev *physfn; /* Associated PF PORT   */
struct pci_bus *bus;/* PCI bus for partial hotplug  */
 };
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index af9b597..28e4d73 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1227,6 +1227,14 @@ void eeh_remove_device(struct pci_dev *dev)
 * from the parent PE during the BAR resotre.
 */
edev->pdev = NULL;
+
+   /*
+* The flag "in_error" is used to trace EEH devices for VFs
+* in error state or not. It's set in eeh_report_error(). If
+* it's not set, eeh_report_{reset,resume}() won't be called
+* for the VF EEH device.
+*/
+   edev->in_error = 0;
dev->dev.archdata.edev = NULL;
if (!(edev->pe->state & EEH_PE_KEEP))
eeh_rmv_from_parent_pe(edev);
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 89eb4bc..99868e2 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -211,6 +211,7 @@ static void *eeh_report_error(void *data, void *userdata)
if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
if (*res == PCI_ERS_RESULT_NONE) *res = rc;
 
+   edev->in_error = 1;
eeh_pcid_put(dev);
return NULL;
 }
@@ -282,7 +283,8 @@ static void *eeh_report_reset(void *data, void *userdata)
 
if (!driver->err_handler ||
!driver->err_handler->slot_reset ||
-   (edev->mode & EEH_DEV_NO_HANDLER)) {
+   (edev->mode & EEH_DEV_NO_HANDLER) ||
+   (!edev->in_error)) {
eeh_pcid_put(dev);
return NULL;
}
@@ -339,14 +341,16 @@ static void *eeh_report_resume(void *data, void *userdata)
 
if (!driver->err_handler ||
!driver->err_handler->resume ||
-   (edev->mode & EEH_DEV_NO_HANDLER)) {
+   (edev->mode & EEH_DEV_NO_HANDLER) ||
+   (!edev->in_error)) {
edev->mode &= ~EEH_DEV_NO_HANDLER;
-   eeh_pcid_put(dev);
-   return NULL;
+   goto out;
}
 
driver->err_handler->resume(dev);
 
+out:
+   edev->in_error = 0;
eeh_pcid_put(dev);
return NULL;
 }
@@ -386,12 +390,38 @@ static void *eeh_report_failure(void *data, void 
*userdata)
return NULL;
 }
 
+static void *eeh_add_virt_device(void *data, void *userdata)
+{
+   struct pci_driver *driver;
+   struct eeh_dev *edev = (struct eeh_dev *)data;
+   struct pci_dev *dev = eeh_dev_to_pci_dev(edev);
+   struct pci_dn *pdn = eeh_dev_to_pdn(edev);
+
+   if (!(edev->physfn)) {
+   pr_warn("%s: EEH dev %04x:%02x:%02x.%01x not for VF\n",
+   __func__, edev->phb->global_number, pdn->busno,
+   PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
+   return NULL;
+   }
+
+   driver = eeh_pcid_get(dev);
+   if (driver) {
+   eeh_pcid_put(dev);
+   if (driver->err_handler)
+   return NULL;
+   }
+
+   pci_iov_virtfn_add(edev->physfn, pdn->vf_index, 0);
+   return NULL;
+}
+
 static void *eeh_rmv_device(void *data, void *userdata)
 {
struct pci_driver *driver;
struct eeh_dev *edev = (struct eeh_dev *)data;
struct pci_dev *dev = eeh_dev_to_pci_dev(edev);
int *removed = (int *)userdata;
+   struct pci_dn *pdn = eeh_dev_to_pdn(edev);
 
/*
 * Actually, we should remove the PCI bridges as well.
@@ -416,7 +446,7 

[PATCH V10 08/12] powerpc/powernv: Support EEH reset for VF PE

2015-10-25 Thread Wei Yang
PEs for VFs don't have a primary bus, so they need their own reset
backend, which is used during EEH recovery. The patch implements the reset
backend for a VF's PE by issuing FLR or AF FLR to the VFs contained
in the PE.
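
Before issuing the FLR, the diff below polls the Transaction Pending bit
with a doubling delay (100, 200, 400, 800 ms). A minimal sketch of that
polling pattern follows; the callback types, struct fake_dev and
fake_read() are hypothetical stand-ins for the config-space read, and the
-1 return on timeout is a choice made for this sketch (the patch only
prints a warning).

```c
#include <assert.h>

/* Hypothetical callback standing in for a config-space status read. */
typedef unsigned int (*read_status_fn)(void *ctx);

/*
 * Poll until (status & mask) clears, asking for 100, 200, 400, 800 ms
 * of sleep between attempts. Returns the total delay requested in ms,
 * or -1 if the bit is still set after four attempts.
 */
static int wait_for_pending(read_status_fn read_status, void *ctx,
			    unsigned int mask, void (*sleep_ms)(int))
{
	int waited = 0;

	assert(read_status);
	for (int i = 0; i < 4; i++) {
		if (!(read_status(ctx) & mask))
			return waited;

		int delay = (1 << i) * 100;	/* exponential backoff */
		if (sleep_ms)
			sleep_ms(delay);
		waited += delay;
	}
	return -1;
}

/* Fake device whose pending bit (0x20) clears after a number of reads. */
struct fake_dev { int reads_until_clear; };

static unsigned int fake_read(void *ctx)
{
	struct fake_dev *d = ctx;

	if (d->reads_until_clear > 0) {
		d->reads_until_clear--;
		return 0x20;
	}
	return 0;
}
```

Passing NULL for sleep_ms skips the actual sleep, which keeps the sketch
portable and quick to exercise.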

[gwshan: changelog and code refactoring]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |   1 +
 arch/powerpc/platforms/powernv/eeh-powernv.c | 134 ++-
 2 files changed, 134 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index ec21f8f..331c856 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -136,6 +136,7 @@ struct eeh_dev {
int pcix_cap;   /* Saved PCIx capability*/
int pcie_cap;   /* Saved PCIe capability*/
int aer_cap;/* Saved AER capability */
+   int af_cap; /* Saved AF capability  */
struct eeh_pe *pe;  /* Associated PE*/
struct list_head list;  /* Form link list in the PE */
struct pci_controller *phb; /* Associated PHB   */
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index cfd55dd..017cd72 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -404,6 +404,7 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
edev->pcix_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_PCIX);
edev->pcie_cap = pnv_eeh_find_cap(pdn, PCI_CAP_ID_EXP);
edev->aer_cap  = pnv_eeh_find_ecap(pdn, PCI_EXT_CAP_ID_ERR);
+   edev->af_cap   = pnv_eeh_find_cap(pdn, PCI_CAP_ID_AF);
if ((edev->class_code >> 8) == PCI_CLASS_BRIDGE_PCI) {
edev->mode |= EEH_DEV_BRIDGE;
if (edev->pcie_cap) {
@@ -893,6 +894,127 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int 
option)
return 0;
 }
 
+static void pnv_eeh_wait_for_pending(struct pci_dn *pdn, int pos,
+u16 mask, bool af_flr_rst)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   int status, i;
+
+   /* Wait for Transaction Pending bit to be cleared */
+   for (i = 0; i < 4; i++) {
+   eeh_ops->read_config(pdn, pos, 2, &status);
+   if (!(status & mask))
+   return;
+
+   msleep((1 << i) * 100);
+   }
+
+   pr_warn("%s: Pending transaction while issuing %s FLR to "
+   "%04x:%02x:%02x.%01x\n",
+   __func__, af_flr_rst ? "AF" : "",
+   edev->phb->global_number, pdn->busno,
+   PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
+}
+
+static int pnv_eeh_do_flr(struct pci_dn *pdn, int option)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   u32 reg;
+
+   if (!edev->pcie_cap)
+   return -ENOTTY;
+
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCAP, 4, &reg);
+   if (!(reg & PCI_EXP_DEVCAP_FLR))
+   return -ENOTTY;
+
+   switch (option) {
+   case EEH_RESET_HOT:
+   case EEH_RESET_FUNDAMENTAL:
+   pnv_eeh_wait_for_pending(pdn, edev->pcie_cap + PCI_EXP_DEVSTA,
+PCI_EXP_DEVSTA_TRPND, false);
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+4, &reg);
+   reg |= PCI_EXP_DEVCTL_BCR_FLR;
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+ 4, reg);
+   msleep(EEH_PE_RST_HOLD_TIME);
+   break;
+   case EEH_RESET_DEACTIVATE:
+   eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+4, &reg);
+   reg &= ~PCI_EXP_DEVCTL_BCR_FLR;
+   eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
+ 4, reg);
+   msleep(EEH_PE_RST_SETTLE_TIME);
+   break;
+   }
+
+   return 0;
+}
+
+static int pnv_eeh_do_af_flr(struct pci_dn *pdn, int option)
+{
+   struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+   u32 cap;
+
+   if (!edev->af_cap)
+   return -ENOTTY;
+
+   eeh_ops->read_config(pdn, edev->af_cap + PCI_AF_CAP, 1, &cap);
+   if (!(cap & PCI_AF_CAP_TP) || !(cap & PCI_AF_CAP_FLR))
+   return -ENOTTY;
+
+   switch (option) {
+   case EEH_RESET_HOT:
+   case EEH_RESET_FUNDAMENTAL:
+   /*
+* Wait for Transact

[PATCH V10 07/12] powerpc/eeh: Create PE for VFs

2015-10-25 Thread Wei Yang
The current EEH recovery code assumes that the PE has a primary
bus. Unfortunately, that's not true for VF PEs, which generally contain
one or multiple VFs (for the VF group case).

The patch creates PEs for VFs in the weak function
pcibios_bus_add_device(). Those PEs for VFs are identified with the newly
introduced flag EEH_PE_VF so that we handle them differently during EEH
recovery.
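
The type selection described above can be sketched in isolation as
follows. The flag values are copied from the diff below; the *_stub
structures and both helper functions are hypothetical simplifications,
not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* PE type flags as listed in the patch. */
#define EEH_PE_PHB	(1 << 1)
#define EEH_PE_DEVICE	(1 << 2)
#define EEH_PE_BUS	(1 << 3)
#define EEH_PE_VF	(1 << 4)

/* Stripped-down stand-ins; the real structures have many more fields. */
struct pci_dev_stub { int id; };
struct eeh_dev_stub {
	struct pci_dev_stub *physfn;	/* NULL unless this device is a VF */
};

/* A device with an associated PF gets a VF PE, anything else a device PE. */
static int pe_type_for(const struct eeh_dev_stub *edev)
{
	assert(edev != NULL);
	return edev->physfn ? EEH_PE_VF : EEH_PE_DEVICE;
}

/* Recovery code can then branch on the type bit. */
static int pe_is_vf(int pe_type)
{
	return (pe_type & EEH_PE_VF) != 0;
}
```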

[gwshan: changelog and code refactoring]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |  1 +
 arch/powerpc/kernel/eeh_pe.c | 10 --
 arch/powerpc/platforms/powernv/eeh-powernv.c | 16 
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 6c383ad..ec21f8f 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -72,6 +72,7 @@ struct pci_dn;
 #define EEH_PE_PHB (1 << 1)/* PHB PE*/
 #define EEH_PE_DEVICE  (1 << 2)/* Device PE */
 #define EEH_PE_BUS (1 << 3)/* Bus PE*/
+#define EEH_PE_VF  (1 << 4)/* VF PE */
 
 #define EEH_PE_ISOLATED (1 << 0)    /* Isolated PE  */
 #define EEH_PE_RECOVERING  (1 << 1)/* Recovering PE*/
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 35f0b62..260a701 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -299,7 +299,10 @@ static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev 
*edev)
 * EEH device already having associated PE, but
 * the direct parent EEH device doesn't have yet.
 */
-   pdn = pdn ? pdn->parent : NULL;
+   if (edev->physfn)
+   pdn = pci_get_pdn(edev->physfn);
+   else
+   pdn = pdn ? pdn->parent : NULL;
while (pdn) {
/* We're poking out of PCI territory */
parent = pdn_to_eeh_dev(pdn);
@@ -382,7 +385,10 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
}
 
/* Create a new EEH PE */
-   pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
+   if (edev->physfn)
+   pe = eeh_pe_alloc(edev->phb, EEH_PE_VF);
+   else
+   pe = eeh_pe_alloc(edev->phb, EEH_PE_DEVICE);
if (!pe) {
pr_err("%s: out of memory!\n", __func__);
return -ENOMEM;
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 7cf0df8..cfd55dd 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1524,6 +1524,22 @@ static struct eeh_ops pnv_eeh_ops = {
.restore_config = pnv_eeh_restore_config
 };
 
+void pcibios_bus_add_device(struct pci_dev *pdev)
+{
+   struct pci_dn *pdn = pci_get_pdn(pdev);
+
+   if (!pdev->is_virtfn)
+   return;
+
+   /*
+* The following operations will fail if VF's sysfs files
+* aren't created or its resources aren't finalized.
+*/
+   eeh_add_device_early(pdn);
+   eeh_add_device_late(pdev);
+   eeh_sysfs_add_device(pdev);
+}
+
 /**
  * eeh_powernv_init - Register platform dependent EEH operations
  *
-- 
2.5.0


[PATCH V10 05/12] powerpc/eeh: Cache only BARs, not windows or IOV BARs

2015-10-25 Thread Wei Yang
The EEH address cache, which helps to locate the PCI device according to
the given (physical) MMIO address, didn't cover PCI bridges. Also, it
shouldn't return the PF for an address in the PF's IOV BARs. Instead, the
VFs should be returned.

By doing so, it also removes the type check in
eeh_addr_cache_insert_dev(), since a bridge's windows would not be cached.

The patch restricts the address cache to cover the first 7 BARs for the
above purposes.
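
To see why the new loop bound covers exactly seven resources, here is a
small sketch. The enum is a simplified mirror of the kernel's PCI resource
indices and its layout is an assumption (six standard BARs, the expansion
ROM, then six IOV BARs, then bridge windows); both helper functions are
hypothetical.

```c
#include <assert.h>

/* Simplified mirror of the PCI resource indices (assumed layout). */
enum {
	STD_RESOURCES = 0,		/* BAR0 */
	STD_RESOURCE_END = 5,		/* BAR5 */
	ROM_RESOURCE,			/* 6: expansion ROM */
	IOV_RESOURCES,			/* 7: first SR-IOV BAR of a PF */
	IOV_RESOURCE_END = IOV_RESOURCES + 5,
	BRIDGE_RESOURCES,		/* 13: first bridge window */
};

/* Same bound as the patched loop: i <= PCI_ROM_RESOURCE. */
static int resource_is_cached(int idx)
{
	assert(idx >= 0);
	return idx <= ROM_RESOURCE;
}

/* Count how many of the first 'total' resources would be cached. */
static int cached_resource_count(int total)
{
	int n = 0;

	for (int i = 0; i < total; i++)
		if (resource_is_cached(i))
			n++;
	return n;
}
```

With this bound, IOV BARs and bridge windows fall outside the cached
range, which is why the explicit bridge check becomes unnecessary.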

[gwshan: changelog]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh_cache.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
index a1e86e1..e6887f0 100644
--- a/arch/powerpc/kernel/eeh_cache.c
+++ b/arch/powerpc/kernel/eeh_cache.c
@@ -196,7 +196,7 @@ static void __eeh_addr_cache_insert_dev(struct pci_dev *dev)
}
 
/* Walk resources on this device, poke them into the tree */
-   for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
+   for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
resource_size_t start = pci_resource_start(dev,i);
resource_size_t end = pci_resource_end(dev,i);
unsigned long flags = pci_resource_flags(dev,i);
@@ -222,10 +222,6 @@ void eeh_addr_cache_insert_dev(struct pci_dev *dev)
 {
unsigned long flags;
 
-   /* Ignore PCI bridges */
-   if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
-   return;
-
spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
__eeh_addr_cache_insert_dev(dev);
spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
-- 
2.5.0


[PATCH V10 03/12] powerpc/pci: Cache VF index in pci_dn

2015-10-25 Thread Wei Yang
The patch caches the VF index in pci_dn, which can be used to calculate
the VF's bus, device and function number. That information helps to locate
the VF's PCI device instance when doing hotplug during EEH recovery if
necessary.
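
As an illustration of how a VF index turns into a bus/device/function
triple, the sketch below applies the usual SR-IOV routing ID arithmetic
(PF devfn + First VF Offset + VF Stride * index). struct pf_stub and all
function names are hypothetical; in the kernel the equivalent values come
from pci_iov_virtfn_bus() and pci_iov_virtfn_devfn().

```c
#include <assert.h>

/* Hypothetical summary of the PF fields that matter for VF addressing. */
struct pf_stub {
	int bus;	/* PF bus number */
	int devfn;	/* PF device/function, 8 bits */
	int offset;	/* First VF Offset from the SR-IOV capability */
	int stride;	/* VF Stride from the SR-IOV capability */
};

/* Routing ID of the VF relative to the PF's bus. */
static int vf_routing_id(const struct pf_stub *pf, int vf_index)
{
	assert(pf && vf_index >= 0);
	return pf->devfn + pf->offset + pf->stride * vf_index;
}

/* Anything above 8 bits spills over into the bus number. */
static int vf_bus(const struct pf_stub *pf, int vf_index)
{
	return pf->bus + (vf_routing_id(pf, vf_index) >> 8);
}

static int vf_devfn(const struct pf_stub *pf, int vf_index)
{
	return vf_routing_id(pf, vf_index) & 0xff;
}

/* devfn packs a 5-bit device (slot) number and a 3-bit function number. */
static int devfn_slot(int devfn) { return (devfn >> 3) & 0x1f; }
static int devfn_func(int devfn) { return devfn & 0x07; }
```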

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h | 1 +
 arch/powerpc/kernel/pci_dn.c  | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index b3a226b..3d7e537 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -210,6 +210,7 @@ struct pci_dn {
 #define IODA_INVALID_PE (-1)
 #ifdef CONFIG_PPC_POWERNV
int pe_number;
+   int vf_index;   /* VF index in the PF */
 #ifdef CONFIG_PCI_IOV
u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
u16 num_vfs;/* number of VFs enabled*/
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index b3b4df9..f771130 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -139,6 +139,7 @@ struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
 #ifdef CONFIG_PCI_IOV
 static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
   struct pci_dev *pdev,
+  int vf_index,
   int busno, int devfn)
 {
struct pci_dn *pdn;
@@ -157,6 +158,7 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
*parent,
pdn->parent = parent;
pdn->busno = busno;
pdn->devfn = devfn;
+   pdn->vf_index = vf_index;
 #ifdef CONFIG_PPC_POWERNV
pdn->pe_number = IODA_INVALID_PE;
 #endif
@@ -196,7 +198,7 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
return NULL;
 
for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
-   pdn = add_one_dev_pci_data(parent, NULL,
+   pdn = add_one_dev_pci_data(parent, NULL, i,
   pci_iov_virtfn_bus(pdev, i),
   pci_iov_virtfn_devfn(pdev, i));
if (!pdn) {
-- 
2.5.0


[PATCH V10 02/12] PCI: Add pcibios_bus_add_device() weak function

2015-10-25 Thread Wei Yang
This patch adds a weak function pcibios_bus_add_device() so that
arch-dependent code can do proper setup. For example, powerpc could set up
EEH-related resources.
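
The weak-hook pattern can be sketched outside the kernel as below. This
is a user-space approximation that assumes a GCC/Clang-style
__attribute__((weak)); all names are hypothetical. A strong definition of
arch_bus_add_device() in another object file would replace the default at
link time.

```c
#include <assert.h>

static int generic_steps;	/* counts generic work, for illustration */

/* Default no-op hook; an arch may provide a strong definition instead. */
__attribute__((weak)) void arch_bus_add_device(int dev_id)
{
	(void)dev_id;
}

/* Generic path: call the arch hook first, then do the common work. */
void bus_add_device(int dev_id)
{
	assert(dev_id >= 0);
	arch_bus_add_device(dev_id);
	generic_steps++;
}

int generic_steps_done(void)
{
	return generic_steps;
}
```

The generic code never needs to know whether an override exists, which is
what keeps the change to drivers/pci/bus.c so small.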

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/bus.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 6fbd3f2..b7e30a7 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -267,6 +267,7 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx)
 
 void __weak pcibios_resource_survey_bus(struct pci_bus *bus) { }
 
+void __weak pcibios_bus_add_device(struct pci_dev *dev) { }
 /**
  * pci_bus_add_device - start driver for a single device
  * @dev: device to add
@@ -277,6 +278,8 @@ void pci_bus_add_device(struct pci_dev *dev)
 {
int retval;
 
+   pcibios_bus_add_device(dev);
+
/*
 * Can not put in pci_device_add yet because resources
 * are not assigned yet for some devices.
-- 
2.5.0


[PATCH V10 00/12] VF EEH on Power8

2015-10-25 Thread Wei Yang
This patchset enables EEH on SRIOV VFs. The general idea is to create a
proper VF edev and VF PE and handle them properly.

Unlike a Bus PE, a VF PE contains just one VF. This changes how EEH errors
are handled on a VF PE in several ways.

First, a VF's removal and re-enumeration rely on its PF. A VF is tightly
coupled to its PF, so it is not proper to enumerate a VF through the usual
scan procedure. That's why virtfn_add/virtfn_remove are exported in this
patch set.

Second, the reset/restore of a VF is done in kernel space. FW is not aware of
the VF, which means the usual reset function done in FW will not work. One of
the patches imitates the reset/restore function in kernel space.

Third, the VF may be removed during the PF's error_detected function. In this
case, the original error_detected->slot_reset->resume sequence is not proper
for those removed VFs, since they are re-created by the PF in a fresh state.
A flag in eeh_dev is introduced to mark that the eeh_dev is in error state.
By doing so, we track whether this device needs to be reset or not.
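
The third point can be sketched as a small state machine. This is a
simplified model, not the kernel code: struct vf_edev_stub and the four
functions are hypothetical, and the counters only exist to make the
skipped callbacks observable.

```c
#include <assert.h>

/* Hypothetical, reduced view of a VF's eeh_dev. */
struct vf_edev_stub {
	int in_error;		/* set once the error was reported */
	int reset_calls;	/* how often slot_reset would have run */
	int resume_calls;	/* how often resume would have run */
};

static void report_error(struct vf_edev_stub *e)
{
	assert(e);
	e->in_error = 1;
}

/* A re-created VF has in_error == 0, so its slot_reset is skipped. */
static void report_reset(struct vf_edev_stub *e)
{
	assert(e);
	if (!e->in_error)
		return;
	e->reset_calls++;
}

/* resume is skipped the same way; the flag is cleared in either case. */
static void report_resume(struct vf_edev_stub *e)
{
	assert(e);
	if (e->in_error)
		e->resume_calls++;
	e->in_error = 0;
}

/* Removing the device (e.g. by the PF's error handler) clears the flag. */
static void remove_device(struct vf_edev_stub *e)
{
	assert(e);
	e->in_error = 0;
}
```

A VF that stays in place sees the full error, reset, resume sequence,
while a VF removed in between sees neither reset nor resume.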

This has been tested both on host and in guest on Power8 with the latest
kernel version.

v10:
   * delete the last patch "powerpc/powernv: compound PE for VFs" since after
 redesign of SRIOV, there is no compound PE for VFs now.
   * add two patches which fix problems found during tests
 powerpc/eeh: Support error recovery for VF PE  
   
 powerpc/eeh: Handle hot removed VF when PF is EEH aware
v9:
   * split pcibios_bus_add_device() into a separate patch
   * Bjorn acked the PCI part and agreed this patch set to be merged from ppc
 tree
   * rebased on mpe/linux.git next branch
v8:
   * fix on checking the return value of pnv_eeh_do_flr()
   * introduced a weak function pcibios_bus_add_device() to create PE for VFs
v7:
   * fix compile error when PCI_IOV is not set
v6:
   * code / commit log refactor by Gavin
v5:
   * remove the compound field, iterate on Master VF PE instead
   * some code refine on PCI config restore and reset on VF
 the wait time for assert and deassert
 PCI device address format
 check on edev->pcie_cap and edev->aer_cap before access them
v4:
   * refine the change logs, comment and code style
   * change pnv_pci_fixup_vf_eeh() to pnv_eeh_vf_final_fixup() and remove the
 CONFIG_PCI_IOV macro
   * reorder patch 5/6 to make the logic more reasonable
   * remove remove_dev_pci_data()
   * remove the EEH_DEV_VF flag, use edev->physfn to identify a VF EEH DEV and
 remove related CONFIG_PCI_IOV macro
   * add the option for VF reset
   * fix the pnv_eeh_cfg_blocked() logic
   * replace pnv_pci_cfg_{read,write} with eeh_ops->{read,write}_config in
 pnv_eeh_vf_restore_config()
   * rename pnv_eeh_vf_restore_config() to pnv_eeh_restore_vf_config()
   * rename pnv_pci_fixup_vf_caps() to pnv_pci_vf_header_fixup() and move it
 to arch/powerpc/platforms/powernv/pci.c
   * add a field compound in pnv_ioda_pe to link compound PEs
   * handle compound PE for VF PEs
v3:
   * add back vf_index in pci_dn to track the VF's index
   * rename ppdev in eeh_dev to physfn for consistency
   * move edev->physfn assignment before dev->dev.archdata.edev is set
   * move pnv_pci_fixup_vf_eeh() and pnv_pci_fixup_vf_caps() to eeh-powernv.c
   * more clear and detail in commit log and comment in code
   * merge eeh_rmv_virt_device() with eeh_rmv_device()
   * move the cfg_blocked check logic from pnv_eeh_read/write_config() to
 pnv_eeh_cfg_blocked()
   * move the vf reset/restore logic into its own patch, two patches are
 created.
 powerpc/powernv: Support PCI config restore for VFs
 powerpc/powernv: Support EEH reset for VFs
   * simplify the vf reset logic
v2:
   * add prefix pci_iov_ to virtfn_add/virtfn_remove
   * use EEH_DEV_VF as a flag for a VF's eeh_dev
   * use eeh_dev instead of edev in change log
   * remove vf_index in eeh_dev, calculate it from pdn->busno and devfn
   * do eeh_add_device_late() and eeh_sysfs_add_device() both after pci_dev is
 well initialized
   * do FLR to reset a VF PE
   * imitate the restore function in FW for VF
   * remove the reverse order patch, since it is still under discussion

Gavin Shan (1):
  powerpc/eeh: Don't block PCI config on resetting VF PE

Wei Yang (11):
  PCI/IOV: Rename and export virtfn_add/virtfn_remove
  PCI: Add pcibios_bus_add_device() weak function
  powerpc/pci: Cache VF index in pci_dn
  powerpc/pci: Remove VFs prior to PF
  powerpc/eeh: Cache only BARs, not windows or IOV BARs
  powerpc/powernv: EEH device for VF
  powerpc/eeh: Create PE for VFs
  powerpc/powernv: Support EEH reset for VF PE
  powerpc/powernv: Support PCI config restore for VFs
  powerpc/eeh: Support error recovery for VF PE
  powerpc/eeh: Handle hot removed VF when PF is EEH aware

 arch/powerpc/include/asm/eeh.h   |  10 ++
 arch/powerpc/include/asm/pci-bridge.h 

[PATCH V10 01/12] PCI/IOV: Rename and export virtfn_add/virtfn_remove

2015-10-25 Thread Wei Yang
During EEH recovery, hotplug is applied to devices which don't have
drivers or whose drivers don't support EEH. However, the hotplug,
which was implemented based on the PCI bus, can't be applied to VFs
directly.

The patch renames virtfn_{add,remove}() and exports them so that they

[gwshan: changelog]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gws...@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/iov.c   | 10 +-
 include/linux/pci.h |  8 
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..cc941dd 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -108,7 +108,7 @@ resource_size_t pci_iov_resource_size(struct pci_dev *dev, 
int resno)
return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
 }
 
-static int virtfn_add(struct pci_dev *dev, int id, int reset)
+int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset)
 {
int i;
int rc = -ENOMEM;
@@ -183,7 +183,7 @@ failed:
return rc;
 }
 
-static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int reset)
 {
char buf[VIRTFN_ID_LEN];
struct pci_dev *virtfn;
@@ -320,7 +320,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
}
 
for (i = 0; i < initial; i++) {
-   rc = virtfn_add(dev, i, 0);
+   rc = pci_iov_virtfn_add(dev, i, 0);
if (rc)
goto failed;
}
@@ -332,7 +332,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 
 failed:
for (j = 0; j < i; j++)
-   virtfn_remove(dev, j, 0);
+   pci_iov_virtfn_remove(dev, j, 0);
 
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
@@ -361,7 +361,7 @@ static void sriov_disable(struct pci_dev *dev)
return;
 
for (i = 0; i < iov->num_VFs; i++)
-   virtfn_remove(dev, i, 0);
+   pci_iov_virtfn_remove(dev, i, 0);
 
pcibios_sriov_disable(dev);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 860c751..b854a5f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1669,6 +1669,8 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
 
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
+int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset);
+void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int reset);
 int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
@@ -1686,6 +1688,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev 
*dev, int id)
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
+static inline int pci_iov_virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+   return -ENOSYS;
+}
+static inline void pci_iov_virtfn_remove(struct pci_dev *dev, int id, int 
reset)
+{ }
 static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
 static inline int pci_vfs_assigned(struct pci_dev *dev)
 { return 0; }
-- 
2.5.0


[PATCH V10 11/12] powerpc/eeh: Don't block PCI config on resetting VF PE

2015-10-25 Thread Wei Yang
From: Gavin Shan 

When passing through an SRIOV VF from host to guest via the VFIO PCI
infrastructure, the VF is reset by the EEH-specific backend
(pcibios_set_pcie_reset_state()). We can't block the PCI config;
otherwise the reset (FLR or AF FLR), which is completed by PCI
config access to the VF, can't be done. Then the VF can't be
put into its initial state when passing it to the guest and returning
it to the host.

The patch just doesn't block the VF's PCI config space when doing
the reset. It fixes an EEH error caused by DMA traffic to a bogus DMA
address on restarting a guest after killing the QEMU process, which
includes a Mellanox VF passed through from the host.
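
The conditional blocking can be sketched as follows. The type bit matches
EEH_PE_VF from the earlier patch in this series; the state bit values,
struct pe_stub and the function name are assumptions made for this sketch
only.

```c
#include <assert.h>

#define PE_TYPE_DEVICE	(1 << 2)
#define PE_TYPE_VF	(1 << 4)	/* same bit as EEH_PE_VF */

/* State bits; the exact values are assumed for this sketch. */
#define PE_ISOLATED	(1 << 0)
#define PE_CFG_BLOCKED	(1 << 2)

struct pe_stub {
	int type;
	int state;
};

/*
 * Prepare a PE for a hot reset. A VF PE is reset through its own config
 * space (FLR or AF FLR), so config access must stay open for it.
 */
static void pe_prepare_hot_reset(struct pe_stub *pe)
{
	assert(pe);
	pe->state |= PE_ISOLATED;
	if (!(pe->type & PE_TYPE_VF))
		pe->state |= PE_CFG_BLOCKED;
}
```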

Reported-by: Alexey Kardashevskiy 
Signed-off-by: Gavin Shan 
Tested-by: Alexey Kardashevskiy 
Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/eeh.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 28e4d73..e1846f5 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -745,7 +745,8 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum 
pcie_reset_state stat
case pcie_deassert_reset:
eeh_ops->reset(pe, EEH_RESET_DEACTIVATE);
eeh_unfreeze_pe(pe, false);
-   eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
break;
@@ -753,14 +754,16 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, 
enum pcie_reset_state stat
eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
-   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops->reset(pe, EEH_RESET_HOT);
break;
case pcie_warm_reset:
eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
-   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
+   if (!(pe->type & EEH_PE_VF))
+   eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops->reset(pe, EEH_RESET_FUNDAMENTAL);
break;
default:
-- 
2.5.0


[PATCH V10 04/12] powerpc/pci: Remove VFs prior to PF

2015-10-25 Thread Wei Yang
As commit ac205b7bb72f ("PCI: make sriov work with hotplug remove") indicates,
VFs, which might be hooked to the same PCI bus as their PF, should be removed
before the PF. Otherwise, PCI hot unplugging on the PCI bus would
cause a kernel crash.

The patch applies the above pattern to the PowerPC PCI hotplug path.
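
A minimal sketch of why reverse iteration helps, assuming VFs sit after
their PF on the bus list because they are added later. Devices are plain
integer IDs here and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk the bus list from the tail and record the removal order.
 * With the PF at the head and its VFs appended after it, the VFs
 * are removed first and the PF last.
 */
static void remove_all_reverse(const int *bus_list, size_t n, int *order_out)
{
	size_t k = 0;

	assert(bus_list && order_out);
	for (size_t i = n; i-- > 0; )
		order_out[k++] = bus_list[i];
}
```

With a bus list of { PF, VF0, VF1 } the recorded order is VF1, VF0, PF.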

[gwshan: changelog]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-hotplug.c 
b/arch/powerpc/kernel/pci-hotplug.c
index 7f9ed0c..59c4361 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -55,7 +55,7 @@ void pcibios_remove_pci_devices(struct pci_bus *bus)
 
pr_debug("PCI: Removing devices on bus %04x:%02x\n",
 pci_domain_nr(bus),  bus->number);
-   list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list) {
+   list_for_each_entry_safe_reverse(dev, tmp, &bus->devices, bus_list) {
pr_debug("   Removing %s...\n", pci_name(dev));
pci_stop_and_remove_bus_device(dev);
}
-- 
2.5.0


[PATCH V10 06/12] powerpc/powernv: EEH device for VF

2015-10-25 Thread Wei Yang
VFs and their corresponding pci_dn instances are created and released
dynamically as their PF's SRIOV capability is enabled and disabled.
The patch creates and releases EEH devices for VFs when creating and
releasing their pci_dn instances, which means EEH devices and pci_dn
instances have the same life cycle. Also, a VF's EEH device is identified
by (struct eeh_dev::physfn).

[gwshan: changelog and removed CONFIG_PCI_IOV]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gws...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h |  1 +
 arch/powerpc/kernel/pci_dn.c   | 12 
 2 files changed, 13 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index c5eb86f..6c383ad 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -140,6 +140,7 @@ struct eeh_dev {
struct pci_controller *phb; /* Associated PHB   */
struct pci_dn *pdn; /* Associated PCI device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
+   struct pci_dev *physfn; /* Associated PF PORT   */
struct pci_bus *bus;/* PCI bus for partial hotplug  */
 };
 
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index f771130..f0ddde7 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -180,7 +180,9 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
*parent,
 struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
 {
 #ifdef CONFIG_PCI_IOV
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pci_dn *parent, *pdn;
+   struct eeh_dev *edev;
int i;
 
/* Only support IOV for now */
@@ -206,6 +208,9 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
 __func__, i);
return NULL;
}
+   eeh_dev_init(pdn, hose);
+   edev = pdn_to_eeh_dev(pdn);
+   edev->physfn = pdev;
}
 #endif /* CONFIG_PCI_IOV */
 
@@ -254,10 +259,17 @@ void remove_dev_pci_data(struct pci_dev *pdev)
for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
list_for_each_entry_safe(pdn, tmp,
&parent->child_list, list) {
+   struct eeh_dev *edev;
if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
continue;
 
+   edev = pdn_to_eeh_dev(pdn);
+   if (edev) {
+   pdn->edev = NULL;
+   kfree(edev);
+   }
+
if (!list_empty(&pdn->list))
list_del(&pdn->list);
 
-- 
2.5.0


[PATCH V10 12/12] powerpc/eeh: Handle hot removed VF when PF is EEH aware

2015-10-25 Thread Wei Yang
When the PF is EEH aware while its VFs are not, the VFs will be removed during
EEH recovery. This is not supported in the current code, which leads to the
VFs being lost.

This patch fixes this by adding the VFs back. VFs should be added back after
the PF has recovered properly.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/eeh.h   |  6 ++
 arch/powerpc/kernel/eeh_dev.c|  1 +
 arch/powerpc/kernel/eeh_driver.c | 30 +++---
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index ea1f13c4..c529a23 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -127,6 +127,11 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
 #define EEH_DEV_SYSFS  (1 << 9)/* Sysfs created*/
 #define EEH_DEV_REMOVED (1 << 10)   /* Removed permanently */
 
+struct eeh_rmv_data {
+   struct list_head edev_list;
+   int removed;
+};
+
 struct eeh_dev {
int mode;   /* EEH mode */
int class_code; /* Class code of the device */
@@ -139,6 +144,7 @@ struct eeh_dev {
int af_cap; /* Saved AF capability  */
struct eeh_pe *pe;  /* Associated PE*/
struct list_head list;  /* Form link list in the PE */
+   struct list_head rmv_list;  /* Record the removed edev  */
struct pci_controller *phb; /* Associated PHB   */
struct pci_dn *pdn; /* Associated PCI device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index aabba94..7815095 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -67,6 +67,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
edev->pdn = pdn;
edev->phb = phb;
INIT_LIST_HEAD(&edev->list);
+   INIT_LIST_HEAD(&edev->rmv_list);
 
return NULL;
 }
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 99868e2..f2406b4 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -420,7 +420,8 @@ static void *eeh_rmv_device(void *data, void *userdata)
struct pci_driver *driver;
struct eeh_dev *edev = (struct eeh_dev *)data;
struct pci_dev *dev = eeh_dev_to_pci_dev(edev);
-   int *removed = (int *)userdata;
+   struct eeh_rmv_data *rmv_data = (struct eeh_rmv_data *)userdata;
+   int *removed = rmv_data ? &rmv_data->removed : NULL;
struct pci_dn *pdn = eeh_dev_to_pdn(edev);
 
/*
@@ -467,6 +468,9 @@ static void *eeh_rmv_device(void *data, void *userdata)
 * required to plug the VF successfully.
 */
pdn->pe_number = IODA_INVALID_PE;
+
+   if (rmv_data)
+   list_add(&edev->rmv_list, &rmv_data->edev_list);
} else {
pci_lock_rescan_remove();
pci_stop_and_remove_bus_device(dev);
@@ -585,11 +589,12 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
  * During the reset, udev might be invoked because those affected
  * PCI devices will be removed and then added.
  */
-static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus)
+static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
+   struct eeh_rmv_data *rmv_data)
 {
struct pci_bus *frozen_bus = eeh_pe_bus_get(pe);
struct timeval tstamp;
-   int cnt, rc, removed = 0;
+   int cnt, rc;
struct eeh_dev *edev;
 
/* pcibios will clear the counter; save the value */
@@ -612,7 +617,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct 
pci_bus *bus)
pci_unlock_rescan_remove();
}
} else if (frozen_bus)
-   eeh_pe_dev_traverse(pe, eeh_rmv_device, &removed);
+   eeh_pe_dev_traverse(pe, eeh_rmv_device, rmv_data);
 
/*
 * Reset the pci controller. (Asserts RST#; resets config space).
@@ -659,7 +664,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct 
pci_bus *bus)
eeh_add_virt_device(edev, NULL);
else
pcibios_add_pci_devices(bus);
-   } else if (frozen_bus && removed) {
+   } else if (frozen_bus && rmv_data->removed) {
pr_info("EEH: Sleep 5s ahead of partial hotplug\n");
ssleep(5);
 
@@ -687,8 +692,10 @@ static int eeh_reset_device(struct eeh_pe *pe, struct 
pci_bus *bus)
 static void eeh_handle_normal_event(struct eeh_pe *pe)
 {
struct pci_bus *frozen_bus;
+   struct eeh_dev *edev, *tmp;

[PATCH V7 5/6] powerpc/powernv: boundary the total VF BAR size instead of the individual one

2015-10-21 Thread Wei Yang
Each VF can have at most 6 BARs. When the total size of these BARs exceeds the
gate, the expanded IOV BARs will also exhaust the M64 window.

This patch enforces the boundary on the total VF BAR size instead of on
each individual BAR.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gws...@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 0add35f..1c11b1a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2701,7 +2701,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
const resource_size_t gate = phb->ioda.m64_segsize >> 2;
struct resource *res;
int i;
-   resource_size_t size;
+   resource_size_t size, total_vf_bar_sz;
struct pci_dn *pdn;
int mul, total_vfs;
 
@@ -2714,6 +2714,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
 
total_vfs = pci_sriov_get_totalvfs(pdev);
mul = phb->ioda.total_pe;
+   total_vf_bar_sz = 0;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &pdev->resource[i + PCI_IOV_RESOURCES];
@@ -2726,7 +2727,8 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
goto truncate_iov;
}
 
-   size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
+   total_vf_bar_sz += pci_iov_resource_size(pdev,
+   i + PCI_IOV_RESOURCES);
 
/*
 * If bigger than quarter of M64 segment size, just round up
@@ -2740,11 +2742,11 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
 * limit the system flexibility.  This is a design decision to
 * set the boundary to quarter of the M64 segment size.
 */
-   if (size > gate) {
-   dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size "
-   "is bigger than %lld, roundup power2\n",
-i, res, gate);
+   if (total_vf_bar_sz > gate) {
mul = roundup_pow_of_two(total_vfs);
+   dev_info(&pdev->dev,
+   "VF BAR Total IOV size %llx > %llx, roundup to %d VFs\n",
+   total_vf_bar_sz, gate, mul);
pdn->m64_single_mode = true;
break;
}
-- 
2.5.0


[PATCH V7 6/6] powerpc/powernv: allocate sparse PE# when using M64 BAR in Single PE mode

2015-10-21 Thread Wei Yang
When an M64 BAR is set to Single PE mode, the PE numbers assigned to VFs can
be sparse.

This patch restructures the code to allocate sparse PE numbers for VFs when
the M64 BAR is set to Single PE mode. It also renames offset to pe_num_map to
reflect that the content is the PE number.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gws...@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/pci-bridge.h |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 81 +++
 2 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 8aeba4c..b3a226b 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -213,7 +213,7 @@ struct pci_dn {
 #ifdef CONFIG_PCI_IOV
u16 vfs_expanded;       /* number of VFs IOV BAR expanded */
u16 num_vfs;            /* number of VFs enabled */
-   int offset;         /* PE# for the first VF PE */
+   int *pe_num_map;    /* PE# for the first VF PE or array */
bool m64_single_mode;   /* Use M64 BAR in Single Mode */
 #define IODA_INVALID_M64        (-1)
int (*m64_map)[PCI_SRIOV_NUM_BARS];
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1c11b1a..91be853 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1243,7 +1243,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
 
/* Map the M64 here */
if (pdn->m64_single_mode) {
-   pe_num = pdn->offset + j;
+   pe_num = pdn->pe_num_map[j];
rc = opal_pci_map_pe_mmio_window(phb->opal_id,
pe_num, OPAL_M64_WINDOW_TYPE,
pdn->m64_map[j][i], 0);
@@ -1347,7 +1347,7 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev)
struct pnv_phb *phb;
struct pci_dn *pdn;
struct pci_sriov *iov;
-   u16 num_vfs;
+   u16 num_vfs, i;
 
bus = pdev->bus;
hose = pci_bus_to_host(bus);
@@ -1361,14 +1361,21 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev)
 
if (phb->type == PNV_PHB_IODA2) {
if (!pdn->m64_single_mode)
-   pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+   pnv_pci_vf_resource_shift(pdev, -*pdn->pe_num_map);
 
/* Release M64 windows */
pnv_pci_vf_release_m64(pdev, num_vfs);
 
/* Release PE numbers */
-   bitmap_clear(phb->ioda.pe_alloc, pdn->offset, num_vfs);
-   pdn->offset = 0;
+   if (pdn->m64_single_mode) {
+   for (i = 0; i < num_vfs; i++) {
+   if (pdn->pe_num_map[i] != IODA_INVALID_PE)
+   pnv_ioda_free_pe(phb, 
pdn->pe_num_map[i]);
+   }
+   } else
+   bitmap_clear(phb->ioda.pe_alloc, *pdn->pe_num_map, 
num_vfs);
+   /* Releasing pe_num_map */
+   kfree(pdn->pe_num_map);
}
 }
 
@@ -1394,7 +1401,10 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, 
u16 num_vfs)
 
/* Reserve PE for each VF */
for (vf_index = 0; vf_index < num_vfs; vf_index++) {
-   pe_num = pdn->offset + vf_index;
+   if (pdn->m64_single_mode)
+   pe_num = pdn->pe_num_map[vf_index];
+   else
+   pe_num = *pdn->pe_num_map + vf_index;
 
pe = &phb->ioda.pe_array[pe_num];
pe->pe_number = pe_num;
@@ -1436,6 +1446,7 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 
num_vfs)
struct pnv_phb *phb;
struct pci_dn *pdn;
int ret;
+   u16 i;
 
bus = pdev->bus;
hose = pci_bus_to_host(bus);
@@ -1458,20 +1469,44 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 
num_vfs)
return -EBUSY;
}
 
+   /* Allocating pe_num_map */
+   if (pdn->m64_single_mode)
+   pdn->pe_num_map = kmalloc(sizeof(*pdn->pe_num_map) * 
num_vfs,
+   GFP_KERNEL);
+   else
+   pdn->pe_num_map = kmalloc(sizeof(*pdn->pe_num_map), 
GFP_KERNEL);
+
+   if (!pdn->pe_num_map)
+   return -ENOMEM;
+
+   if (pdn->m64_single_mode)

[PATCH V7 4/6] powerpc/powernv: replace the hard coded boundary with gate

2015-10-21 Thread Wei Yang
At the moment the 64bit-prefetchable window can be at most 64GB, a value
currently obtained from the device tree. This means that in shared mode the
maximum supported VF BAR size is 64GB/256=256MB. However, a VF BAR of this
size could exhaust the whole 64bit-prefetchable window. It is therefore a
design decision to set the boundary of the VF BAR size to 64MB. Since a VF
BAR of 64MB would occupy a quarter of the 64bit-prefetchable window, this is
affordable.

This patch replaces the magic limit of 64MB with "gate", which is 1/4 of the
M64 segment size (m64_segsize >> 2), and adds a comment to explain the reason
for it.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gws...@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++-
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index f867a9b..0add35f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2696,8 +2696,9 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { 
}
 #ifdef CONFIG_PCI_IOV
 static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 {
-   struct pci_controller *hose;
-   struct pnv_phb *phb;
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   const resource_size_t gate = phb->ioda.m64_segsize >> 2;
struct resource *res;
int i;
resource_size_t size;
@@ -2707,9 +2708,6 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
if (!pdev->is_physfn || pdev->is_added)
return;
 
-   hose = pci_bus_to_host(pdev->bus);
-   phb = hose->private_data;
-
pdn = pci_get_pdn(pdev);
pdn->vfs_expanded = 0;
pdn->m64_single_mode = false;
@@ -2730,10 +2728,22 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
 
size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
 
-   /* bigger than 64M */
-   if (size > (1 << 26)) {
-   dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size is bigger than 64M, roundup power2\n",
-i, res);
+   /*
+* If bigger than quarter of M64 segment size, just round up
+* power of two.
+*
+* Generally, one M64 BAR maps one IOV BAR. To avoid conflict
+* with other devices, IOV BAR size is expanded to be
+* (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
+* segment size , the expanded size would equal to half of the
+* whole M64 space size, which will exhaust the M64 Space and
+* limit the system flexibility.  This is a design decision to
+* set the boundary to quarter of the M64 segment size.
+*/
+   if (size > gate) {
+   dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size "
+   "is bigger than %lld, roundup power2\n",
+i, res, gate);
mul = roundup_pow_of_two(total_vfs);
pdn->m64_single_mode = true;
break;
-- 
2.5.0


[PATCH V7 1/6] powerpc/powernv: don't enable SRIOV when VF BAR has non 64bit-prefetchable BAR

2015-10-21 Thread Wei Yang
On PHB3, we enable SRIOV devices by mapping the IOV BAR with M64 BARs. If an
SRIOV device's IOV BAR is not 64bit-prefetchable, it is not assigned from the
64bit-prefetchable window, which means an M64 BAR can't work on it.

The reason is that PCI bridges support only 2 memory windows and the kernel
code programs bridges in such a way that one window is 32bit-nonprefetchable
and the other one is 64bit-prefetchable. So if a device's IOV BAR is 64bit and
non-prefetchable, it will be mapped into 32bit space and therefore M64
cannot be used for it.

This patch makes this explicit and truncates the IOV resource in this case to
save MMIO space.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gws...@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 35 +--
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 85cbc96..02324c6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -908,9 +908,6 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, 
int offset)
if (!res->flags || !res->parent)
continue;
 
-   if (!pnv_pci_is_mem_pref_64(res->flags))
-   continue;
-
/*
 * The actual IOV BAR range is determined by the start address
 * and the actual size for num_vfs VFs BAR.  This check is to
@@ -939,9 +936,6 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, 
int offset)
if (!res->flags || !res->parent)
continue;
 
-   if (!pnv_pci_is_mem_pref_64(res->flags))
-   continue;
-
size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
res2 = *res;
res->start += size * offset;
@@ -1221,9 +1215,6 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
if (!res->flags || !res->parent)
continue;
 
-   if (!pnv_pci_is_mem_pref_64(res->flags))
-   continue;
-
for (j = 0; j < vf_groups; j++) {
do {
win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
@@ -1510,6 +1501,12 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 
num_vfs)
pdn = pci_get_pdn(pdev);
 
if (phb->type == PNV_PHB_IODA2) {
+   if (!pdn->vfs_expanded) {
+   dev_info(&pdev->dev, "don't support this SRIOV device"
+   " with non 64bit-prefetchable IOV BAR\n");
+   return -ENOSPC;
+   }
+
/* Calculate available PE for required VFs */
mutex_lock(&phb->ioda.pe_alloc_mutex);
pdn->offset = bitmap_find_next_zero_area(
@@ -2775,9 +2772,10 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
if (!res->flags || res->parent)
continue;
if (!pnv_pci_is_mem_pref_64(res->flags)) {
-   dev_warn(&pdev->dev, " non M64 VF BAR%d: %pR\n",
+   dev_warn(&pdev->dev, "Don't support SR-IOV with"
+   " non M64 VF BAR%d: %pR. \n",
 i, res);
-   continue;
+   goto truncate_iov;
}
 
size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
@@ -2796,11 +2794,6 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
res = &pdev->resource[i + PCI_IOV_RESOURCES];
if (!res->flags || res->parent)
continue;
-   if (!pnv_pci_is_mem_pref_64(res->flags)) {
-   dev_warn(&pdev->dev, "Skipping expanding VF BAR%d: %pR\n",
-i, res);
-   continue;
-   }
 
dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
@@ -2810,6 +2803,16 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
 i, res, mul);
}
pdn->vfs_expanded = mul;
+
+   return;
+
+truncate_iov:
+   /* To save MMIO space, IOV BAR is truncated. */
+   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+   res = &pdev->resource[i + PCI_IOV_RESOURCES];
+   res->flags = 0;
+   res->end = res->start - 1;
+   }
 }
 #endif /* CONFIG_PCI_IOV */
 
-- 
2.5.0


[PATCH V7 2/6] powerpc/powernv: simplify the calculation of iov resource alignment

2015-10-21 Thread Wei Yang
The alignment of an IOV BAR on the PowerNV platform is the total size of the
IOV BAR. No matter whether the IOV BAR is expanded by
roundup_pow_of_two(total_vfs) or by the maximum PE number (256), the total
size can be calculated as (vfs_expanded * VF_BAR_size).

This patch simplifies pnv_pci_iov_resource_alignment() by removing the
first case.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gws...@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 02324c6..dc0c90b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2998,17 +2998,21 @@ static resource_size_t 
pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
  int resno)
 {
struct pci_dn *pdn = pci_get_pdn(pdev);
-   resource_size_t align, iov_align;
-
-   iov_align = resource_size(&pdev->resource[resno]);
-   if (iov_align)
-   return iov_align;
+   resource_size_t align;
 
+   /*
+* On PowerNV platform, IOV BAR is mapped by M64 BAR to enable the
+* SR-IOV. While from hardware perspective, the range mapped by M64
+* BAR should be size aligned.
+*
+* This function returns the total IOV BAR size if M64 BAR is in
+* Shared PE mode or just VF BAR size if not.
+*/
align = pci_iov_resource_size(pdev, resno);
-   if (pdn->vfs_expanded)
-   return pdn->vfs_expanded * align;
+   if (!pdn->vfs_expanded)
+   return align;
 
-   return align;
+   return pdn->vfs_expanded * align;
 }
 #endif /* CONFIG_PCI_IOV */
 
-- 
2.5.0


[PATCH V7 3/6] powerpc/powernv: use one M64 BAR in Single PE mode for one VF BAR

2015-10-21 Thread Wei Yang
In the current implementation, when a VF BAR is bigger than 64MB, 4 M64
BARs in Single PE mode are used to cover the number of VFs required to be
enabled. As a result, several VFs end up in one VF group, which leads to
interference between VFs in the same group.

Based on Gavin's comments, this patch also renames m64_wins to m64_map, which
holds the index number of the M64 BAR used to map the VF BAR. It also makes
sure the VF BAR size is bigger than 32MB when the M64 BAR is used in
Single PE mode.

This patch changes the design by using one M64 BAR in Single PE mode for
one VF BAR. This gives absolute isolation for VFs.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gws...@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/pci-bridge.h |   5 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 177 --
 2 files changed, 75 insertions(+), 107 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 712add5..8aeba4c 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -214,10 +214,9 @@ struct pci_dn {
u16 vfs_expanded;       /* number of VFs IOV BAR expanded */
u16 num_vfs;            /* number of VFs enabled */
int offset;             /* PE# for the first VF PE */
-#define M64_PER_IOV 4
-   int m64_per_iov;
+   bool m64_single_mode;   /* Use M64 BAR in Single Mode */
 #define IODA_INVALID_M64        (-1)
-   int m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
+   int (*m64_map)[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index dc0c90b..f867a9b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1148,29 +1148,36 @@ static void pnv_pci_ioda_setup_PEs(void)
 }
 
 #ifdef CONFIG_PCI_IOV
-static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_release_m64(struct pci_dev *pdev, u16 num_vfs)
 {
struct pci_bus *bus;
struct pci_controller *hose;
struct pnv_phb *phb;
struct pci_dn *pdn;
int i, j;
+   int m64_bars;
 
bus = pdev->bus;
hose = pci_bus_to_host(bus);
phb = hose->private_data;
pdn = pci_get_pdn(pdev);
 
+   if (pdn->m64_single_mode)
+   m64_bars = num_vfs;
+   else
+   m64_bars = 1;
+
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-   for (j = 0; j < M64_PER_IOV; j++) {
-   if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+   for (j = 0; j < m64_bars; j++) {
+   if (pdn->m64_map[j][i] == IODA_INVALID_M64)
continue;
opal_pci_phb_mmio_enable(phb->opal_id,
-   OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
-   clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
-   pdn->m64_wins[i][j] = IODA_INVALID_M64;
+   OPAL_M64_WINDOW_TYPE, pdn->m64_map[j][i], 0);
+   clear_bit(pdn->m64_map[j][i], &phb->ioda.m64_bar_alloc);
+   pdn->m64_map[j][i] = IODA_INVALID_M64;
}
 
+   kfree(pdn->m64_map);
return 0;
 }
 
@@ -1187,8 +1194,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
int total_vfs;
resource_size_t size, start;
int pe_num;
-   int vf_groups;
-   int vf_per_group;
+   int m64_bars;
 
bus = pdev->bus;
hose = pci_bus_to_host(bus);
@@ -1196,26 +1202,26 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
pdn = pci_get_pdn(pdev);
total_vfs = pci_sriov_get_totalvfs(pdev);
 
-   /* Initialize the m64_wins to IODA_INVALID_M64 */
-   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-   for (j = 0; j < M64_PER_IOV; j++)
-   pdn->m64_wins[i][j] = IODA_INVALID_M64;
+   if (pdn->m64_single_mode)
+   m64_bars = num_vfs;
+   else
+   m64_bars = 1;
+
+   pdn->m64_map = kmalloc(sizeof(*pdn->m64_map) * m64_bars, GFP_KERNEL);
+   if (!pdn->m64_map)
+   return -ENOMEM;
+   /* Initialize the m64_map to IODA_INVALID_M64 */
+   for (i = 0; i < m64_bars ; i++)
+   for (j = 0; j < PCI_SRIOV_NUM_BARS; j++)
+   pdn->m64_map[i][j] = IODA_INVALID_M64;
 
-   if (pdn-&
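
The dynamically sized m64_map, declared in the patch as a pointer to an array
of PCI_SRIOV_NUM_BARS ints, can be sketched in user space as below. The
FAKE_* constants, the m64_row typedef and the function names are hypothetical;
malloc()/free() stand in for kmalloc()/kfree(). One row is allocated per M64
BAR group: num_vfs rows in Single PE mode, a single row otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define FAKE_NUM_BARS    6      /* stands in for PCI_SRIOV_NUM_BARS */
#define FAKE_INVALID_M64 (-1)   /* stands in for IODA_INVALID_M64 */

/* One row holds the M64 BAR index used for each of a VF's BARs. */
typedef int m64_row[FAKE_NUM_BARS];

/* Single PE mode needs one row per VF; shared mode needs only one row. */
static unsigned int m64_bars_needed(bool single_mode, unsigned int num_vfs)
{
    return single_mode ? num_vfs : 1;
}

/* Allocate the map and mark every entry as "no M64 BAR assigned yet". */
static m64_row *alloc_m64_map(unsigned int rows)
{
    m64_row *map;
    unsigned int i;
    int j;

    assert(rows > 0);
    map = malloc(sizeof(*map) * rows);
    if (!map)
        return NULL;

    for (i = 0; i < rows; i++)
        for (j = 0; j < FAKE_NUM_BARS; j++)
            map[i][j] = FAKE_INVALID_M64;

    return map;
}

/* Demo: allocate for the given mode and count entries still invalid. */
static int count_invalid_entries(bool single_mode, unsigned int num_vfs)
{
    unsigned int rows = m64_bars_needed(single_mode, num_vfs);
    m64_row *map = alloc_m64_map(rows);
    int count = 0;
    unsigned int i;
    int j;

    if (!map)
        return -1;

    for (i = 0; i < rows; i++)
        for (j = 0; j < FAKE_NUM_BARS; j++)
            if (map[i][j] == FAKE_INVALID_M64)
                count++;

    free(map);
    return count;
}
```

Indexing is map[vf][bar], which matches the m64_map[j][i] accesses in the
hunks above where j walks the M64 BAR groups and i walks the VF BARs.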

[PATCH V7 0/6] Redesign SR-IOV on PowerNV

2015-10-21 Thread Wei Yang
The original design groups VFs in order to enable a larger number of VFs in the
system when the VF BAR is bigger than 64MB. This design has a flaw: an
error on one VF will interfere with the other VFs in the same group.

This patch series changes the design by using an M64 BAR in Single PE mode to
cover only one VF BAR, which gives absolute isolation between VFs.

v7:
   * clear res->flags when truncating the IOV BAR
v6:
   * add the minimum size check when M64 BAR is in Single PE mode
   * truncate IOV BAR when powernv can't handle it
v5:
   * rebase on top of v4.2, with commit 68230242cdb "net/mlx4_core: Add port
 attribute when tracking counters" reverted
   * add some reason in change log of Patch 1
   * make pnv_pci_iov_resource_alignment() easier to read
   * initialize pe_num_map[] just after it is allocated
   * test ssh from guest to host via VF passed and then shutdown the guest
   * no code change
v4:
   * rebase the code on top of v4.2-rc7
   * switch back to use the dynamic version of pe_num_map and m64_map
   * split the memory allocation and PE assignment of pe_num_map to make it
 easier to read
   * check the pe_num_map value before freeing the PE
   * add the rename reason for pe_num_map and m64_map in change log
v3:
   * return -ENOSPC when a VF has non-64bit prefetchable BAR
   * rename offset to pe_num_map and define it statically
   * change commit log based on comments
   * define m64_map statically
v2:
   * clean up iov bar alignment calculation
   * change m64s to m64_bars
   * add a field to represent M64 Single PE mode will be used
   * change m64_wins to m64_map
   * calculate the gate instead of hard coded
   * dynamically allocate m64_map
   * dynamically allocate PE#
   * add a case to calculate iov bar alignment when M64 Single PE is used
   * when M64 Single PE is used, compare num_vfs with M64 BAR available number 
 in system at first


Wei Yang (6):
  powerpc/powernv: don't enable SRIOV when VF BAR has non
64bit-prefetchable BAR
  powerpc/powernv: simplify the calculation of iov resource alignment
  powerpc/powernv: use one M64 BAR in Single PE mode for one VF BAR
  powerpc/powernv: replace the hard coded boundary with gate
  powerpc/powernv: boundary the total VF BAR size instead of the
individual one
  powerpc/powernv: allocate sparse PE# when using M64 BAR in Single PE
mode

 arch/powerpc/include/asm/pci-bridge.h |   7 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 347 --
 2 files changed, 192 insertions(+), 162 deletions(-)

-- 
2.5.0

