Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-15 Thread Takao Indoh
(2013/05/16 1:59), Sumner, William wrote:
> Thank you for sending the 2-line addition to your patch that helps me test 
> the suggestions in the PCIe standard.
> 
> I started with a new Linux top-of-tree as of last week, added the patch from 
> late April, added these new lines of code, then tested on one of the 
> platforms that showed the problem before.
> 
> This addition seems to improve operation on my platform.  Previously the 
> lock-up occurred immediately when the problem PCI card was reset.  With these 
> two lines added, the patch code completed: all resets occurred, all status 
> was restored to PCI cards, and the crash kernel began loading modules.  The 
> kernel then stopped after loading a few modules -- around 5 or 6 modules.  
> This was consistent over a few tries.  Using the same kernel, eliminating 
> only the "pci=pcie_reset_devices" from the crash kernel's command line, a 
> dump succeeds.  I am still investigating why the crash kernel stops during 
> module loading.
> 
> Reading other reset code in Linux: pci_reset_function() writes 
> PCI_COMMAND_INTX_DISABLE into the command register instead of writing 0.  
> This clears all bits and sets the INTX_DISABLE bit.  So I tried that in the 
> patch and the results were the same as in the paragraph above.

Thank you for testing. Hmm, clearing command register is effective but
something else seems to be also needed.

> Two questions:
> 1. Since the two added lines prevent the PCIe device from initiating new bus 
> activity including DMA IO, and also prevent the PCIe device from sending 
> interrupts; is this not sufficient (without resetting the PCIe device) to 
> solve the original problem of leftover ongoing DMA IO when the crash kernel 
> resets the hardware IOMMU ?
> 
> 1a. My guess is that the answer to question 1 is "No" because some PCIe card 
> may not check the bus master flag just before each bus transaction -- its 
> engineers having interpreted the standard differently than I do.  However, 
> this seems like a question worth asking of the community.

>From my experience, my answer is also "No".  On certain raid card test,
DMA remapping fault or PCI SERR are still detected after just clearing
bus master bit, though it may be a problem of its driver or hardware.


> 2. Linux has a set of functions to reset PCIe devices beginning at " int 
> pci_reset_function(struct pci_dev *dev) " and calling down to 
> "__pci_dev_reset"  which tries several different methods to reset the PCI 
> device.  Would any of these be useful additions to the proposed patch ?

Good point. When I started this work, at first I made a patch to reset
devices like __pci_dev_reset. As a result, it did not work for a certain
raid card. In this case, pcie_flr() is called in __pci_dev_reset() and
it returned true, but DMAR error occurred after its driver was loaded.

One of ideas is adding boot parameter so user can choose appropriate
method: just clearing bus master, FLR, hot-reset, etc.

Thanks,
Takao Indoh


> 
> Bill Sumner
> 
> 
> 
> -Original Message-
> From: Takao Indoh [mailto:indou.ta...@jp.fujitsu.com]
> Sent: Tuesday, May 07, 2013 2:10 AM
> To: Sumner, William
> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
> linux-kernel@vger.kernel.org; tin...@gmail.com; 
> io...@lists.linux-foundation.org; bhelg...@google.com; ddut...@redhat.com
> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
> 
> Sorry for the delayed response.
> 
> (2013/04/30 23:54), Sumner, William wrote:
>> I have installed your original patch set (from last November) and tested 
>> with three platforms, each with a different IO configuration.  On the first 
>> platform crashdumps were consistently successful.  On the second and third 
>> platforms, the reset of one specific PCI device on each platform (a 
>> different device type on each platform) immediately hung the crash kernel. 
>> This was consistent across multiple tries.  Temporarily modifying the patch 
>> to skip resetting the one problem-device on each platform allowed each 
>> platform to create a dump successfully.
>>
>> I am now working with our IO team to determine the exact nature of the hang 
>> and to see if the hang is unique to the specific cards or the specific 
>> platforms.
>>
>> Also I am starting to look at an alternate possibility:
>>
>> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
>> Reset) describes reset use-cases for "a partitioned environment where 
>> hardware is migrated from one partition to another" and for "system software 
>> is taking down the software stack for a Function and then rebuilding that 
>> s

RE: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-15 Thread Sumner, William
Thank you for sending the 2-line addition to your patch that helps me test the 
suggestions in the PCIe standard.

I started with a new Linux top-of-tree as of last week, added the patch from 
late April, added these new lines of code, then tested on one of the platforms 
that showed the problem before.  

This addition seems to improve operation on my platform.  Previously the 
lock-up occurred immediately when the problem PCI card was reset.  With these 
two lines added, the patch code completed: all resets occurred, all status was 
restored to PCI cards, and the crash kernel began loading modules.  The kernel 
then stopped after loading a few modules -- around 5 or 6 modules.  This was 
consistent over a few tries.  Using the same kernel, eliminating only the 
"pci=pcie_reset_devices" from the crash kernel's command line, a dump succeeds. 
 I am still investigating why the crash kernel stops during module loading.

Reading other reset code in Linux: pci_reset_function() writes 
PCI_COMMAND_INTX_DISABLE into the command register instead of writing 0.  This 
clears all bits and sets the INTX_DISABLE bit.  So I tried that in the patch 
and the results were the same as in the paragraph above.

Two questions:
1. Since the two added lines prevent the PCIe device from initiating new bus 
activity including DMA IO, and also prevent the PCIe device from sending 
interrupts; is this not sufficient (without resetting the PCIe device) to solve 
the original problem of leftover ongoing DMA IO when the crash kernel resets 
the hardware IOMMU ?

1a. My guess is that the answer to question 1 is "No" because some PCIe card 
may not check the bus master flag just before each bus transaction -- its 
engineers having interpreted the standard differently than I do.  However, this 
seems like a question worth asking of the community.  


2. Linux has a set of functions to reset PCIe devices beginning at " int 
pci_reset_function(struct pci_dev *dev) " and calling down to "__pci_dev_reset" 
 which tries several different methods to reset the PCI device.  Would any of 
these be useful additions to the proposed patch ?  

Bill Sumner 



-Original Message-
From: Takao Indoh [mailto:indou.ta...@jp.fujitsu.com] 
Sent: Tuesday, May 07, 2013 2:10 AM
To: Sumner, William
Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
linux-kernel@vger.kernel.org; tin...@gmail.com; 
io...@lists.linux-foundation.org; bhelg...@google.com; ddut...@redhat.com
Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

Sorry for the delayed response.

(2013/04/30 23:54), Sumner, William wrote:
> I have installed your original patch set (from last November) and tested with 
> three platforms, each with a different IO configuration.  On the first 
> platform crashdumps were consistently successful.  On the second and third 
> platforms, the reset of one specific PCI device on each platform (a different 
> device type on each platform) immediately hung the crash kernel. This was 
> consistent across multiple tries.  Temporarily modifying the patch to skip 
> resetting the one problem-device on each platform allowed each platform to 
> create a dump successfully.
> 
> I am now working with our IO team to determine the exact nature of the hang 
> and to see if the hang is unique to the specific cards or the specific 
> platforms.
> 
> Also I am starting to look at an alternate possibility:
> 
> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
> Reset) describes reset use-cases for "a partitioned environment where 
> hardware is migrated from one partition to another" and for "system software 
> is taking down the software stack for a Function and then rebuilding that 
> stack".
> 
> These use-cases seem very similar to transitioning into the crash kernel.  
> The "Implementation Note" at the end of that section describes an algorithm 
> for "Avoiding Data Corruption From Stale Completions" which looks like it 
> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
> of the steps from this algorithm to one of the already proposed patches would 
> avoid the hang that I saw on two platforms.

It seems that the algorithm you mentioned requires four steps.

1. Clear Command register
2. Wait a few milliseconds (Or polling Transactions Pending bit)
3. Do FLR
4. Wait 100 ms

My patch does not do step 1 and 2. So, I would appreciate it if you
could add the following into save_config() in my latest patch and
confirm if kernel still hangs up or not.

subordinate = dev->subordinate;
list_for_each_entry(child, >devices, bus_list) {
dev_info(>dev, "save state\n");
pci_save_state(child);
+   pci_write_config_word(child, PCI_COMMAND, 0);
+   msleep(1000);

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-15 Thread Takao Indoh
(2013/05/16 1:59), Sumner, William wrote:
 Thank you for sending the 2-line addition to your patch that helps me test 
 the suggestions in the PCIe standard.
 
 I started with a new Linux top-of-tree as of last week, added the patch from 
 late April, added these new lines of code, then tested on one of the 
 platforms that showed the problem before.
 
 This addition seems to improve operation on my platform.  Previously the 
 lock-up occurred immediately when the problem PCI card was reset.  With these 
 two lines added, the patch code completed: all resets occurred, all status 
 was restored to PCI cards, and the crash kernel began loading modules.  The 
 kernel then stopped after loading a few modules -- around 5 or 6 modules.  
 This was consistent over a few tries.  Using the same kernel, eliminating 
 only the pci=pcie_reset_devices from the crash kernel's command line, a 
 dump succeeds.  I am still investigating why the crash kernel stops during 
 module loading.
 
 Reading other reset code in Linux: pci_reset_function() writes 
 PCI_COMMAND_INTX_DISABLE into the command register instead of writing 0.  
 This clears all bits and sets the INTX_DISABLE bit.  So I tried that in the 
 patch and the results were the same as in the paragraph above.

Thank you for testing. Hmm, clearing command register is effective but
something else seems to be also needed.

 Two questions:
 1. Since the two added lines prevent the PCIe device from initiating new bus 
 activity including DMA IO, and also prevent the PCIe device from sending 
 interrupts; is this not sufficient (without resetting the PCIe device) to 
 solve the original problem of leftover ongoing DMA IO when the crash kernel 
 resets the hardware IOMMU ?
 
 1a. My guess is that the answer to question 1 is No because some PCIe card 
 may not check the bus master flag just before each bus transaction -- its 
 engineers having interpreted the standard differently than I do.  However, 
 this seems like a question worth asking of the community.

From my experience, my answer is also No.  On certain raid card test,
DMA remapping fault or PCI SERR are still detected after just clearing
bus master bit, though it may be a problem of its driver or hardware.


 2. Linux has a set of functions to reset PCIe devices beginning at  int 
 pci_reset_function(struct pci_dev *dev)  and calling down to 
 __pci_dev_reset  which tries several different methods to reset the PCI 
 device.  Would any of these be useful additions to the proposed patch ?

Good point. When I started this work, at first I made a patch to reset
devices like __pci_dev_reset. As a result, it did not work for a certain
raid card. In this case, pcie_flr() is called in __pci_dev_reset() and
it returned true, but DMAR error occurred after its driver was loaded.

One of ideas is adding boot parameter so user can choose appropriate
method: just clearing bus master, FLR, hot-reset, etc.

Thanks,
Takao Indoh


 
 Bill Sumner
 
 
 
 -Original Message-
 From: Takao Indoh [mailto:indou.ta...@jp.fujitsu.com]
 Sent: Tuesday, May 07, 2013 2:10 AM
 To: Sumner, William
 Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
 linux-kernel@vger.kernel.org; tin...@gmail.com; 
 io...@lists.linux-foundation.org; bhelg...@google.com; ddut...@redhat.com
 Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
 
 Sorry for the delayed response.
 
 (2013/04/30 23:54), Sumner, William wrote:
 I have installed your original patch set (from last November) and tested 
 with three platforms, each with a different IO configuration.  On the first 
 platform crashdumps were consistently successful.  On the second and third 
 platforms, the reset of one specific PCI device on each platform (a 
 different device type on each platform) immediately hung the crash kernel. 
 This was consistent across multiple tries.  Temporarily modifying the patch 
 to skip resetting the one problem-device on each platform allowed each 
 platform to create a dump successfully.

 I am now working with our IO team to determine the exact nature of the hang 
 and to see if the hang is unique to the specific cards or the specific 
 platforms.

 Also I am starting to look at an alternate possibility:

 The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
 Reset) describes reset use-cases for a partitioned environment where 
 hardware is migrated from one partition to another and for system software 
 is taking down the software stack for a Function and then rebuilding that 
 stack.

 These use-cases seem very similar to transitioning into the crash kernel.  
 The Implementation Note at the end of that section describes an algorithm 
 for Avoiding Data Corruption From Stale Completions which looks like it 
 might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
 of the steps from this algorithm to one of the already proposed patches 
 would avoid the hang that I saw on two platforms.
 
 It seems

RE: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-15 Thread Sumner, William
Thank you for sending the 2-line addition to your patch that helps me test the 
suggestions in the PCIe standard.

I started with a new Linux top-of-tree as of last week, added the patch from 
late April, added these new lines of code, then tested on one of the platforms 
that showed the problem before.  

This addition seems to improve operation on my platform.  Previously the 
lock-up occurred immediately when the problem PCI card was reset.  With these 
two lines added, the patch code completed: all resets occurred, all status was 
restored to PCI cards, and the crash kernel began loading modules.  The kernel 
then stopped after loading a few modules -- around 5 or 6 modules.  This was 
consistent over a few tries.  Using the same kernel, eliminating only the 
pci=pcie_reset_devices from the crash kernel's command line, a dump succeeds. 
 I am still investigating why the crash kernel stops during module loading.

Reading other reset code in Linux: pci_reset_function() writes 
PCI_COMMAND_INTX_DISABLE into the command register instead of writing 0.  This 
clears all bits and sets the INTX_DISABLE bit.  So I tried that in the patch 
and the results were the same as in the paragraph above.

Two questions:
1. Since the two added lines prevent the PCIe device from initiating new bus 
activity including DMA IO, and also prevent the PCIe device from sending 
interrupts; is this not sufficient (without resetting the PCIe device) to solve 
the original problem of leftover ongoing DMA IO when the crash kernel resets 
the hardware IOMMU ?

1a. My guess is that the answer to question 1 is No because some PCIe card 
may not check the bus master flag just before each bus transaction -- its 
engineers having interpreted the standard differently than I do.  However, this 
seems like a question worth asking of the community.  


2. Linux has a set of functions to reset PCIe devices beginning at  int 
pci_reset_function(struct pci_dev *dev)  and calling down to __pci_dev_reset 
 which tries several different methods to reset the PCI device.  Would any of 
these be useful additions to the proposed patch ?  

Bill Sumner 



-Original Message-
From: Takao Indoh [mailto:indou.ta...@jp.fujitsu.com] 
Sent: Tuesday, May 07, 2013 2:10 AM
To: Sumner, William
Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
linux-kernel@vger.kernel.org; tin...@gmail.com; 
io...@lists.linux-foundation.org; bhelg...@google.com; ddut...@redhat.com
Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

Sorry for the delayed response.

(2013/04/30 23:54), Sumner, William wrote:
 I have installed your original patch set (from last November) and tested with 
 three platforms, each with a different IO configuration.  On the first 
 platform crashdumps were consistently successful.  On the second and third 
 platforms, the reset of one specific PCI device on each platform (a different 
 device type on each platform) immediately hung the crash kernel. This was 
 consistent across multiple tries.  Temporarily modifying the patch to skip 
 resetting the one problem-device on each platform allowed each platform to 
 create a dump successfully.
 
 I am now working with our IO team to determine the exact nature of the hang 
 and to see if the hang is unique to the specific cards or the specific 
 platforms.
 
 Also I am starting to look at an alternate possibility:
 
 The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
 Reset) describes reset use-cases for a partitioned environment where 
 hardware is migrated from one partition to another and for system software 
 is taking down the software stack for a Function and then rebuilding that 
 stack.
 
 These use-cases seem very similar to transitioning into the crash kernel.  
 The Implementation Note at the end of that section describes an algorithm 
 for Avoiding Data Corruption From Stale Completions which looks like it 
 might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
 of the steps from this algorithm to one of the already proposed patches would 
 avoid the hang that I saw on two platforms.

It seems that the algorithm you mentioned requires four steps.

1. Clear Command register
2. Wait a few milliseconds (Or polling Transactions Pending bit)
3. Do FLR
4. Wait 100 ms

My patch does not do step 1 and 2. So, I would appreciate it if you
could add the following into save_config() in my latest patch and
confirm if kernel still hangs up or not.

subordinate = dev-subordinate;
list_for_each_entry(child, subordinate-devices, bus_list) {
dev_info(child-dev, save state\n);
pci_save_state(child);
+   pci_write_config_word(child, PCI_COMMAND, 0);
+   msleep(1000);
}

Thanks,
Takao Indoh


 
 Bill Sumner
 
 -Original Message-
 From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile
 Sent: Friday, April 26, 2013 8:43 AM
 To: Takao Indoh

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-08 Thread Takao Indoh
(2013/04/26 3:01), Don Dutile wrote:
> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>> (2013/04/25 4:59), Don Dutile wrote:
>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
   and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
   dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

>>> Couple questions wrt VGA device:
>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>is the VGA always function 0, so this scan sees it first&  doesn't
>>>do a reset on that PCIe link?  if the VGA is not function 0, won't
>>>this logic break (will reset b/c function 0 is non-VGA graphics) ?
>>
>> VGA is not reset irrespective of its function number. The logic of this
>> patch is:
>>
>> for_each_pci_dev(dev) {
>>   if (dev is not PCIe)
>>  continue;
>>   if (dev is not root port/downstream port) ---(1)
>>  continue;
>>   list_for_each_entry(child,>subordinate->devices, bus_list) {
>>   if (child is upstream port or bridge or VGA) ---(2)
>>   continue;
>>   }
>>   do_reset_its_child(dev);
>> }
>>
>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>> downstream port) of VGA is also skipped by (2).
>>
>>
>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>assumes it is, and why it isn't blanked.
>>>Q: Should the filter be based on a device having a device-class of 
>>> display ?
>>
>> I want to avoid the situation that user's monitor blacks out and user
>> cannot know what's going on. That's reason why I introduced the logic to
>> skip VGA. As far as I tested the logic based on device-class works well,
> sorry, I read your description, which said VGA, but your are filtering on 
> display class,
> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
> display devices
> are probably one of the most aggressive DMA engines on a system and will 
> grow as
> asymmetric processing using GPUs gets architected into a device-agnostic 
> manner.
> So, this may work well for servers, which is the primary consumer/user of 
> this feature,
> and they typically have built-in graphics that are generally used in simple 
> VGA mode,
> so this may be sufficient for now.
>   
> 
>> but I would appreciate it if there are better ways.
>>
> You probably don't want to hear it but
> a) only turn off cmd-reg master enable bit
> b) only do reset based on a list of devices known not to
> obey their cmd-reg master enable bit, and only do reset to those devices.
> But, given the testing you've done so far, this optional (need cmdline) 
> feature,
> let's start here.
> 
>>>
 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
>>> Thanks for this re-start.  I need to continue reviewing the rest.
>>
>> Thank you for your review!
>>
>>>
>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
>>>   dump?  After the crash dump, the system is rebooting to previous 
>>> (iommu=on) setting.
>>>   That logic, along w/your previous patch to disable the IOMMU if 
>>> iommu=off
>>>   is set, would remove this (relatively slow) PCI init sequencing ?
>>
>> To force iommu off, all ongoing DMA have to be stopped before that since
>> they are accessing the device address, not physical address. If we disable
>> iommu without stopping in-flihgt DMA, devices access invalid memory area
>> and it causes memory corruption or PCI-SERR due to DMA error.
> Right, that's a 'duh' on my part.
> I thought 'disable iommu' == 'block all dma' and it just turns it off &
> let's the ongoing DMA run...
> Please ignore this question... 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-08 Thread Takao Indoh
(2013/05/08 7:04), Alex Williamson wrote:
> On Tue, 2013-05-07 at 16:10 -0400, Don Dutile wrote:
>> On 05/07/2013 12:39 PM, Alex Williamson wrote:
>>> On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
  and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
  dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Previous post:
 [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
 https://lkml.org/lkml/2012/11/26/814

 Signed-off-by: Takao Indoh
 ---
Documentation/kernel-parameters.txt |2 +
drivers/pci/pci.c   |  103 
 +++
2 files changed, 105 insertions(+), 0 deletions(-)

 diff --git a/Documentation/kernel-parameters.txt 
 b/Documentation/kernel-parameters.txt
 index 4609e81..2a31ade 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
 entirely omitted.
any pair of devices, possibly at the 
 cost of
reduced performance.  This also 
 guarantees
that hot-added devices will work.
 +  pcie_reset_devices  Reset PCIe endpoint on boot by hot
 +  reset
cbiosize=nn[KMG]The fixed amount of bus space 
 which is
reserved for the CardBus bridge's IO 
 window.
The default value is 256 bytes.
 diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
 index b099e00..42385c9 100644
 --- a/drivers/pci/pci.c
 +++ b/drivers/pci/pci.c
 @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
}
EXPORT_SYMBOL(pci_fixup_cardbus);

 +/*
 + * Return true if dev is PCIe root port or downstream port whose child is 
 PCIe
 + * endpoint except VGA device.
 + */
 +static int __init need_reset(struct pci_dev *dev)
 +{
 +  struct pci_bus *subordinate;
 +  struct pci_dev *child;
 +
 +  if (!pci_is_pcie(dev) || !dev->subordinate ||
 +  list_empty(>subordinate->devices) ||
 +  ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
 +   (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
 +  return 0;
 +
 +  subordinate = dev->subordinate;
 +  list_for_each_entry(child,>devices, bus_list) {
 +  if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
 +  (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
 +  ((child->class>>  16) == PCI_BASE_CLASS_DISPLAY))
 +  /* Don't reset switch, bridge, VGA device */
 +  return 0;
 +  }
 +
 +  return 1;
 +}
 +
 +static void __init save_config(struct pci_dev *dev)
 +{
 +  struct pci_bus *subordinate;
 +  struct pci_dev *child;
 +
 +  if (!need_reset(dev))
 +  return;
 +
 +  subordinate = dev->subordinate;
 +  list_for_each_entry(child,>devices, bus_list) {
 +  dev_info(>dev, "save state\n");
 +  pci_save_state(child);
 +  }
 +}
 +
 +static void __init restore_config(struct pci_dev *dev)
 +{
 +  struct pci_bus *subordinate;
 +  struct pci_dev *child;
 +
 +  if (!need_reset(dev))
 +  return;
 +
 +  subordinate = 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-08 Thread Takao Indoh
(2013/05/07 21:50), Don Dutile wrote:
> On 05/07/2013 03:09 AM, Takao Indoh wrote:
>> Sorry for the delayed response.
>>
>> (2013/04/30 23:54), Sumner, William wrote:
>>> I have installed your original patch set (from last November) and tested 
>>> with three platforms, each with a different IO configuration.  On the first 
>>> platform crashdumps were consistently successful.  On the second and third 
>>> platforms, the reset of one specific PCI device on each platform (a 
>>> different device type on each platform) immediately hung the crash kernel. 
>>> This was consistent across multiple tries.  Temporarily modifying the patch 
>>> to skip resetting the one problem-device on each platform allowed each 
>>> platform to create a dump successfully.
>>>
>>> I am now working with our IO team to determine the exact nature of the hang 
>>> and to see if the hang is unique to the specific cards or the specific 
>>> platforms.
>>>
>>> Also I am starting to look at an alternate possibility:
>>>
>>> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
>>> Reset) describes reset use-cases for "a partitioned environment where 
>>> hardware is migrated from one partition to another" and for "system 
>>> software is taking down the software stack for a Function and then 
>>> rebuilding that stack".
>>>
>>> These use-cases seem very similar to transitioning into the crash kernel.  
>>> The "Implementation Note" at the end of that section describes an algorithm 
>>> for "Avoiding Data Corruption From Stale Completions" which looks like it 
>>> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
>>> of the steps from this algorithm to one of the already proposed patches 
>>> would avoid the hang that I saw on two platforms.
>>
>> It seems that the algorithm you mentioned requires four steps.
>>
>> 1. Clear Command register
>> 2. Wait a few milliseconds (Or polling Transactions Pending bit)
>> 3. Do FLR
>> 4. Wait 100 ms
>>
>> My patch does not do step 1 and 2. So, I would appreciate it if you
>> could add the following into save_config() in my latest patch and
>> confirm if kernel still hangs up or not.
>>
>>   subordinate = dev->subordinate;
>>   list_for_each_entry(child,>devices, bus_list) {
>>   dev_info(>dev, "save state\n");
>>   pci_save_state(child);
>> +   pci_write_config_word(child, PCI_COMMAND, 0);
>> +   msleep(1000);
> As Linus pointed out in an earlier patch, msleep() after each device
> is s-l-o-w and unnecessary; should do a single msleep(1000) after
> *all* the command registers are written.  That way, the added delay is 1sec,
> not 1sec*ndevs.

Yeah, of course. I added these two lines just to confirm if clearing
command register resolves Bill's problem. They are tentative things.

> q: is this turning off command register for only PCI(e) endpoints?
> -- shouldn't have to turn off command register in bridges/switches.

Yes, only clearing register of endpoints because kdump fails due to
following error when turning off command register of switch as well.

Fusion MPT SAS Host driver 3.04.20
mptbase: ioc0: Initiating bringup
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: ERROR - Failed to come READY after reset! IocState=f000
mptbase: ioc0: WARNING - ResetHistory bit failed to clear!
mptbase: ioc0: ERROR - Diagnostic reset FAILED! (h)
mptbase: ioc0: WARNING - NOT READY WARNING!
mptbase: ioc0: ERROR - didn't initialize properly! (-1)
mptsas: probe of :50:00.0 failed with error -1

Thanks,
Takao Indoh

> 
>>       }
>>
>> Thanks,
>> Takao Indoh
>>
>>
>>>
>>> Bill Sumner
>>>
>>> -Original Message-
>>> From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don 
>>> Dutile
>>> Sent: Friday, April 26, 2013 8:43 AM
>>> To: Takao Indoh
>>> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
>>> linux-kernel@vger.kernel.org; tin...@gmail.com; 
>>> io...@lists.linux-foundation.org; bhelg...@google.com
>>> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
>>>
>>> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>>>> (2013/04/26 3:01), Don Dutile wrote:
>>>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>>>> On 04

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-08 Thread Takao Indoh
(2013/05/07 21:50), Don Dutile wrote:
 On 05/07/2013 03:09 AM, Takao Indoh wrote:
 Sorry for the delayed response.

 (2013/04/30 23:54), Sumner, William wrote:
 I have installed your original patch set (from last November) and tested 
 with three platforms, each with a different IO configuration.  On the first 
 platform crashdumps were consistently successful.  On the second and third 
 platforms, the reset of one specific PCI device on each platform (a 
 different device type on each platform) immediately hung the crash kernel. 
 This was consistent across multiple tries.  Temporarily modifying the patch 
 to skip resetting the one problem-device on each platform allowed each 
 platform to create a dump successfully.

 I am now working with our IO team to determine the exact nature of the hang 
 and to see if the hang is unique to the specific cards or the specific 
 platforms.

 Also I am starting to look at an alternate possibility:

 The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
 Reset) describes reset use-cases for a partitioned environment where 
 hardware is migrated from one partition to another and for system 
 software is taking down the software stack for a Function and then 
 rebuilding that stack.

 These use-cases seem very similar to transitioning into the crash kernel.  
 The Implementation Note at the end of that section describes an algorithm 
 for Avoiding Data Corruption From Stale Completions which looks like it 
 might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
 of the steps from this algorithm to one of the already proposed patches 
 would avoid the hang that I saw on two platforms.

 It seems that the algorithm you mentioned requires four steps.

 1. Clear Command register
 2. Wait a few milliseconds (Or polling Transactions Pending bit)
 3. Do FLR
 4. Wait 100 ms

 My patch does not do step 1 and 2. So, I would appreciate it if you
 could add the following into save_config() in my latest patch and
 confirm if kernel still hangs up or not.

   subordinate = dev-subordinate;
   list_for_each_entry(child,subordinate-devices, bus_list) {
   dev_info(child-dev, save state\n);
   pci_save_state(child);
 +   pci_write_config_word(child, PCI_COMMAND, 0);
 +   msleep(1000);
 As Linus pointed out in an earlier patch, msleep() after each device
 is s-l-o-w and unnecessary; should do a single msleep(1000) after
 *all* the command registers are written.  That way, the added delay is 1sec,
 not 1sec*ndevs.

Yeah, of course. I added these two lines just to confirm if clearing
command register resolves Bill's problem. They are tentative things.

 q: is this turning off command register for only PCI(e) endpoints?
 -- shouldn't have to turn off command register in bridges/switches.

Yes, only clearing register of endpoints because kdump fails due to
following error when turning off command register of switch as well.

Fusion MPT SAS Host driver 3.04.20
mptbase: ioc0: Initiating bringup
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: ERROR - Failed to come READY after reset! IocState=f000
mptbase: ioc0: WARNING - ResetHistory bit failed to clear!
mptbase: ioc0: ERROR - Diagnostic reset FAILED! (h)
mptbase: ioc0: WARNING - NOT READY WARNING!
mptbase: ioc0: ERROR - didn't initialize properly! (-1)
mptsas: probe of :50:00.0 failed with error -1

Thanks,
Takao Indoh

 
   }

 Thanks,
 Takao Indoh



 Bill Sumner

 -Original Message-
 From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don 
 Dutile
 Sent: Friday, April 26, 2013 8:43 AM
 To: Takao Indoh
 Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
 linux-kernel@vger.kernel.org; tin...@gmail.com; 
 io...@lists.linux-foundation.org; bhelg...@google.com
 Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

 On 04/25/2013 11:10 PM, Takao Indoh wrote:
 (2013/04/26 3:01), Don Dutile wrote:
 On 04/25/2013 01:11 AM, Takao Indoh wrote:
 (2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on 
 is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
   and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA 
 causes
   dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-08 Thread Takao Indoh
(2013/05/08 7:04), Alex Williamson wrote:
 On Tue, 2013-05-07 at 16:10 -0400, Don Dutile wrote:
 On 05/07/2013 12:39 PM, Alex Williamson wrote:
 On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
  and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
  dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Previous post:
 [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
 https://lkml.org/lkml/2012/11/26/814

 Signed-off-by: Takao Indohindou.ta...@jp.fujitsu.com
 ---
Documentation/kernel-parameters.txt |2 +
drivers/pci/pci.c   |  103 
 +++
2 files changed, 105 insertions(+), 0 deletions(-)

 diff --git a/Documentation/kernel-parameters.txt 
 b/Documentation/kernel-parameters.txt
 index 4609e81..2a31ade 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
 entirely omitted.
any pair of devices, possibly at the 
 cost of
reduced performance.  This also 
 guarantees
that hot-added devices will work.
 +  pcie_reset_devices  Reset PCIe endpoint on boot by hot
 +  reset
cbiosize=nn[KMG]The fixed amount of bus space 
 which is
reserved for the CardBus bridge's IO 
 window.
The default value is 256 bytes.
 diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
 index b099e00..42385c9 100644
 --- a/drivers/pci/pci.c
 +++ b/drivers/pci/pci.c
 @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
}
EXPORT_SYMBOL(pci_fixup_cardbus);

 +/*
 + * Return true if dev is PCIe root port or downstream port whose child is 
 PCIe
 + * endpoint except VGA device.
 + */
 +static int __init need_reset(struct pci_dev *dev)
 +{
 +  struct pci_bus *subordinate;
 +  struct pci_dev *child;
 +
 +  if (!pci_is_pcie(dev) || !dev-subordinate ||
 +  list_empty(dev-subordinate-devices) ||
 +  ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)
 +   (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
 +  return 0;
 +
 +  subordinate = dev-subordinate;
 +  list_for_each_entry(child,subordinate-devices, bus_list) {
 +  if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
 +  (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
 +  ((child-class  16) == PCI_BASE_CLASS_DISPLAY))
 +  /* Don't reset switch, bridge, VGA device */
 +  return 0;
 +  }
 +
 +  return 1;
 +}
 +
 +static void __init save_config(struct pci_dev *dev)
 +{
 +  struct pci_bus *subordinate;
 +  struct pci_dev *child;
 +
 +  if (!need_reset(dev))
 +  return;
 +
 +  subordinate = dev-subordinate;
 +  list_for_each_entry(child,subordinate-devices, bus_list) {
 +  dev_info(child-dev, save state\n);
 +  pci_save_state(child);
 +  }
 +}
 +
 +static void __init restore_config(struct pci_dev *dev)
 +{
 +  struct pci_bus *subordinate;
 +  struct pci_dev *child;
 +
 +  if (!need_reset(dev))
 +  return;
 +
 +  subordinate = dev-subordinate;
 +  list_for_each_entry(child,subordinate-devices, bus_list) {
 +  dev_info(child-dev, restore state\n);
 +  pci_restore_state(child);
 +  }
 +}
 +
 +static void __init do_device_reset(struct pci_dev *dev)
 +{
 +  u16 ctrl;
 +
 +  if (!need_reset(dev))
 +  return;
 +
 +  dev_info(dev-dev, Reset Secondary bus\n);
 +
 +  /* Assert Secondary Bus Reset */
 +  pci_read_config_word(dev, 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-08 Thread Takao Indoh
(2013/04/26 3:01), Don Dutile wrote:
 On 04/25/2013 01:11 AM, Takao Indoh wrote:
 (2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
   and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
   dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Couple questions wrt VGA device:
 (1) Many graphics devices are multi-function, one function being VGA;
is the VGA always function 0, so this scan sees it first  doesn't
do a reset on that PCIe link?  if the VGA is not function 0, won't
this logic break (will reset b/c function 0 is non-VGA graphics) ?

 VGA is not reset irrespective of its function number. The logic of this
 patch is:

 for_each_pci_dev(dev) {
   if (dev is not PCIe)
  continue;
   if (dev is not root port/downstream port) ---(1)
  continue;
   list_for_each_entry(child,dev-subordinate-devices, bus_list) {
   if (child is upstream port or bridge or VGA) ---(2)
   continue;
   }
   do_reset_its_child(dev);
 }

 Therefore VGA itself is skipped by (1), and upstream device(root port or
 downstream port) of VGA is also skipped by (2).


 (2) I'm hearing VGA will soon not be the a required console; this logic
assumes it is, and why it isn't blanked.
Q: Should the filter be based on a device having a device-class of 
 display ?

 I want to avoid the situation that user's monitor blacks out and user
 cannot know what's going on. That's reason why I introduced the logic to
 skip VGA. As far as I tested the logic based on device-class works well,
 sorry, I read your description, which said VGA, but your are filtering on 
 display class,
 which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
 display devices
 are probably one of the most aggressive DMA engines on a system and will 
 grow as
 asymmetric processing using GPUs gets architected into a device-agnostic 
 manner.
 So, this may work well for servers, which is the primary consumer/user of 
 this feature,
 and they typically have built-in graphics that are generally used in simple 
 VGA mode,
 so this may be sufficient for now.
   
 
 but I would appreciate it if there are better ways.

 You probably don't want to hear it but
 a) only turn off cmd-reg master enable bit
 b) only do reset based on a list of devices known not to
 obey their cmd-reg master enable bit, and only do reset to those devices.
 But, given the testing you've done so far, this optional (need cmdline) 
 feature,
 let's start here.
 

 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Thanks for this re-start.  I need to continue reviewing the rest.

 Thank you for your review!


 Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
   dump?  After the crash dump, the system is rebooting to previous 
 (iommu=on) setting.
   That logic, along w/your previous patch to disable the IOMMU if 
 iommu=off
   is set, would remove this (relatively slow) PCI init sequencing ?

 To force iommu off, all ongoing DMA have to be stopped before that since
 they are accessing the device address, not physical address. If we disable
 iommu without stopping in-flihgt DMA, devices access invalid memory area
 and it causes memory corruption or PCI-SERR due to DMA error.
 Right, that's a 'duh' on my part.
 I thought 'disable iommu' == 'block all dma' and it just turns it off 
 let's the ongoing DMA run...
 Please ignore this question... sigh.
 

 So, whether we use iommu or not in second kernel, we have to stop DMA in
 second kernel if iommu is used in first kernel.

 Thanks,
 Takao Indoh



 Previous post:
 [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Alex Williamson
On Tue, 2013-05-07 at 16:10 -0400, Don Dutile wrote:
> On 05/07/2013 12:39 PM, Alex Williamson wrote:
> > On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:
> >> This patch resets PCIe devices on boot to stop ongoing DMA. When
> >> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
> >> PCIe root port and downstream port to reset its downstream endpoint.
> >>
> >> Problem:
> >> This patch solves the problem that kdump can fail when intel_iommu=on is
> >> specified. When intel_iommu=on is specified, many dma-remapping errors
> >> occur in second kernel and it causes problems like driver error or PCI
> >> SERR, at last kdump fails. This problem is caused as follows.
> >> 1) Devices are working on first kernel.
> >> 2) Switch to second kernel(kdump kernel). The devices are still working
> >> and its DMA continues during this switch.
> >> 3) iommu is initialized during second kernel boot and ongoing DMA causes
> >> dma-remapping errors.
> >>
> >> Solution:
> >> All DMA transactions have to be stopped before iommu is initialized. By
> >> this patch devices are reset and in-flight DMA is stopped before
> >> pci_iommu_init.
> >>
> >> To invoke hot reset on an endpoint, its upstream link need to be reset.
> >> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
> >> port/downstream port whose child is PCIe endpoint, and then reset link
> >> between them. If the endpoint is VGA device, it is skipped because the
> >> monitor blacks out if VGA controller is reset.
> >>
> >> Actually this is v8 patch but quite different from v7 and it's been so
> >> long since previous post, so I start over again.
> >> Previous post:
> >> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
> >> https://lkml.org/lkml/2012/11/26/814
> >>
> >> Signed-off-by: Takao Indoh
> >> ---
> >>   Documentation/kernel-parameters.txt |2 +
> >>   drivers/pci/pci.c   |  103 
> >> +++
> >>   2 files changed, 105 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/Documentation/kernel-parameters.txt 
> >> b/Documentation/kernel-parameters.txt
> >> index 4609e81..2a31ade 100644
> >> --- a/Documentation/kernel-parameters.txt
> >> +++ b/Documentation/kernel-parameters.txt
> >> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
> >> entirely omitted.
> >>any pair of devices, possibly at the cost of
> >>reduced performance.  This also guarantees
> >>that hot-added devices will work.
> >> +  pcie_reset_devices  Reset PCIe endpoint on boot by hot
> >> +  reset
> >>cbiosize=nn[KMG]The fixed amount of bus space which is
> >>reserved for the CardBus bridge's IO window.
> >>The default value is 256 bytes.
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index b099e00..42385c9 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
> >>   }
> >>   EXPORT_SYMBOL(pci_fixup_cardbus);
> >>
> >> +/*
> >> + * Return true if dev is PCIe root port or downstream port whose child is 
> >> PCIe
> >> + * endpoint except VGA device.
> >> + */
> >> +static int __init need_reset(struct pci_dev *dev)
> >> +{
> >> +  struct pci_bus *subordinate;
> >> +  struct pci_dev *child;
> >> +
> >> +  if (!pci_is_pcie(dev) || !dev->subordinate ||
> >> +  list_empty(>subordinate->devices) ||
> >> +  ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
> >> +   (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> >> +  return 0;
> >> +
> >> +  subordinate = dev->subordinate;
> >> +  list_for_each_entry(child,>devices, bus_list) {
> >> +  if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> >> +  (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> >> +  ((child->class>>  16) == PCI_BASE_CLASS_DISPLAY))
> >> +  /* Don't reset switch, bridge, VGA device */
> >> +  return 0;
> >> +  }
> >> +
> >> +  return 1;
> >> +}
> >> +
> >> +static void __init save_config(struct pci_dev *dev)
> >> +{
> >> +  struct pci_bus *subordinate;
> >> +  struct pci_dev *child;
> >> +
> >> +  if (!need_reset(dev))
> >> +  return;
> >> +
> >> +  subordinate = dev->subordinate;
> >> +  list_for_each_entry(child,>devices, bus_list) {
> >> +  dev_info(>dev, "save state\n");
> >> +  pci_save_state(child);
> >> +  }
> >> +}
> >> +
> >> +static void __init restore_config(struct pci_dev *dev)
> >> +{
> >> +  struct pci_bus *subordinate;
> >> +  struct pci_dev *child;
> >> +
> >> +  if (!need_reset(dev))
> >> +  return;
> >> +
> >> +  subordinate = dev->subordinate;
> >> +  list_for_each_entry(child,>devices, bus_list) {
> >> +  dev_info(>dev, "restore state\n");
> >> +

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Don Dutile

On 05/07/2013 12:39 PM, Alex Williamson wrote:

On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_devices" is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.

Actually this is v8 patch but quite different from v7 and it's been so
long since previous post, so I start over again.
Previous post:
[PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
https://lkml.org/lkml/2012/11/26/814

Signed-off-by: Takao Indoh
---
  Documentation/kernel-parameters.txt |2 +
  drivers/pci/pci.c   |  103 +++
  2 files changed, 105 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4609e81..2a31ade 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
any pair of devices, possibly at the cost of
reduced performance.  This also guarantees
that hot-added devices will work.
+   pcie_reset_devices  Reset PCIe endpoint on boot by hot
+   reset
cbiosize=nn[KMG]The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b099e00..42385c9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
  }
  EXPORT_SYMBOL(pci_fixup_cardbus);

+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!pci_is_pcie(dev) || !dev->subordinate ||
+   list_empty(>subordinate->devices) ||
+   ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
+(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+   return 0;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,>devices, bus_list) {
+   if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+   ((child->class>>  16) == PCI_BASE_CLASS_DISPLAY))
+   /* Don't reset switch, bridge, VGA device */
+   return 0;
+   }
+
+   return 1;
+}
+
+static void __init save_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,>devices, bus_list) {
+   dev_info(>dev, "save state\n");
+   pci_save_state(child);
+   }
+}
+
+static void __init restore_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,>devices, bus_list) {
+   dev_info(>dev, "restore state\n");
+   pci_restore_state(child);
+   }
+}
+
+static void __init do_device_reset(struct pci_dev *dev)
+{
+   u16 ctrl;
+
+   if (!need_reset(dev))
+   return;
+
+   dev_info(>dev, "Reset Secondary bus\n");
+
+   /* Assert Secondary Bus Reset */
+   pci_read_config_word(dev, PCI_BRIDGE_CONTROL,);
+   ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+   pci_write_config_word(dev, PCI_BRIDGE_CONTROL, 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Alex Williamson
On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
> PCIe root port and downstream port to reset its downstream endpoint.
> 
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>dma-remapping errors.
> 
> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
> 
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
> port/downstream port whose child is PCIe endpoint, and then reset link
> between them. If the endpoint is VGA device, it is skipped because the
> monitor blacks out if VGA controller is reset.
> 
> Actually this is v8 patch but quite different from v7 and it's been so
> long since previous post, so I start over again.
> Previous post:
> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
> https://lkml.org/lkml/2012/11/26/814
> 
> Signed-off-by: Takao Indoh 
> ---
>  Documentation/kernel-parameters.txt |2 +
>  drivers/pci/pci.c   |  103 
> +++
>  2 files changed, 105 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 4609e81..2a31ade 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   any pair of devices, possibly at the cost of
>   reduced performance.  This also guarantees
>   that hot-added devices will work.
> + pcie_reset_devices  Reset PCIe endpoint on boot by hot
> + reset
>   cbiosize=nn[KMG]The fixed amount of bus space which is
>   reserved for the CardBus bridge's IO window.
>   The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b099e00..42385c9 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_fixup_cardbus);
>  
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is 
> PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!pci_is_pcie(dev) || !dev->subordinate ||
> + list_empty(>subordinate->devices) ||
> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +  (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> + return 0;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, >devices, bus_list) {
> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> + /* Don't reset switch, bridge, VGA device */
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> +static void __init save_config(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!need_reset(dev))
> + return;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, >devices, bus_list) {
> + dev_info(>dev, "save state\n");
> + pci_save_state(child);
> + }
> +}
> +
> +static void __init restore_config(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!need_reset(dev))
> + return;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, >devices, bus_list) {
> + dev_info(>dev, "restore state\n");
> + pci_restore_state(child);
> + }
> +}
> +
> +static void __init do_device_reset(struct pci_dev *dev)
> +{
> + u16 ctrl;
> +
> + if (!need_reset(dev))
> + return;
> +
> + dev_info(>dev, "Reset Secondary bus\n");
> +
> + /* Assert Secondary Bus Reset */
> + pci_read_config_word(dev, 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Don Dutile
On 05/07/2013 03:09 AM, Takao Indoh wrote:
> Sorry for the delayed response.
> 
> (2013/04/30 23:54), Sumner, William wrote:
>> I have installed your original patch set (from last November) and tested 
>> with three platforms, each with a different IO configuration.  On the first 
>> platform crashdumps were consistently successful.  On the second and third 
>> platforms, the reset of one specific PCI device on each platform (a 
>> different device type on each platform) immediately hung the crash kernel. 
>> This was consistent across multiple tries.  Temporarily modifying the patch 
>> to skip resetting the one problem-device on each platform allowed each 
>> platform to create a dump successfully.
>>
>> I am now working with our IO team to determine the exact nature of the hang 
>> and to see if the hang is unique to the specific cards or the specific 
>> platforms.
>>
>> Also I am starting to look at an alternate possibility:
>>
>> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
>> Reset) describes reset use-cases for "a partitioned environment where 
>> hardware is migrated from one partition to another" and for "system software 
>> is taking down the software stack for a Function and then rebuilding that 
>> stack".
>>
>> These use-cases seem very similar to transitioning into the crash kernel.  
>> The "Implementation Note" at the end of that section describes an algorithm 
>> for "Avoiding Data Corruption From Stale Completions" which looks like it 
>> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
>> of the steps from this algorithm to one of the already proposed patches 
>> would avoid the hang that I saw on two platforms.
> 
> It seems that the algorithm you mentioned requires four steps.
> 
> 1. Clear Command register
> 2. Wait a few milliseconds (Or polling Transactions Pending bit)
> 3. Do FLR
> 4. Wait 100 ms
> 
> My patch does not do step 1 and 2. So, I would appreciate it if you
> could add the following into save_config() in my latest patch and
> confirm if kernel still hangs up or not.
> 
>  subordinate = dev->subordinate;
>  list_for_each_entry(child,>devices, bus_list) {
>  dev_info(>dev, "save state\n");
>  pci_save_state(child);
> +   pci_write_config_word(child, PCI_COMMAND, 0);
> +   msleep(1000);
As Linus pointed out in an earlier patch, msleep() after each device
is s-l-o-w and unnecessary; should do a single msleep(1000) after
*all* the command registers are written.  That way, the added delay is 1sec,
not 1sec*ndevs.
q: is this turning off command register for only PCI(e) endpoints?
-- shouldn't have to turn off command register in bridges/switches.

>  }
> 
> Thanks,
> Takao Indoh
> 
> 
>>
>> Bill Sumner
>>
>> -Original Message-
>> From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don 
>> Dutile
>> Sent: Friday, April 26, 2013 8:43 AM
>> To: Takao Indoh
>> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
>> linux-kernel@vger.kernel.org; tin...@gmail.com; 
>> io...@lists.linux-foundation.org; bhelg...@google.com
>> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
>>
>> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>>> (2013/04/26 3:01), Don Dutile wrote:
>>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>>>
>>>>>>> Problem:
>>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>>>> 1) Devices are working on first kernel.
>>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>>>  and its DMA continues during this switch.
>>>>>>> 3) iommu is initialized during second kernel boot and

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Takao Indoh
Sorry for the delayed response.

(2013/04/30 23:54), Sumner, William wrote:
> I have installed your original patch set (from last November) and tested with 
> three platforms, each with a different IO configuration.  On the first 
> platform crashdumps were consistently successful.  On the second and third 
> platforms, the reset of one specific PCI device on each platform (a different 
> device type on each platform) immediately hung the crash kernel. This was 
> consistent across multiple tries.  Temporarily modifying the patch to skip 
> resetting the one problem-device on each platform allowed each platform to 
> create a dump successfully.
> 
> I am now working with our IO team to determine the exact nature of the hang 
> and to see if the hang is unique to the specific cards or the specific 
> platforms.
> 
> Also I am starting to look at an alternate possibility:
> 
> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
> Reset) describes reset use-cases for "a partitioned environment where 
> hardware is migrated from one partition to another" and for "system software 
> is taking down the software stack for a Function and then rebuilding that 
> stack".
> 
> These use-cases seem very similar to transitioning into the crash kernel.  
> The "Implementation Note" at the end of that section describes an algorithm 
> for "Avoiding Data Corruption From Stale Completions" which looks like it 
> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
> of the steps from this algorithm to one of the already proposed patches would 
> avoid the hang that I saw on two platforms.

It seems that the algorithm you mentioned requires four steps.

1. Clear Command register
2. Wait a few milliseconds (Or polling Transactions Pending bit)
3. Do FLR
4. Wait 100 ms

My patch does not do step 1 and 2. So, I would appreciate it if you
could add the following into save_config() in my latest patch and
confirm if kernel still hangs up or not.

subordinate = dev->subordinate;
list_for_each_entry(child, >devices, bus_list) {
dev_info(>dev, "save state\n");
pci_save_state(child);
+   pci_write_config_word(child, PCI_COMMAND, 0);
+   msleep(1000);
}

Thanks,
Takao Indoh


> 
> Bill Sumner
> 
> -Original Message-
> From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile
> Sent: Friday, April 26, 2013 8:43 AM
> To: Takao Indoh
> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
> linux-kernel@vger.kernel.org; tin...@gmail.com; 
> io...@lists.linux-foundation.org; bhelg...@google.com
> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
> 
> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>> (2013/04/26 3:01), Don Dutile wrote:
>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>>
>>>>>> Problem:
>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>>> 1) Devices are working on first kernel.
>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>> and its DMA continues during this switch.
>>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>>>> dma-remapping errors.
>>>>>>
>>>>>> Solution:
>>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>>> pci_iommu_init.
>>>>>>
>>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>>> between them. If the endpoint is VGA device, it is skipped because the
>

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Alex Williamson
On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.
 
 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.
 
 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.
 
 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.
 
 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Previous post:
 [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
 https://lkml.org/lkml/2012/11/26/814
 
 Signed-off-by: Takao Indoh indou.ta...@jp.fujitsu.com
 ---
  Documentation/kernel-parameters.txt |2 +
  drivers/pci/pci.c   |  103 
 +++
  2 files changed, 105 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/kernel-parameters.txt 
 b/Documentation/kernel-parameters.txt
 index 4609e81..2a31ade 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
 entirely omitted.
   any pair of devices, possibly at the cost of
   reduced performance.  This also guarantees
   that hot-added devices will work.
 + pcie_reset_devices  Reset PCIe endpoint on boot by hot
 + reset
   cbiosize=nn[KMG]The fixed amount of bus space which is
   reserved for the CardBus bridge's IO window.
   The default value is 256 bytes.
 diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
 index b099e00..42385c9 100644
 --- a/drivers/pci/pci.c
 +++ b/drivers/pci/pci.c
 @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
  }
  EXPORT_SYMBOL(pci_fixup_cardbus);
  
 +/*
 + * Return true if dev is PCIe root port or downstream port whose child is 
 PCIe
 + * endpoint except VGA device.
 + */
 +static int __init need_reset(struct pci_dev *dev)
 +{
 + struct pci_bus *subordinate;
 + struct pci_dev *child;
 +
 + if (!pci_is_pcie(dev) || !dev-subordinate ||
 + list_empty(dev-subordinate-devices) ||
 + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) 
 +  (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
 + return 0;
 +
 + subordinate = dev-subordinate;
 + list_for_each_entry(child, subordinate-devices, bus_list) {
 + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
 + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
 + ((child-class  16) == PCI_BASE_CLASS_DISPLAY))
 + /* Don't reset switch, bridge, VGA device */
 + return 0;
 + }
 +
 + return 1;
 +}
 +
 +static void __init save_config(struct pci_dev *dev)
 +{
 + struct pci_bus *subordinate;
 + struct pci_dev *child;
 +
 + if (!need_reset(dev))
 + return;
 +
 + subordinate = dev-subordinate;
 + list_for_each_entry(child, subordinate-devices, bus_list) {
 + dev_info(child-dev, save state\n);
 + pci_save_state(child);
 + }
 +}
 +
 +static void __init restore_config(struct pci_dev *dev)
 +{
 + struct pci_bus *subordinate;
 + struct pci_dev *child;
 +
 + if (!need_reset(dev))
 + return;
 +
 + subordinate = dev-subordinate;
 + list_for_each_entry(child, subordinate-devices, bus_list) {
 + dev_info(child-dev, restore state\n);
 + pci_restore_state(child);
 + }
 +}
 +
 +static void __init do_device_reset(struct pci_dev *dev)
 +{
 + u16 ctrl;
 +
 + if (!need_reset(dev))
 + return;
 +
 + dev_info(dev-dev, Reset Secondary bus\n);
 +
 + /* Assert Secondary Bus Reset */
 + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
 + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
 + 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Don Dutile

On 05/07/2013 12:39 PM, Alex Williamson wrote:

On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
pci=pcie_reset_devices is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.

Actually this is v8 patch but quite different from v7 and it's been so
long since previous post, so I start over again.
Previous post:
[PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
https://lkml.org/lkml/2012/11/26/814

Signed-off-by: Takao Indohindou.ta...@jp.fujitsu.com
---
  Documentation/kernel-parameters.txt |2 +
  drivers/pci/pci.c   |  103 +++
  2 files changed, 105 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4609e81..2a31ade 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
any pair of devices, possibly at the cost of
reduced performance.  This also guarantees
that hot-added devices will work.
+   pcie_reset_devices  Reset PCIe endpoint on boot by hot
+   reset
cbiosize=nn[KMG]The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b099e00..42385c9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
  }
  EXPORT_SYMBOL(pci_fixup_cardbus);

+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!pci_is_pcie(dev) || !dev-subordinate ||
+   list_empty(dev-subordinate-devices) ||
+   ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)
+(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+   return 0;
+
+   subordinate = dev-subordinate;
+   list_for_each_entry(child,subordinate-devices, bus_list) {
+   if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+   ((child-class  16) == PCI_BASE_CLASS_DISPLAY))
+   /* Don't reset switch, bridge, VGA device */
+   return 0;
+   }
+
+   return 1;
+}
+
+static void __init save_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev-subordinate;
+   list_for_each_entry(child,subordinate-devices, bus_list) {
+   dev_info(child-dev, save state\n);
+   pci_save_state(child);
+   }
+}
+
+static void __init restore_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev-subordinate;
+   list_for_each_entry(child,subordinate-devices, bus_list) {
+   dev_info(child-dev, restore state\n);
+   pci_restore_state(child);
+   }
+}
+
+static void __init do_device_reset(struct pci_dev *dev)
+{
+   u16 ctrl;
+
+   if (!need_reset(dev))
+   return;
+
+   dev_info(dev-dev, Reset Secondary bus\n);
+
+   /* Assert Secondary Bus Reset */
+   pci_read_config_word(dev, PCI_BRIDGE_CONTROL,ctrl);
+   ctrl |= 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Alex Williamson
On Tue, 2013-05-07 at 16:10 -0400, Don Dutile wrote:
 On 05/07/2013 12:39 PM, Alex Williamson wrote:
  On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:
  This patch resets PCIe devices on boot to stop ongoing DMA. When
  pci=pcie_reset_devices is specified, a hot reset is triggered on each
  PCIe root port and downstream port to reset its downstream endpoint.
 
  Problem:
  This patch solves the problem that kdump can fail when intel_iommu=on is
  specified. When intel_iommu=on is specified, many dma-remapping errors
  occur in second kernel and it causes problems like driver error or PCI
  SERR, at last kdump fails. This problem is caused as follows.
  1) Devices are working on first kernel.
  2) Switch to second kernel(kdump kernel). The devices are still working
  and its DMA continues during this switch.
  3) iommu is initialized during second kernel boot and ongoing DMA causes
  dma-remapping errors.
 
  Solution:
  All DMA transactions have to be stopped before iommu is initialized. By
  this patch devices are reset and in-flight DMA is stopped before
  pci_iommu_init.
 
  To invoke hot reset on an endpoint, its upstream link need to be reset.
  reset_pcie_devices() is called from fs_initcall_sync, and it finds root
  port/downstream port whose child is PCIe endpoint, and then reset link
  between them. If the endpoint is VGA device, it is skipped because the
  monitor blacks out if VGA controller is reset.
 
  Actually this is v8 patch but quite different from v7 and it's been so
  long since previous post, so I start over again.
  Previous post:
  [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
  https://lkml.org/lkml/2012/11/26/814
 
  Signed-off-by: Takao Indohindou.ta...@jp.fujitsu.com
  ---
Documentation/kernel-parameters.txt |2 +
drivers/pci/pci.c   |  103 
  +++
2 files changed, 105 insertions(+), 0 deletions(-)
 
  diff --git a/Documentation/kernel-parameters.txt 
  b/Documentation/kernel-parameters.txt
  index 4609e81..2a31ade 100644
  --- a/Documentation/kernel-parameters.txt
  +++ b/Documentation/kernel-parameters.txt
  @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
  entirely omitted.
 any pair of devices, possibly at the cost of
 reduced performance.  This also guarantees
 that hot-added devices will work.
  +  pcie_reset_devices  Reset PCIe endpoint on boot by hot
  +  reset
 cbiosize=nn[KMG]The fixed amount of bus space which is
 reserved for the CardBus bridge's IO window.
 The default value is 256 bytes.
  diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
  index b099e00..42385c9 100644
  --- a/drivers/pci/pci.c
  +++ b/drivers/pci/pci.c
  @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
}
EXPORT_SYMBOL(pci_fixup_cardbus);
 
  +/*
  + * Return true if dev is PCIe root port or downstream port whose child is 
  PCIe
  + * endpoint except VGA device.
  + */
  +static int __init need_reset(struct pci_dev *dev)
  +{
  +  struct pci_bus *subordinate;
  +  struct pci_dev *child;
  +
  +  if (!pci_is_pcie(dev) || !dev-subordinate ||
  +  list_empty(dev-subordinate-devices) ||
  +  ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)
  +   (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
  +  return 0;
  +
  +  subordinate = dev-subordinate;
  +  list_for_each_entry(child,subordinate-devices, bus_list) {
  +  if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
  +  (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
  +  ((child-class  16) == PCI_BASE_CLASS_DISPLAY))
  +  /* Don't reset switch, bridge, VGA device */
  +  return 0;
  +  }
  +
  +  return 1;
  +}
  +
  +static void __init save_config(struct pci_dev *dev)
  +{
  +  struct pci_bus *subordinate;
  +  struct pci_dev *child;
  +
  +  if (!need_reset(dev))
  +  return;
  +
  +  subordinate = dev-subordinate;
  +  list_for_each_entry(child,subordinate-devices, bus_list) {
  +  dev_info(child-dev, save state\n);
  +  pci_save_state(child);
  +  }
  +}
  +
  +static void __init restore_config(struct pci_dev *dev)
  +{
  +  struct pci_bus *subordinate;
  +  struct pci_dev *child;
  +
  +  if (!need_reset(dev))
  +  return;
  +
  +  subordinate = dev-subordinate;
  +  list_for_each_entry(child,subordinate-devices, bus_list) {
  +  dev_info(child-dev, restore state\n);
  +  pci_restore_state(child);
  +  }
  +}
  +
  +static void __init do_device_reset(struct pci_dev *dev)
  +{
  +  u16 ctrl;
  +
  +  if (!need_reset(dev))
  +  return;
  +
  +  dev_info(dev-dev, Reset Secondary bus\n);
  +
  +  /* Assert Secondary Bus Reset */
  +  

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-30 Thread Don Dutile

On 04/30/2013 10:54 AM, Sumner, William wrote:

I have installed your original patch set (from last November) and tested with 
three platforms, each with a different IO configuration.  On the first platform 
crashdumps were consistently successful.  On the second and third platforms, 
the reset of one specific PCI device on each platform (a different device type 
on each platform) immediately hung the crash kernel. This was consistent across 
multiple tries.  Temporarily modifying the patch to skip resetting the one 
problem-device on each platform allowed each platform to create a dump 
successfully.

I am now working with our IO team to determine the exact nature of the hang and 
to see if the hang is unique to the specific cards or the specific platforms.

Also I am starting to look at an alternate possibility:

The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level Reset) describes reset 
use-cases for "a partitioned environment where hardware is migrated from one partition to 
another" and for "system software is taking down the software stack for a Function and 
then rebuilding that stack".


Same section in PCIe spec v3.0.
The first paragraph after the 3 (square) bullets in 6.6.2 states: 
"Implemenation of FLR is optional (not required), but is strongly recommended.

This optional feature complicates the reset code, and may be the
reason you are seeing hangs.  do an lspci -xxx dump to see if the
devices you need to skip on reset have FLR capability (find device cap 
register, offset 0x4 in PCIe Cap struct, bit 28).

The only thing the device-endpoint-reset() doesn't do is monitor/check the
device's Device Status register (offset 0xA in PCI Cap struct) Transaction 
Pending bit.
The spec says that could take seconds to clear... typically milliseconds.
If typical, the reset mpsleep() handles that range.  Scanning the endpt-pdev 
list
to see if any Transaction Pending bits are set could be added, and then 
additional
1->2 second delay if still set added... potentially giving up after 2 secs.


These use-cases seem very similar to transitioning into the crash kernel.  The "Implementation 
Note" at the end of that section describes an algorithm for "Avoiding Data Corruption 
From Stale Completions" which looks like it might be applicable to stopping ongoing DMA.  I am 
hopeful that adding some of the steps from this algorithm to one of the already proposed patches 
would avoid the hang that I saw on two platforms.

Bill Sumner

-Original Message-
From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile
Sent: Friday, April 26, 2013 8:43 AM
To: Takao Indoh
Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
linux-kernel@vger.kernel.org; tin...@gmail.com; 
io...@lists.linux-foundation.org; bhelg...@google.com
Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

On 04/25/2013 11:10 PM, Takao Indoh wrote:

(2013/04/26 3:01), Don Dutile wrote:

On 04/25/2013 01:11 AM, Takao Indoh wrote:

(2013/04/25 4:59), Don Dutile wrote:

On 04/24/2013 12:58 AM, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_devices" is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.


Couple questions wrt VGA device:
(1) Many graphics devices are multi-function, one function being VGA;
 is the VGA always function 0, so this scan sees it first&doesn't
 do a reset on that PCIe link?  if the VGA is not function 0, won't
 this logic break (will reset b/c function 0 is non-VGA graphics) ?


VGA is not reset irrespective of its function number. The logic of this
patch is:

for_each_pci_dev(dev) {
if (dev is not PCIe)
   continue;
if (dev is not root port/downstream port) ---(1)
   continue;
   

RE: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-30 Thread Sumner, William
I have installed your original patch set (from last November) and tested with 
three platforms, each with a different IO configuration.  On the first platform 
crashdumps were consistently successful.  On the second and third platforms, 
the reset of one specific PCI device on each platform (a different device type 
on each platform) immediately hung the crash kernel. This was consistent across 
multiple tries.  Temporarily modifying the patch to skip resetting the one 
problem-device on each platform allowed each platform to create a dump 
successfully.

I am now working with our IO team to determine the exact nature of the hang and 
to see if the hang is unique to the specific cards or the specific platforms.

Also I am starting to look at an alternate possibility:

The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level Reset) 
describes reset use-cases for "a partitioned environment where hardware is 
migrated from one partition to another" and for "system software is taking down 
the software stack for a Function and then rebuilding that stack".

These use-cases seem very similar to transitioning into the crash kernel.  The 
"Implementation Note" at the end of that section describes an algorithm for 
"Avoiding Data Corruption From Stale Completions" which looks like it might be 
applicable to stopping ongoing DMA.  I am hopeful that adding some of the steps 
from this algorithm to one of the already proposed patches would avoid the hang 
that I saw on two platforms.

Bill Sumner

-Original Message-
From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile
Sent: Friday, April 26, 2013 8:43 AM
To: Takao Indoh
Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
linux-kernel@vger.kernel.org; tin...@gmail.com; 
io...@lists.linux-foundation.org; bhelg...@google.com
Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

On 04/25/2013 11:10 PM, Takao Indoh wrote:
> (2013/04/26 3:01), Don Dutile wrote:
>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>> (2013/04/25 4:59), Don Dutile wrote:
>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>
>>>>> Problem:
>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>> 1) Devices are working on first kernel.
>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>and its DMA continues during this switch.
>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>>>dma-remapping errors.
>>>>>
>>>>> Solution:
>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>> pci_iommu_init.
>>>>>
>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>>> monitor blacks out if VGA controller is reset.
>>>>>
>>>> Couple questions wrt VGA device:
>>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>> is the VGA always function 0, so this scan sees it first&   doesn't
>>>> do a reset on that PCIe link?  if the VGA is not function 0, won't
>>>> this logic break (will reset b/c function 0 is non-VGA graphics) ?
>>>
>>> VGA is not reset irrespective of its function number. The logic of this
>>> patch is:
>>>
>>> for_each_pci_dev(dev) {
>>>if (dev is not PCIe)
>>>   continue;
>>>if (dev is not root port/downstream port) ---(1)
>>>   continue;
>>>list_for_each_entry(child,>subordinate->devices, bus_list) {
>>>if (child is upstream port or bridge or VGA) ---(2)
>>>continue;
>>>}
>&g

RE: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-30 Thread Sumner, William
I have installed your original patch set (from last November) and tested with 
three platforms, each with a different IO configuration.  On the first platform 
crashdumps were consistently successful.  On the second and third platforms, 
the reset of one specific PCI device on each platform (a different device type 
on each platform) immediately hung the crash kernel. This was consistent across 
multiple tries.  Temporarily modifying the patch to skip resetting the one 
problem-device on each platform allowed each platform to create a dump 
successfully.

I am now working with our IO team to determine the exact nature of the hang and 
to see if the hang is unique to the specific cards or the specific platforms.

Also I am starting to look at an alternate possibility:

The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level Reset) 
describes reset use-cases for a partitioned environment where hardware is 
migrated from one partition to another and for system software is taking down 
the software stack for a Function and then rebuilding that stack.

These use-cases seem very similar to transitioning into the crash kernel.  The 
Implementation Note at the end of that section describes an algorithm for 
Avoiding Data Corruption From Stale Completions which looks like it might be 
applicable to stopping ongoing DMA.  I am hopeful that adding some of the steps 
from this algorithm to one of the already proposed patches would avoid the hang 
that I saw on two platforms.

Bill Sumner

-Original Message-
From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile
Sent: Friday, April 26, 2013 8:43 AM
To: Takao Indoh
Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
linux-kernel@vger.kernel.org; tin...@gmail.com; 
io...@lists.linux-foundation.org; bhelg...@google.com
Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

On 04/25/2013 11:10 PM, Takao Indoh wrote:
 (2013/04/26 3:01), Don Dutile wrote:
 On 04/25/2013 01:11 AM, Takao Indoh wrote:
 (2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Couple questions wrt VGA device:
 (1) Many graphics devices are multi-function, one function being VGA;
 is the VGA always function 0, so this scan sees it first   doesn't
 do a reset on that PCIe link?  if the VGA is not function 0, won't
 this logic break (will reset b/c function 0 is non-VGA graphics) ?

 VGA is not reset irrespective of its function number. The logic of this
 patch is:

 for_each_pci_dev(dev) {
if (dev is not PCIe)
   continue;
if (dev is not root port/downstream port) ---(1)
   continue;
list_for_each_entry(child,dev-subordinate-devices, bus_list) {
if (child is upstream port or bridge or VGA) ---(2)
continue;
}
do_reset_its_child(dev);
 }

 Therefore VGA itself is skipped by (1), and upstream device(root port or
 downstream port) of VGA is also skipped by (2).


 (2) I'm hearing VGA will soon not be the a required console; this logic
 assumes it is, and why it isn't blanked.
 Q: Should the filter be based on a device having a device-class of 
 display ?

 I want to avoid the situation that user's monitor blacks out and user
 cannot know what's going on. That's reason why I introduced the logic to
 skip VGA. As far as I tested the logic based on device-class works well,
 sorry, I read your description, which said VGA, but your are filtering on 
 display class,
 which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
 display devices
 are probably one of the most aggressive DMA engines on a system and will 
 grow

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-30 Thread Don Dutile

On 04/30/2013 10:54 AM, Sumner, William wrote:

I have installed your original patch set (from last November) and tested with 
three platforms, each with a different IO configuration.  On the first platform 
crashdumps were consistently successful.  On the second and third platforms, 
the reset of one specific PCI device on each platform (a different device type 
on each platform) immediately hung the crash kernel. This was consistent across 
multiple tries.  Temporarily modifying the patch to skip resetting the one 
problem-device on each platform allowed each platform to create a dump 
successfully.

I am now working with our IO team to determine the exact nature of the hang and 
to see if the hang is unique to the specific cards or the specific platforms.

Also I am starting to look at an alternate possibility:

The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level Reset) describes reset 
use-cases for a partitioned environment where hardware is migrated from one partition to 
another and for system software is taking down the software stack for a Function and 
then rebuilding that stack.


Same section in PCIe spec v3.0.
The first paragraph after the 3 (square) bullets in 6.6.2 states: 
Implemenation of FLR is optional (not required), but is strongly recommended.

This optional feature complicates the reset code, and may be the
reason you are seeing hangs.  do an lspci -xxx dump to see if the
devices you need to skip on reset have FLR capability (find device cap 
register, offset 0x4 in PCIe Cap struct, bit 28).

The only thing the device-endpoint-reset() doesn't do is monitor/check the
device's Device Status register (offset 0xA in PCI Cap struct) Transaction 
Pending bit.
The spec says that could take seconds to clear... typically milliseconds.
If typical, the reset mpsleep() handles that range.  Scanning the endpt-pdev 
list
to see if any Transaction Pending bits are set could be added, and then 
additional
1-2 second delay if still set added... potentially giving up after 2 secs.


These use-cases seem very similar to transitioning into the crash kernel.  The Implementation 
Note at the end of that section describes an algorithm for Avoiding Data Corruption 
From Stale Completions which looks like it might be applicable to stopping ongoing DMA.  I am 
hopeful that adding some of the steps from this algorithm to one of the already proposed patches 
would avoid the hang that I saw on two platforms.

Bill Sumner

-Original Message-
From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile
Sent: Friday, April 26, 2013 8:43 AM
To: Takao Indoh
Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
linux-kernel@vger.kernel.org; tin...@gmail.com; 
io...@lists.linux-foundation.org; bhelg...@google.com
Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

On 04/25/2013 11:10 PM, Takao Indoh wrote:

(2013/04/26 3:01), Don Dutile wrote:

On 04/25/2013 01:11 AM, Takao Indoh wrote:

(2013/04/25 4:59), Don Dutile wrote:

On 04/24/2013 12:58 AM, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
pci=pcie_reset_devices is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.


Couple questions wrt VGA device:
(1) Many graphics devices are multi-function, one function being VGA;
 is the VGA always function 0, so this scan sees it firstdoesn't
 do a reset on that PCIe link?  if the VGA is not function 0, won't
 this logic break (will reset b/c function 0 is non-VGA graphics) ?


VGA is not reset irrespective of its function number. The logic of this
patch is:

for_each_pci_dev(dev) {
if (dev is not PCIe)
   continue;
if (dev is not root port/downstream port) ---(1)
   continue;
list_for_each_entry(child,dev-subordinate-devices, bus_list) {
if (child

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-26 Thread Don Dutile
On 04/25/2013 11:10 PM, Takao Indoh wrote:
> (2013/04/26 3:01), Don Dutile wrote:
>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>> (2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
> PCIe root port and downstream port to reset its downstream endpoint.
>
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>dma-remapping errors.
>
> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
>
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
> port/downstream port whose child is PCIe endpoint, and then reset link
> between them. If the endpoint is VGA device, it is skipped because the
> monitor blacks out if VGA controller is reset.
>
 Couple questions wrt VGA device:
 (1) Many graphics devices are multi-function, one function being VGA;
 is the VGA always function 0, so this scan sees it first&   doesn't
 do a reset on that PCIe link?  if the VGA is not function 0, won't
 this logic break (will reset b/c function 0 is non-VGA graphics) ?
>>>
>>> VGA is not reset irrespective of its function number. The logic of this
>>> patch is:
>>>
>>> for_each_pci_dev(dev) {
>>>if (dev is not PCIe)
>>>   continue;
>>>if (dev is not root port/downstream port) ---(1)
>>>   continue;
>>>list_for_each_entry(child,>subordinate->devices, bus_list) {
>>>if (child is upstream port or bridge or VGA) ---(2)
>>>continue;
>>>}
>>>do_reset_its_child(dev);
>>> }
>>>
>>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>>> downstream port) of VGA is also skipped by (2).
>>>
>>>
 (2) I'm hearing VGA will soon not be the a required console; this logic
 assumes it is, and why it isn't blanked.
 Q: Should the filter be based on a device having a device-class of 
 display ?
>>>
>>> I want to avoid the situation that user's monitor blacks out and user
>>> cannot know what's going on. That's reason why I introduced the logic to
>>> skip VGA. As far as I tested the logic based on device-class works well,
>> sorry, I read your description, which said VGA, but your are filtering on 
>> display class,
>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
>> display devices
>> are probably one of the most aggressive DMA engines on a system and will 
>> grow as
>> asymmetric processing using GPUs gets architected into a device-agnostic 
>> manner.
>> So, this may work well for servers, which is the primary consumer/user of 
>> this feature,
>> and they typically have built-in graphics that are generally used in simple 
>> VGA mode,
>> so this may be sufficient for now.
> 
> Ok, understood.
> 
> 
>>
>>> but I would appreciate it if there are better ways.
>>>
>> You probably don't want to hear it but
>> a) only turn off cmd-reg master enable bit
>> b) only do reset based on a list of devices known not to
>>  obey their cmd-reg master enable bit, and only do reset to those 
>> devices.
>> But, given the testing you've done so far, this optional (need cmdline) 
>> feature,
>> let's start here.
> 
> Ok. Either way I think we need more testing.
> 
> 

> Actually this is v8 patch but quite different from v7 and it's been so
> long since previous post, so I start over again.
 Thanks for this re-start.  I need to continue reviewing the rest.
>>>
>>> Thank you for your review!
>>>

 Q: Why not force IOMMU off when re-booting a kexec kernel to perform a 
 crash
dump?  After the crash dump, the system is rebooting to previous 
 (iommu=on) setting.
That logic, along w/your previous patch to disable the IOMMU if 
 iommu=off
is set, would remove this (relatively slow) PCI init sequencing ?
>>>
>>> To force iommu off, all ongoing DMA have to be stopped before that since
>>> they are accessing the device address, not physical address. If we disable
>>> iommu without stopping in-flihgt DMA, devices access invalid 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-26 Thread Don Dutile
On 04/25/2013 11:10 PM, Takao Indoh wrote:
 (2013/04/26 3:01), Don Dutile wrote:
 On 04/25/2013 01:11 AM, Takao Indoh wrote:
 (2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Couple questions wrt VGA device:
 (1) Many graphics devices are multi-function, one function being VGA;
 is the VGA always function 0, so this scan sees it first   doesn't
 do a reset on that PCIe link?  if the VGA is not function 0, won't
 this logic break (will reset b/c function 0 is non-VGA graphics) ?

 VGA is not reset irrespective of its function number. The logic of this
 patch is:

 for_each_pci_dev(dev) {
if (dev is not PCIe)
   continue;
if (dev is not root port/downstream port) ---(1)
   continue;
list_for_each_entry(child,dev-subordinate-devices, bus_list) {
if (child is upstream port or bridge or VGA) ---(2)
continue;
}
do_reset_its_child(dev);
 }

 Therefore VGA itself is skipped by (1), and upstream device(root port or
 downstream port) of VGA is also skipped by (2).


 (2) I'm hearing VGA will soon not be the a required console; this logic
 assumes it is, and why it isn't blanked.
 Q: Should the filter be based on a device having a device-class of 
 display ?

 I want to avoid the situation that user's monitor blacks out and user
 cannot know what's going on. That's reason why I introduced the logic to
 skip VGA. As far as I tested the logic based on device-class works well,
 sorry, I read your description, which said VGA, but your are filtering on 
 display class,
 which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
 display devices
 are probably one of the most aggressive DMA engines on a system and will 
 grow as
 asymmetric processing using GPUs gets architected into a device-agnostic 
 manner.
 So, this may work well for servers, which is the primary consumer/user of 
 this feature,
 and they typically have built-in graphics that are generally used in simple 
 VGA mode,
 so this may be sufficient for now.
 
 Ok, understood.
 
 

 but I would appreciate it if there are better ways.

 You probably don't want to hear it but
 a) only turn off cmd-reg master enable bit
 b) only do reset based on a list of devices known not to
  obey their cmd-reg master enable bit, and only do reset to those 
 devices.
 But, given the testing you've done so far, this optional (need cmdline) 
 feature,
 let's start here.
 
 Ok. Either way I think we need more testing.
 
 

 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Thanks for this re-start.  I need to continue reviewing the rest.

 Thank you for your review!


 Q: Why not force IOMMU off when re-booting a kexec kernel to perform a 
 crash
dump?  After the crash dump, the system is rebooting to previous 
 (iommu=on) setting.
That logic, along w/your previous patch to disable the IOMMU if 
 iommu=off
is set, would remove this (relatively slow) PCI init sequencing ?

 To force iommu off, all ongoing DMA have to be stopped before that since
 they are accessing the device address, not physical address. If we disable
 iommu without stopping in-flihgt DMA, devices access invalid memory area
 and it causes memory corruption or PCI-SERR due to DMA error.
 Right, that's a 'duh' on my part.
 I thought 'disable iommu' == 'block all dma' and it just turns it off
 let's the ongoing DMA run...
 Please ignore this question... sigh.


 So, whether we use iommu or not in second kernel, we have to stop DMA in
 second kernel if iommu is used in first kernel.

 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-25 Thread Takao Indoh
(2013/04/26 3:01), Don Dutile wrote:
> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>> (2013/04/25 4:59), Don Dutile wrote:
>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
   and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
   dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

>>> Couple questions wrt VGA device:
>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>is the VGA always function 0, so this scan sees it first&  doesn't
>>>do a reset on that PCIe link?  if the VGA is not function 0, won't
>>>this logic break (will reset b/c function 0 is non-VGA graphics) ?
>>
>> VGA is not reset irrespective of its function number. The logic of this
>> patch is:
>>
>> for_each_pci_dev(dev) {
>>   if (dev is not PCIe)
>>  continue;
>>   if (dev is not root port/downstream port) ---(1)
>>  continue;
>>   list_for_each_entry(child,>subordinate->devices, bus_list) {
>>   if (child is upstream port or bridge or VGA) ---(2)
>>   continue;
>>   }
>>   do_reset_its_child(dev);
>> }
>>
>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>> downstream port) of VGA is also skipped by (2).
>>
>>
>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>assumes it is, and why it isn't blanked.
>>>Q: Should the filter be based on a device having a device-class of 
>>> display ?
>>
>> I want to avoid the situation that user's monitor blacks out and user
>> cannot know what's going on. That's reason why I introduced the logic to
>> skip VGA. As far as I tested the logic based on device-class works well,
> sorry, I read your description, which said VGA, but your are filtering on 
> display class,
> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
> display devices
> are probably one of the most aggressive DMA engines on a system and will 
> grow as
> asymmetric processing using GPUs gets architected into a device-agnostic 
> manner.
> So, this may work well for servers, which is the primary consumer/user of 
> this feature,
> and they typically have built-in graphics that are generally used in simple 
> VGA mode,
> so this may be sufficient for now.

Ok, understood.


> 
>> but I would appreciate it if there are better ways.
>>
> You probably don't want to hear it but
> a) only turn off cmd-reg master enable bit
> b) only do reset based on a list of devices known not to
> obey their cmd-reg master enable bit, and only do reset to those devices.
> But, given the testing you've done so far, this optional (need cmdline) 
> feature,
> let's start here.

Ok. Either way I think we need more testing.

 
>>>
 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
>>> Thanks for this re-start.  I need to continue reviewing the rest.
>>
>> Thank you for your review!
>>
>>>
>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
>>>   dump?  After the crash dump, the system is rebooting to previous 
>>> (iommu=on) setting.
>>>   That logic, along w/your previous patch to disable the IOMMU if 
>>> iommu=off
>>>   is set, would remove this (relatively slow) PCI init sequencing ?
>>
>> To force iommu off, all ongoing DMA have to be stopped before that since
>> they are accessing the device address, not physical address. If we disable
>> iommu without stopping in-flihgt DMA, devices access invalid memory area
>> and it causes memory corruption or PCI-SERR due to DMA error.
> Right, that's a 'duh' on my part.
> I thought 'disable iommu' == 'block all dma' and it just turns it off &
> let's 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-25 Thread Don Dutile
On 04/25/2013 01:11 AM, Takao Indoh wrote:
> (2013/04/25 4:59), Don Dutile wrote:
>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>
>>> Problem:
>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>> occur in second kernel and it causes problems like driver error or PCI
>>> SERR, at last kdump fails. This problem is caused as follows.
>>> 1) Devices are working on first kernel.
>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>  and its DMA continues during this switch.
>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>  dma-remapping errors.
>>>
>>> Solution:
>>> All DMA transactions have to be stopped before iommu is initialized. By
>>> this patch devices are reset and in-flight DMA is stopped before
>>> pci_iommu_init.
>>>
>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>> between them. If the endpoint is VGA device, it is skipped because the
>>> monitor blacks out if VGA controller is reset.
>>>
>> Couple questions wrt VGA device:
>> (1) Many graphics devices are multi-function, one function being VGA;
>>   is the VGA always function 0, so this scan sees it first&  doesn't
>>   do a reset on that PCIe link?  if the VGA is not function 0, won't
>>   this logic break (will reset b/c function 0 is non-VGA graphics) ?
> 
> VGA is not reset irrespective of its function number. The logic of this
> patch is:
> 
> for_each_pci_dev(dev) {
>  if (dev is not PCIe)
> continue;
>  if (dev is not root port/downstream port) ---(1)
> continue;
>  list_for_each_entry(child,>subordinate->devices, bus_list) {
>  if (child is upstream port or bridge or VGA) ---(2)
>  continue;
>  }
>  do_reset_its_child(dev);
> }
> 
> Therefore VGA itself is skipped by (1), and upstream device(root port or
> downstream port) of VGA is also skipped by (2).
> 
> 
>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>   assumes it is, and why it isn't blanked.
>>   Q: Should the filter be based on a device having a device-class of 
>> display ?
> 
> I want to avoid the situation that user's monitor blacks out and user
> cannot know what's going on. That's reason why I introduced the logic to
> skip VGA. As far as I tested the logic based on device-class works well,
sorry, I read your description, which said VGA, but your are filtering on 
display class,
which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
display devices
are probably one of the most aggressive DMA engines on a system and will 
grow as
asymmetric processing using GPUs gets architected into a device-agnostic manner.
So, this may work well for servers, which is the primary consumer/user of this 
feature,
and they typically have built-in graphics that are generally used in simple VGA 
mode,
so this may be sufficient for now.
 

> but I would appreciate it if there are better ways.
> 
You probably don't want to hear it but
a) only turn off cmd-reg master enable bit
b) only do reset based on a list of devices known not to 
   obey their cmd-reg master enable bit, and only do reset to those devices.
But, given the testing you've done so far, this optional (need cmdline) feature,
let's start here.

>>
>>> Actually this is v8 patch but quite different from v7 and it's been so
>>> long since previous post, so I start over again.
>> Thanks for this re-start.  I need to continue reviewing the rest.
> 
> Thank you for your review!
> 
>>
>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
>>  dump?  After the crash dump, the system is rebooting to previous 
>> (iommu=on) setting.
>>  That logic, along w/your previous patch to disable the IOMMU if 
>> iommu=off
>>  is set, would remove this (relatively slow) PCI init sequencing ?
> 
> To force iommu off, all ongoing DMA have to be stopped before that since
> they are accessing the device address, not physical address. If we disable
> iommu without stopping in-flihgt DMA, devices access invalid memory area
> and it causes memory corruption or PCI-SERR due to DMA error.
Right, that's a 'duh' on my part.
I thought 'disable iommu' == 'block all dma' and it just turns it off &
let's the ongoing DMA run... 
Please ignore this question... sigh.

> 
> So, whether we use iommu or not in second kernel, we have to stop DMA in
> second kernel if iommu is used in first kernel.
> 
> Thanks,
> Takao Indoh
> 
> 
>>
>>> Previous post:

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-25 Thread Don Dutile
On 04/25/2013 01:11 AM, Takao Indoh wrote:
 (2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
  and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
  dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Couple questions wrt VGA device:
 (1) Many graphics devices are multi-function, one function being VGA;
   is the VGA always function 0, so this scan sees it first  doesn't
   do a reset on that PCIe link?  if the VGA is not function 0, won't
   this logic break (will reset b/c function 0 is non-VGA graphics) ?
 
 VGA is not reset irrespective of its function number. The logic of this
 patch is:
 
 for_each_pci_dev(dev) {
  if (dev is not PCIe)
 continue;
  if (dev is not root port/downstream port) ---(1)
 continue;
  list_for_each_entry(child,dev-subordinate-devices, bus_list) {
  if (child is upstream port or bridge or VGA) ---(2)
  continue;
  }
  do_reset_its_child(dev);
 }
 
 Therefore VGA itself is skipped by (1), and upstream device(root port or
 downstream port) of VGA is also skipped by (2).
 
 
 (2) I'm hearing VGA will soon not be the a required console; this logic
   assumes it is, and why it isn't blanked.
   Q: Should the filter be based on a device having a device-class of 
 display ?
 
 I want to avoid the situation that user's monitor blacks out and user
 cannot know what's going on. That's reason why I introduced the logic to
 skip VGA. As far as I tested the logic based on device-class works well,
sorry, I read your description, which said VGA, but your are filtering on 
display class,
which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
display devices
are probably one of the most aggressive DMA engines on a system and will 
grow as
asymmetric processing using GPUs gets architected into a device-agnostic manner.
So, this may work well for servers, which is the primary consumer/user of this 
feature,
and they typically have built-in graphics that are generally used in simple VGA 
mode,
so this may be sufficient for now.
 

 but I would appreciate it if there are better ways.
 
You probably don't want to hear it but
a) only turn off cmd-reg master enable bit
b) only do reset based on a list of devices known not to 
   obey their cmd-reg master enable bit, and only do reset to those devices.
But, given the testing you've done so far, this optional (need cmdline) feature,
let's start here.


 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Thanks for this re-start.  I need to continue reviewing the rest.
 
 Thank you for your review!
 

 Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
  dump?  After the crash dump, the system is rebooting to previous 
 (iommu=on) setting.
  That logic, along w/your previous patch to disable the IOMMU if 
 iommu=off
  is set, would remove this (relatively slow) PCI init sequencing ?
 
 To force iommu off, all ongoing DMA have to be stopped before that since
 they are accessing the device address, not physical address. If we disable
 iommu without stopping in-flihgt DMA, devices access invalid memory area
 and it causes memory corruption or PCI-SERR due to DMA error.
Right, that's a 'duh' on my part.
I thought 'disable iommu' == 'block all dma' and it just turns it off 
let's the ongoing DMA run... 
Please ignore this question... sigh.

 
 So, whether we use iommu or not in second kernel, we have to stop DMA in
 second kernel if iommu is used in first kernel.
 
 Thanks,
 Takao Indoh
 
 

 Previous post:
 [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
 https://lkml.org/lkml/2012/11/26/814

 Signed-off-by: Takao Indohindou.ta...@jp.fujitsu.com
 ---

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-25 Thread Takao Indoh
(2013/04/26 3:01), Don Dutile wrote:
 On 04/25/2013 01:11 AM, Takao Indoh wrote:
 (2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
   and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
   dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Couple questions wrt VGA device:
 (1) Many graphics devices are multi-function, one function being VGA;
is the VGA always function 0, so this scan sees it first  doesn't
do a reset on that PCIe link?  if the VGA is not function 0, won't
this logic break (will reset b/c function 0 is non-VGA graphics) ?

 VGA is not reset irrespective of its function number. The logic of this
 patch is:

 for_each_pci_dev(dev) {
   if (dev is not PCIe)
  continue;
   if (dev is not root port/downstream port) ---(1)
  continue;
   list_for_each_entry(child,dev-subordinate-devices, bus_list) {
   if (child is upstream port or bridge or VGA) ---(2)
   continue;
   }
   do_reset_its_child(dev);
 }

 Therefore VGA itself is skipped by (1), and upstream device(root port or
 downstream port) of VGA is also skipped by (2).


 (2) I'm hearing VGA will soon not be the a required console; this logic
assumes it is, and why it isn't blanked.
Q: Should the filter be based on a device having a device-class of 
 display ?

 I want to avoid the situation that user's monitor blacks out and user
 cannot know what's going on. That's reason why I introduced the logic to
 skip VGA. As far as I tested the logic based on device-class works well,
 sorry, I read your description, which said VGA, but your are filtering on 
 display class,
 which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
 display devices
 are probably one of the most aggressive DMA engines on a system and will 
 grow as
 asymmetric processing using GPUs gets architected into a device-agnostic 
 manner.
 So, this may work well for servers, which is the primary consumer/user of 
 this feature,
 and they typically have built-in graphics that are generally used in simple 
 VGA mode,
 so this may be sufficient for now.

Ok, understood.


 
 but I would appreciate it if there are better ways.

 You probably don't want to hear it but
 a) only turn off cmd-reg master enable bit
 b) only do reset based on a list of devices known not to
 obey their cmd-reg master enable bit, and only do reset to those devices.
 But, given the testing you've done so far, this optional (need cmdline) 
 feature,
 let's start here.

Ok. Either way I think we need more testing.

 

 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Thanks for this re-start.  I need to continue reviewing the rest.

 Thank you for your review!


 Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
   dump?  After the crash dump, the system is rebooting to previous 
 (iommu=on) setting.
   That logic, along w/your previous patch to disable the IOMMU if 
 iommu=off
   is set, would remove this (relatively slow) PCI init sequencing ?

 To force iommu off, all ongoing DMA have to be stopped before that since
 they are accessing the device address, not physical address. If we disable
 iommu without stopping in-flihgt DMA, devices access invalid memory area
 and it causes memory corruption or PCI-SERR due to DMA error.
 Right, that's a 'duh' on my part.
 I thought 'disable iommu' == 'block all dma' and it just turns it off 
 let's the ongoing DMA run...
 Please ignore this question... sigh.
 

 So, whether we use iommu or not in second kernel, we have to stop DMA in
 second kernel if iommu is used in first kernel.

 Thanks,
 Takao Indoh



 Previous post:
 [PATCH v7 0/5] Reset PCIe devices to 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-24 Thread Takao Indoh
(2013/04/25 4:59), Don Dutile wrote:
> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>> PCIe root port and downstream port to reset its downstream endpoint.
>>
>> Problem:
>> This patch solves the problem that kdump can fail when intel_iommu=on is
>> specified. When intel_iommu=on is specified, many dma-remapping errors
>> occur in second kernel and it causes problems like driver error or PCI
>> SERR, at last kdump fails. This problem is caused as follows.
>> 1) Devices are working on first kernel.
>> 2) Switch to second kernel(kdump kernel). The devices are still working
>> and its DMA continues during this switch.
>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>> dma-remapping errors.
>>
>> Solution:
>> All DMA transactions have to be stopped before iommu is initialized. By
>> this patch devices are reset and in-flight DMA is stopped before
>> pci_iommu_init.
>>
>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>> port/downstream port whose child is PCIe endpoint, and then reset link
>> between them. If the endpoint is VGA device, it is skipped because the
>> monitor blacks out if VGA controller is reset.
>>
> Couple questions wrt VGA device:
> (1) Many graphics devices are multi-function, one function being VGA;
>  is the VGA always function 0, so this scan sees it first & doesn't
>  do a reset on that PCIe link?  if the VGA is not function 0, won't
>  this logic break (will reset b/c function 0 is non-VGA graphics) ?

VGA is not reset irrespective of its function number. The logic of this
patch is:

for_each_pci_dev(dev) {
if (dev is not PCIe)
   continue;
if (dev is not root port/downstream port) ---(1)
   continue;
list_for_each_entry(child,>subordinate->devices, bus_list) {
if (child is upstream port or bridge or VGA) ---(2)
continue;
}
do_reset_its_child(dev);
}

Therefore VGA itself is skipped by (1), and upstream device(root port or
downstream port) of VGA is also skipped by (2).


> (2) I'm hearing VGA will soon not be the a required console; this logic
>  assumes it is, and why it isn't blanked.
>  Q: Should the filter be based on a device having a device-class of 
> display ?

I want to avoid the situation that user's monitor blacks out and user
cannot know what's going on. That's reason why I introduced the logic to
skip VGA. As far as I tested the logic based on device-class works well,
but I would appreciate it if there are better ways.

> 
>> Actually this is v8 patch but quite different from v7 and it's been so
>> long since previous post, so I start over again.
> Thanks for this re-start.  I need to continue reviewing the rest.

Thank you for your review!

> 
> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
> dump?  After the crash dump, the system is rebooting to previous 
> (iommu=on) setting.
> That logic, along w/your previous patch to disable the IOMMU if iommu=off
> is set, would remove this (relatively slow) PCI init sequencing ?

To force iommu off, all ongoing DMA have to be stopped before that since
they are accessing the device address, not physical address. If we disable
iommu without stopping in-flihgt DMA, devices access invalid memory area
and it causes memory corruption or PCI-SERR due to DMA error.

So, whether we use iommu or not in second kernel, we have to stop DMA in
second kernel if iommu is used in first kernel.

Thanks,
Takao Indoh


> 
>> Previous post:
>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
>> https://lkml.org/lkml/2012/11/26/814
>>
>> Signed-off-by: Takao Indoh
>> ---
>>   Documentation/kernel-parameters.txt |2 +
>>   drivers/pci/pci.c   |  103 
>> +++
>>   2 files changed, 105 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt 
>> b/Documentation/kernel-parameters.txt
>> index 4609e81..2a31ade 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
>> entirely omitted.
>>   any pair of devices, possibly at the cost of
>>   reduced performance.  This also guarantees
>>   that hot-added devices will work.
>> +pcie_reset_devicesReset PCIe endpoint on boot by hot
>> +reset
>>   cbiosize=nn[KMG]The fixed amount of bus space which is
>>   reserved for the CardBus bridge's IO window.
>>   The default value is 256 bytes.
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index b099e00..42385c9 100644
>> --- a/drivers/pci/pci.c
>> 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-24 Thread Don Dutile

On 04/24/2013 12:58 AM, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_devices" is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.


Couple questions wrt VGA device:
(1) Many graphics devices are multi-function, one function being VGA;
is the VGA always function 0, so this scan sees it first & doesn't
do a reset on that PCIe link?  if the VGA is not function 0, won't
this logic break (will reset b/c function 0 is non-VGA graphics) ?
(2) I'm hearing VGA will soon not be the a required console; this logic
assumes it is, and why it isn't blanked.
Q: Should the filter be based on a device having a device-class of display ?


Actually this is v8 patch but quite different from v7 and it's been so
long since previous post, so I start over again.

Thanks for this re-start.  I need to continue reviewing the rest.

Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
   dump?  After the crash dump, the system is rebooting to previous (iommu=on) 
setting.
   That logic, along w/your previous patch to disable the IOMMU if iommu=off
   is set, would remove this (relatively slow) PCI init sequencing ?


Previous post:
[PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
https://lkml.org/lkml/2012/11/26/814

Signed-off-by: Takao Indoh
---
  Documentation/kernel-parameters.txt |2 +
  drivers/pci/pci.c   |  103 +++
  2 files changed, 105 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4609e81..2a31ade 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
any pair of devices, possibly at the cost of
reduced performance.  This also guarantees
that hot-added devices will work.
+   pcie_reset_devices  Reset PCIe endpoint on boot by hot
+   reset
cbiosize=nn[KMG]The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b099e00..42385c9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
  }
  EXPORT_SYMBOL(pci_fixup_cardbus);

+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!pci_is_pcie(dev) || !dev->subordinate ||
+   list_empty(>subordinate->devices) ||
+   ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
+(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+   return 0;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,>devices, bus_list) {
+   if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+   ((child->class>>  16) == PCI_BASE_CLASS_DISPLAY))
+   /* Don't reset switch, bridge, VGA device */
+   return 0;
+   }
+
+   return 1;
+}
+
+static void __init save_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,>devices, bus_list) {
+   dev_info(>dev, 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-24 Thread Don Dutile

On 04/24/2013 12:58 AM, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
pci=pcie_reset_devices is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.


Couple questions wrt VGA device:
(1) Many graphics devices are multi-function, one function being VGA;
is the VGA always function 0, so this scan sees it first  doesn't
do a reset on that PCIe link?  if the VGA is not function 0, won't
this logic break (will reset b/c function 0 is non-VGA graphics) ?
(2) I'm hearing VGA will soon not be the a required console; this logic
assumes it is, and why it isn't blanked.
Q: Should the filter be based on a device having a device-class of display ?


Actually this is v8 patch but quite different from v7 and it's been so
long since previous post, so I start over again.

Thanks for this re-start.  I need to continue reviewing the rest.

Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
   dump?  After the crash dump, the system is rebooting to previous (iommu=on) 
setting.
   That logic, along w/your previous patch to disable the IOMMU if iommu=off
   is set, would remove this (relatively slow) PCI init sequencing ?


Previous post:
[PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
https://lkml.org/lkml/2012/11/26/814

Signed-off-by: Takao Indohindou.ta...@jp.fujitsu.com
---
  Documentation/kernel-parameters.txt |2 +
  drivers/pci/pci.c   |  103 +++
  2 files changed, 105 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4609e81..2a31ade 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
any pair of devices, possibly at the cost of
reduced performance.  This also guarantees
that hot-added devices will work.
+   pcie_reset_devices  Reset PCIe endpoint on boot by hot
+   reset
cbiosize=nn[KMG]The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b099e00..42385c9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
  }
  EXPORT_SYMBOL(pci_fixup_cardbus);

+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!pci_is_pcie(dev) || !dev-subordinate ||
+   list_empty(dev-subordinate-devices) ||
+   ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)
+(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+   return 0;
+
+   subordinate = dev-subordinate;
+   list_for_each_entry(child,subordinate-devices, bus_list) {
+   if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+   ((child-class  16) == PCI_BASE_CLASS_DISPLAY))
+   /* Don't reset switch, bridge, VGA device */
+   return 0;
+   }
+
+   return 1;
+}
+
+static void __init save_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev-subordinate;
+   list_for_each_entry(child,subordinate-devices, 

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-24 Thread Takao Indoh
(2013/04/25 4:59), Don Dutile wrote:
 On 04/24/2013 12:58 AM, Takao Indoh wrote:
 This patch resets PCIe devices on boot to stop ongoing DMA. When
 pci=pcie_reset_devices is specified, a hot reset is triggered on each
 PCIe root port and downstream port to reset its downstream endpoint.

 Problem:
 This patch solves the problem that kdump can fail when intel_iommu=on is
 specified. When intel_iommu=on is specified, many dma-remapping errors
 occur in second kernel and it causes problems like driver error or PCI
 SERR, at last kdump fails. This problem is caused as follows.
 1) Devices are working on first kernel.
 2) Switch to second kernel(kdump kernel). The devices are still working
 and its DMA continues during this switch.
 3) iommu is initialized during second kernel boot and ongoing DMA causes
 dma-remapping errors.

 Solution:
 All DMA transactions have to be stopped before iommu is initialized. By
 this patch devices are reset and in-flight DMA is stopped before
 pci_iommu_init.

 To invoke hot reset on an endpoint, its upstream link need to be reset.
 reset_pcie_devices() is called from fs_initcall_sync, and it finds root
 port/downstream port whose child is PCIe endpoint, and then reset link
 between them. If the endpoint is VGA device, it is skipped because the
 monitor blacks out if VGA controller is reset.

 Couple questions wrt VGA device:
 (1) Many graphics devices are multi-function, one function being VGA;
  is the VGA always function 0, so this scan sees it first  doesn't
  do a reset on that PCIe link?  if the VGA is not function 0, won't
  this logic break (will reset b/c function 0 is non-VGA graphics) ?

VGA is not reset irrespective of its function number. The logic of this
patch is:

for_each_pci_dev(dev) {
if (dev is not PCIe)
   continue;
if (dev is not root port/downstream port) ---(1)
   continue;
list_for_each_entry(child,dev-subordinate-devices, bus_list) {
if (child is upstream port or bridge or VGA) ---(2)
continue;
}
do_reset_its_child(dev);
}

Therefore VGA itself is skipped by (1), and upstream device(root port or
downstream port) of VGA is also skipped by (2).


 (2) I'm hearing VGA will soon not be the a required console; this logic
  assumes it is, and why it isn't blanked.
  Q: Should the filter be based on a device having a device-class of 
 display ?

I want to avoid the situation that user's monitor blacks out and user
cannot know what's going on. That's reason why I introduced the logic to
skip VGA. As far as I tested the logic based on device-class works well,
but I would appreciate it if there are better ways.

 
 Actually this is v8 patch but quite different from v7 and it's been so
 long since previous post, so I start over again.
 Thanks for this re-start.  I need to continue reviewing the rest.

Thank you for your review!

 
 Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
 dump?  After the crash dump, the system is rebooting to previous 
 (iommu=on) setting.
 That logic, along w/your previous patch to disable the IOMMU if iommu=off
 is set, would remove this (relatively slow) PCI init sequencing ?

To force iommu off, all ongoing DMA have to be stopped before that since
they are accessing the device address, not physical address. If we disable
iommu without stopping in-flihgt DMA, devices access invalid memory area
and it causes memory corruption or PCI-SERR due to DMA error.

So, whether we use iommu or not in second kernel, we have to stop DMA in
second kernel if iommu is used in first kernel.

Thanks,
Takao Indoh


 
 Previous post:
 [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
 https://lkml.org/lkml/2012/11/26/814

 Signed-off-by: Takao Indohindou.ta...@jp.fujitsu.com
 ---
   Documentation/kernel-parameters.txt |2 +
   drivers/pci/pci.c   |  103 
 +++
   2 files changed, 105 insertions(+), 0 deletions(-)

 diff --git a/Documentation/kernel-parameters.txt 
 b/Documentation/kernel-parameters.txt
 index 4609e81..2a31ade 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
 entirely omitted.
   any pair of devices, possibly at the cost of
   reduced performance.  This also guarantees
   that hot-added devices will work.
 +pcie_reset_devicesReset PCIe endpoint on boot by hot
 +reset
   cbiosize=nn[KMG]The fixed amount of bus space which is
   reserved for the CardBus bridge's IO window.
   The default value is 256 bytes.
 diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
 index b099e00..42385c9 100644
 --- a/drivers/pci/pci.c
 +++ b/drivers/pci/pci.c
 @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
   }