Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Dave Young
Hi,

[snip]
> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump

We should only care about kdump here.

> 
> I see little direct relation to SEV here. The only reason SEV makes it more
> relevant, is that you need to have an SWIOTLB region available with SEV
> while without you could live with a disabled IOMMU.


Here is a comment in arch/x86/kernel/pci-swiotlb.c; swiotlb is enforced
there for a reason:
/*
 * If SME is active then swiotlb will be set to 1 so that bounce
 * buffers are allocated and used for devices that do not support
 * the addressing range required for the encryption mask.
 */
if (sme_active())
swiotlb = 1;

> 
> However, I can definitely understand how you would want to have a way to
> tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its
> memory for its own SWIOTLB. That way, you don't have to reserve another 64MB
> of RAM for kdump.
> 
> What I'm curious on is whether we need to be as elaborate. Can't we just
> pass the old SWIOTLB as free memory to the new kexec'ed kernel and
> everything else will fall into place? All that would take is a bit of
> shuffling on the e820 table pass-through to the kexec'ed kernel, no?

Maybe either of the two is fine.  But we may need to ensure the swiotlb
area is reused explicitly in some way.  Take the crashkernel=X,high case:
the major part of the reservation is in the above-4G region, with only a
small piece in low memory.  Then, during boot, kernel/driver
initialization could exhaust the low memory, the remaining part may not be
big enough for the swiotlb, and the swiotlb allocation finally fails.

Thanks
Dave

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Dave Young
On 03/30/20 at 09:40am, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
> > On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
> > >
> > > On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > > > The swiotlb is a very convenient fallback mechanism for bounce buffering of
> > > > DMAable data. It is usually used for the compatibility case where devices
> > > > can only DMA to a "low region".
> > > >
> > > > However, in some scenarios this "low region" may be bound even more
> > > > heavily. For example, there are embedded system where only an SRAM region
> > > > is shared between device and CPU. There are also heterogeneous computing
> > > > scenarios where only a subset of RAM is cache coherent between the
> > > > components of the system. There are partitioning hypervisors, where
> > > > a "control VM" that implements device emulation has limited view into a
> > > > partition's memory for DMA capabilities due to safety concerns.
> > > >
> > > > This patch adds a command line driven mechanism to move all DMA memory into
> > > > a predefined shared memory region which may or may not be part of the
> > > > physical address layout of the Operating System.
> > > >
> > > > Ideally, the typical path to set this configuration would be through Device
> > > > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > > instead configure the system purely through kernel command line options.
> > > >
> > > > I'm sure other people will find the functionality useful going forward
> > > > though and extend it to be triggered by DT/ACPI in the future.
> > >
> > > Hmm, we have a use case for kdump where this may be useful.  For example,
> > > swiotlb is enabled by default if AMD SME/SEV is active, and in the kdump
> > > kernel we have to increase the crashkernel reserved size for the extra
> > > swiotlb requirement.  I wonder if we can just reuse the old kernel's
> > > swiotlb region and pass the address to the kdump kernel.
> > >
> > 
> > Yes, definitely helpful for kdump kernel. This can help reduce the
> > crashkernel value.
> > 
> > Previously I was thinking about something similar, play around the
> > e820 entry passed to kdump and let it place swiotlb in wanted region.
> > Simply remap it like in this patch looks much cleaner.
> > 
> > If this patch is acceptable, one more patch is needed to expose the
> > swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> > second kernel.
> 
> We seem to be passing a lot of data to kexec.  Perhaps something
> of a unified way, since we seem to have a lot of things to pass: disabling
> the IOMMU, the ACPI RSDP address, and then this.

The acpi_rsdp kernel cmdline option is not useful anymore.  Its initial
purpose was for kexec-rebooting on EFI systems, but now that EFI boot is
supported across kexec reboot, it is no longer needed.

Thanks
Dave



Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-28 Thread Dave Young
On 03/26/20 at 05:29pm, Alexander Graf wrote:
> The swiotlb is a very convenient fallback mechanism for bounce buffering of
> DMAable data. It is usually used for the compatibility case where devices
> can only DMA to a "low region".
> 
> However, in some scenarios this "low region" may be bound even more
> heavily. For example, there are embedded system where only an SRAM region
> is shared between device and CPU. There are also heterogeneous computing
> scenarios where only a subset of RAM is cache coherent between the
> components of the system. There are partitioning hypervisors, where
> a "control VM" that implements device emulation has limited view into a
> partition's memory for DMA capabilities due to safety concerns.
> 
> This patch adds a command line driven mechanism to move all DMA memory into
> a predefined shared memory region which may or may not be part of the
> physical address layout of the Operating System.
> 
> Ideally, the typical path to set this configuration would be through Device
> Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> instead configure the system purely through kernel command line options.
> 
> I'm sure other people will find the functionality useful going forward
> though and extend it to be triggered by DT/ACPI in the future.

Hmm, we have a use case for kdump where this may be useful.  For example,
swiotlb is enabled by default if AMD SME/SEV is active, and in the kdump
kernel we have to increase the crashkernel reserved size for the extra
swiotlb requirement.  I wonder if we can just reuse the old kernel's
swiotlb region and pass the address to the kdump kernel.

> 
> Signed-off-by: Alexander Graf 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  3 +-
>  Documentation/x86/x86_64/boot-options.rst   |  4 ++-
>  kernel/dma/swiotlb.c                            | 46 +++--
>  3 files changed, 49 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c07815d230bc..d085d55c3cbe 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4785,11 +4785,12 @@
>   it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
>  
>   swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
> - 		Format: { <int> | force | noforce }
> + 		Format: { <int> | force | noforce | addr=<address> }
>   		<int> -- Number of I/O TLB slabs
>   		force -- force using of bounce buffers even if they
>   			 wouldn't be automatically used by the kernel
>   		noforce -- Never use bounce buffers (for debugging)
> + 		addr=<address> -- Try to allocate SWIOTLB at defined address
>  
>   switches=   [HW,M68k]
>  
> diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
> index 2b98efb5ba7f..ca46c57b68c9 100644
> --- a/Documentation/x86/x86_64/boot-options.rst
> +++ b/Documentation/x86/x86_64/boot-options.rst
> @@ -297,11 +297,13 @@ iommu options only relevant to the AMD GART hardware IOMMU:
>  iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
>  implementation:
>  
> -swiotlb=<pages>[,force]
> +swiotlb=<pages>[,force][,addr=<address>]
>
>  Prereserve that many 128K pages for the software IO bounce buffering.
>    force
>      Force all IO through the software TLB.
> +  addr=<address>
> +    Try to allocate SWIOTLB at defined address
>  
>  Settings for the IBM Calgary hardware IOMMU currently found in IBM
>  pSeries and xSeries machines
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c19379fabd20..83da0caa2f93 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -46,6 +46,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -102,6 +103,12 @@ unsigned int max_segment;
>  #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
>  static phys_addr_t *io_tlb_orig_addr;
>  
> +/*
> + * The TLB phys addr may be defined on the command line. Store it here if it is.
> + */
> +static phys_addr_t io_tlb_addr = INVALID_PHYS_ADDR;
> +
> +
>  /*
>   * Protect the above data structures in the map and unmap calls
>   */
> @@ -119,11 +126,23 @@ setup_io_tlb_npages(char *str)
>   }
>   if (*str == ',')
>   ++str;
> - if (!strcmp(str, "force")) {
> + if (!strncmp(str, "force", 5)) {
>   swiotlb_force = SWIOTLB_FORCE;
> - } else if (!strcmp(str, "noforce")) {
> + str += 5;
> + } else if (!strncmp(str, "noforce", 7)) {
>   swiotlb_force = SWIOTLB_NO_FORCE;
>   io_tlb_nslabs = 1;
> + str += 7;
> + }
> +
> + if (*str == ',')
> + 

Re: Crash kernel with 256 MB reserved memory runs into OOM condition

2019-08-12 Thread Dave Young
On 08/13/19 at 10:46am, Dave Young wrote:
> Add more cc.
> On 08/13/19 at 10:43am, Dave Young wrote:
> > Hi,
> > 
> > On 08/12/19 at 11:50am, Michal Hocko wrote:
> > > On Mon 12-08-19 11:42:33, Paul Menzel wrote:
> > > > Dear Linux folks,
> > > > 
> > > > 
> > > > On a Dell PowerEdge R7425 with two AMD EPYC 7601 (total 128 threads) and
> > > > 1 TB RAM, the crash kernel with 256 MB of space reserved crashes.
> > > > 
> > > > Please find the messages of the normal and the crash kernel attached.
> > > 
> > > You will need more memory to reserve for the crash kernel because ...
> > > 
> > > > [4.548703] Node 0 DMA free:484kB min:4kB low:4kB high:4kB 
> > > > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
> > > > unevictable:0kB writepending:0kB present:568kB managed:484kB 
> > > > mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB 
> > > > local_pcp:0kB free_cma:0kB
> > > > [4.573612] lowmem_reserve[]: 0 125 125 125
> > > > [4.577799] Node 0 DMA32 free:1404kB min:1428kB low:1784kB 
> > > > high:2140kB active_anon:0kB inactive_anon:0kB active_file:0kB 
> > > > inactive_file:0kB unevictable:15720kB writepending:0kB present:261560kB 
> > > > managed:133752kB mlocked:0kB kernel_stack:2496kB pagetables:0kB 
> > > > bounce:0kB free_pcp:212kB local_pcp:212kB free_cma:0kB
> > > 
> > > ... the memory is really depleted and there is nothing to be reclaimed (no
> > > anon/file pages). Look how that free memory is below the min watermark
> > > (node zone DMA has lowmem protection for GFP_KERNEL allocation).
> > 
> > We found a similar issue on our side while working on kdump on SME-enabled
> > systems.  Kairui is working on some patches.
> > 
> > Actually, on those SME/SEV-enabled machines, swiotlb is enabled
> > automatically, so we need at least an extra 64M+ of memory for kdump
> > beyond the normal expectation.
> > 
> > Can you check if this is also your case?
> 
> The question is for Paul.  Also, it would always be good to cc the kexec
> mailing list for kexec and kdump issues.

It looks like a hardware IOMMU is in use; maybe you did not enable SME?

Also, replacing maxcpus=1 with nr_cpus=1 can save some memory; you can
give it a try.



Re: Crash kernel with 256 MB reserved memory runs into OOM condition

2019-08-12 Thread Dave Young
Add more cc.
On 08/13/19 at 10:43am, Dave Young wrote:
> Hi,
> 
> On 08/12/19 at 11:50am, Michal Hocko wrote:
> > On Mon 12-08-19 11:42:33, Paul Menzel wrote:
> > > Dear Linux folks,
> > > 
> > > 
> > > On a Dell PowerEdge R7425 with two AMD EPYC 7601 (total 128 threads) and
> > > 1 TB RAM, the crash kernel with 256 MB of space reserved crashes.
> > > 
> > > Please find the messages of the normal and the crash kernel attached.
> > 
> > You will need more memory to reserve for the crash kernel because ...
> > 
> > > [4.548703] Node 0 DMA free:484kB min:4kB low:4kB high:4kB 
> > > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
> > > unevictable:0kB writepending:0kB present:568kB managed:484kB mlocked:0kB 
> > > kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB 
> > > free_cma:0kB
> > > [4.573612] lowmem_reserve[]: 0 125 125 125
> > > [4.577799] Node 0 DMA32 free:1404kB min:1428kB low:1784kB high:2140kB 
> > > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
> > > unevictable:15720kB writepending:0kB present:261560kB managed:133752kB 
> > > mlocked:0kB kernel_stack:2496kB pagetables:0kB bounce:0kB free_pcp:212kB 
> > > local_pcp:212kB free_cma:0kB
> > 
> > ... the memory is really depleted and there is nothing to be reclaimed (no
> > anon/file pages). Look how that free memory is below the min watermark
> > (node zone DMA has lowmem protection for GFP_KERNEL allocation).
> 
> We found a similar issue on our side while working on kdump on SME-enabled
> systems.  Kairui is working on some patches.
> 
> Actually, on those SME/SEV-enabled machines, swiotlb is enabled
> automatically, so we need at least an extra 64M+ of memory for kdump
> beyond the normal expectation.
> 
> Can you check if this is also your case?

The question is for Paul.  Also, it would always be good to cc the kexec
mailing list for kexec and kdump issues.


Re: Crash kernel with 256 MB reserved memory runs into OOM condition

2019-08-12 Thread Dave Young
Hi,

On 08/12/19 at 11:50am, Michal Hocko wrote:
> On Mon 12-08-19 11:42:33, Paul Menzel wrote:
> > Dear Linux folks,
> > 
> > 
> > On a Dell PowerEdge R7425 with two AMD EPYC 7601 (total 128 threads) and
> > 1 TB RAM, the crash kernel with 256 MB of space reserved crashes.
> > 
> > Please find the messages of the normal and the crash kernel attached.
> 
> You will need more memory to reserve for the crash kernel because ...
> 
> > [4.548703] Node 0 DMA free:484kB min:4kB low:4kB high:4kB 
> > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
> > unevictable:0kB writepending:0kB present:568kB managed:484kB mlocked:0kB 
> > kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB 
> > free_cma:0kB
> > [4.573612] lowmem_reserve[]: 0 125 125 125
> > [4.577799] Node 0 DMA32 free:1404kB min:1428kB low:1784kB high:2140kB 
> > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
> > unevictable:15720kB writepending:0kB present:261560kB managed:133752kB 
> > mlocked:0kB kernel_stack:2496kB pagetables:0kB bounce:0kB free_pcp:212kB 
> > local_pcp:212kB free_cma:0kB
> 
> ... the memory is really depleted and there is nothing to be reclaimed (no
> anon/file pages). Look how that free memory is below the min watermark
> (node zone DMA has lowmem protection for GFP_KERNEL allocation).

We found a similar issue on our side while working on kdump on SME-enabled
systems.  Kairui is working on some patches.

Actually, on those SME/SEV-enabled machines, swiotlb is enabled
automatically, so we need at least an extra 64M+ of memory for kdump
beyond the normal expectation.

Can you check if this is also your case?

Thanks
Dave


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-03-22 Thread Dave Young
Hi Pingfan, 

Thanks for the effort,
On 03/01/19 at 11:19am, Pingfan Liu wrote:
> On Fri, Mar 1, 2019 at 11:04 AM Pingfan Liu  wrote:
> >
> > Hi Borislav,
> >
> > Do you think the following patch is good at present?
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 81f9d23..9213073 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -460,7 +460,7 @@ static void __init
> > memblock_x86_reserve_range_setup_data(void)
> >  # define CRASH_ADDR_LOW_MAX(512 << 20)
> >  # define CRASH_ADDR_HIGH_MAX   (512 << 20)
> >  #else
> > -# define CRASH_ADDR_LOW_MAX(896UL << 20)
> > +# define CRASH_ADDR_LOW_MAX(1 << 32)
> >  # define CRASH_ADDR_HIGH_MAX   MAXMEM
> >  #endif
> >
> Or the patch looks like:
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3d872a5..ed0def5 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -459,7 +459,7 @@ static void __init
> memblock_x86_reserve_range_setup_data(void)
>  # define CRASH_ADDR_LOW_MAX(512 << 20)
>  # define CRASH_ADDR_HIGH_MAX   (512 << 20)
>  #else
> -# define CRASH_ADDR_LOW_MAX(896UL << 20)
> +# define CRASH_ADDR_LOW_MAX(1 << 32)
>  # define CRASH_ADDR_HIGH_MAX   MAXMEM
>  #endif
> 
> @@ -551,6 +551,15 @@ static void __init reserve_crashkernel(void)
> high ? CRASH_ADDR_HIGH_MAX
>  : CRASH_ADDR_LOW_MAX,
> crash_size, CRASH_ALIGN);
> +#ifdef CONFIG_X86_64
> +   /*
> +* crashkernel=X reserve below 4G fails? Try MAXMEM
> +*/
> +   if (!high && !crash_base)
> +   crash_base = memblock_find_in_range(CRASH_ALIGN,
> +   CRASH_ADDR_HIGH_MAX,
> +   crash_size, CRASH_ALIGN);
> +#endif
> 
> which tries 0-4G first, then falls back to above 4G

This way looks good to me.  I will do some testing with old kexec-tools;
once testing is done I can take this up again and repost later with a
documentation update.  I will also split it into two patches: one to drop
the old limitation, another for the fallback.

Thanks
Dave


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-25 Thread Dave Young
On 02/25/19 at 12:00pm, Joerg Roedel wrote:
> On Fri, Feb 22, 2019 at 02:00:26PM +0100, Borislav Petkov wrote:
> > On Fri, Feb 22, 2019 at 09:42:41AM +0100, Joerg Roedel wrote:
> > > The current default of 256MB was found by experiments on a bigger
> > > number of machines, to create a reasonable default that is at least
> > > likely to be sufficient of an average machine.
> > 
> > Exactly, and this is what makes sense.
> > 
> > The code should try the requested reservation and if it fails, it should
> > try high allocation with default swiotlb size because we need to reserve
> > *some* range.
> 
> Right, makes sense. While at it, maybe it is time to move the default
> allocation policy to 'high' again. The change was reverted six years ago
> because it broke old kexec tools, but those are probably out-of-service
> now. I think this change would make the whole crashdump allocation
> process less fragile.

One concern about this: in the average case, one does not need that much
memory for kdump.  For example, in RHEL we use crashkernel=auto to
automatically reserve kdump kernel memory, and for x86 the reserved size
is currently as below:
1G-64G:160M,64G-1T:256M,1T-:512M

That means for a machine with less than 64G of memory we only reserve
160M, which works for most machines in our lab.

If we move to high as the default, it will allocate 160M high + 256M low.
That is too much for people who are fine with the default 160M, especially
for virtual machines with less memory (but > 4G).

To make the process less fragile, maybe we can remove the 896M limitation
and only try <4G first, then go to high.

Thanks
Dave


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-24 Thread Dave Young
On 02/24/19 at 09:25pm, Pingfan Liu wrote:
> On Fri, Feb 22, 2019 at 9:00 PM Borislav Petkov  wrote:
> >
> > On Fri, Feb 22, 2019 at 09:42:41AM +0100, Joerg Roedel wrote:
> > > The current default of 256MB was found by experiments on a bigger
> > > number of machines, to create a reasonable default that is at least
> > > likely to be sufficient of an average machine.
> >
> > Exactly, and this is what makes sense.
> >
> > The code should try the requested reservation and if it fails, it should
> > try high allocation with default swiotlb size because we need to reserve
> > *some* range.
> >
> > If that reservation succeeds, we should say something along the lines of
> >
> > "... requested range failed, reserved  range instead."
> >
> Maybe I misunderstood you, but does "requested range failed" mean that
> user specify the range? If yes, then it should be the duty of user as
> you said later, not the duty of kernel"

If you go with the changes in your current patch, it needs to say
something like:
"crashkernel: cannot find free memory under 4G, reserving XM@.. instead"

It also needs to print the reserved low memory area in case ,high is used.

But for the 896M -> 4G fallback, the 896M failure does not need to show in
dmesg; it is internal kernel logic.

Thanks
Dave


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-21 Thread Dave Young
On 02/21/19 at 06:13pm, Borislav Petkov wrote:
> On Wed, Feb 20, 2019 at 05:41:46PM +0800, Dave Young wrote:
> > Previously Joerg posted below patch, maybe he has some idea. Joerg?
> 
> Isn't it clear from the commit message?

Then, did that answer your question?

256M is set as a default value in the patch, but it cannot be expected to
satisfy all use cases; from the description, it is also possible that some
people will run out of the 256M, so the ,low and ,high formats still need
to exist even if we make crashkernel=X do the allocation automatically in
high when it fails in the low area.

crashkernel=X:  allocate in low first; if that is not possible, then
allocate in high.

If people have a lot of devices that need more swiotlb, they can manually
set ,high together with ,low.

What's your suggestion then?  Remove ,low and ,high and increase the
default 256M whenever we get failure bug reports?


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-20 Thread Dave Young
On 02/20/19 at 09:32am, Borislav Petkov wrote:
> On Mon, Feb 18, 2019 at 09:48:20AM +0800, Dave Young wrote:
> > It is ideal if kernel can do it automatically, but I'm not sure if
> > kernel can predict the swiotlb reserved size automatically.
> 
> Do you see how even more absurd this gets?
> 
> If the kernel cannot know the swiotlb reserved size automatically, how
> is the normal user even supposed to know?!
> 
> I see swiotlb_size_or_default() so we have a sane default which we fall
> back to. Now where's the problem with that?

Good question; I expect some answers from people who know more about the
background.  It would be good to have some actual test results; Pingfan
is trying to do some tests.

Previously Joerg posted the patch below; maybe he has some idea. Joerg?

commit 94fb9334182284e8e7e4bcb9125c25dc33af19d4
Author: Joerg Roedel 
Date:   Wed Jun 10 17:49:42 2015 +0200

x86/crash: Allocate enough low memory when crashkernel=high

Thanks
Dave


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-17 Thread Dave Young
On 02/15/19 at 11:24am, Borislav Petkov wrote:
> On Tue, Feb 12, 2019 at 04:48:16AM +0800, Dave Young wrote:
> > Even we make it automatic in kernel, but we have to have some default
> > value for swiotlb in case crashkernel can not find a free region under 4G.
> > So this default value can not work for every use cases, people need 
> > manually use crashkernel=,low and crashkernel=,high in case
> > crashkernel=X does not work.
> 
> Why would the user need to find swiotlb range? The kernel has all the
> information it requires at its finger tips in order to decide properly.
> 
> The user wants a crashkernel range, the kernel tries the low range =>
> no workie, then it tries the next range => workie but needs to allocate
> swiotlb range so that DMA can happen too. Doh, then the kernel does
> allocate that too.

It would be ideal if the kernel could do it automatically, but I'm not
sure the kernel can predict the swiotlb reserved size automatically.

Let's add more people to seek comments.

> 
> Why would the user need to do anything here?!
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask

2018-09-05 Thread Dave Young
[snip]
> 
> As previously mentioned, there are also many differences between kexec and 
> kdump. In general,
> kexec needs to look at all of available physical memory, but kdump doesn't 
> need.
> 
> For kexec, kexec-tools will read /sys/firmware/memmap and recreate the e820 
> ranges for the 2nd
> kernel. If it fails, will use /proc/iomem.
> 
> For kdump, kexec-tools will read /proc/iomem and recreate the e820 ranges for 
> kdump kernel.
> BTW: we can not get the range of persistent memory from /proc/iomem. So e820 
> ranges don't contain
> the persistent memory in kdump kernel, this is the real reason why i need to 
> strengthen the logic
> of adjusting memory encryption mask.

"persistent memory" is different, I think you meant about some reserved
memory instead

> 
> If kexec-tools also use /sys/firmware/memmap for kdump(like kexec), kdump 
> kernel can also work
> without a fix, but the kexec-tools will have to be modified. Are you sure 
> that you want me to
> fix kexec-tools instead of kernel?

Yes, please fix kexec-tools to pass the reserved ranges in e820; you will
not need this patch then.

Thanks
Dave


Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask

2018-09-03 Thread Dave Young
On 09/04/18 at 09:29am, Dave Young wrote:
> On 09/04/18 at 08:44am, Dave Young wrote:
> > On 09/03/18 at 10:06pm, lijiang wrote:
> > > 在 2018年09月03日 10:45, Dave Young 写道:
> > > > On 08/31/18 at 04:19pm, Lianbo Jiang wrote:
> > > >> For kdump kernel, when SME is enabled, the acpi table and dmi table will need
> > > >> to be remapped without the memory encryption mask. So we have to strengthen
> > > >> the logic in early_memremap_pgprot_adjust(), which makes us have an opportunity
> > > >> to adjust the memory encryption mask.
> > > >> Signed-off-by: Lianbo Jiang 
> > > >> ---
> > > >>  arch/x86/mm/ioremap.c | 9 -
> > > >>  1 file changed, 8 insertions(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> > > >> index e01e6c695add..f9d9a39955f3 100644
> > > >> --- a/arch/x86/mm/ioremap.c
> > > >> +++ b/arch/x86/mm/ioremap.c
> > > >> @@ -689,8 +689,15 @@ pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
> > > >>encrypted_prot = true;
> > > >>  
> > > >>if (sme_active()) {
> > > >> +/*
> > > >> + * In kdump kernel, the acpi table and dmi table will need
> > > >> + * to be remapped without the memory encryption mask. Here
> > > >> + * we have to strengthen the logic to adjust the memory
> > > >> + * encryption mask.
> > > > 
> > > > Assume the acpi/dmi tables are identical for both 1st kernel and kdump
> > > > kernel, I'm not sure what is the difference, why need special handling
> > > > for kdump. Can you add more explanations?
> > > > 
> > > 
> > > Ok, i will use a dmi example to explain this issue.
> > > 
> > > There are significant differences about E820 between the 1st kernel and 
> > > kdump kernel. I pasted them at bottom.
> > > 
> > > Firstly, we need to know how they are called.
> > > __acpi_map_table()\                                             / early_memremap_is_setup_data()
> > >                    |-> early_memremap() -> early_memremap_pgprot_adjust() -> | memremap_is_efi_data()
> > >  dmi_early_remap()/                                             \ memremap_should_map_decrypted() -> e820__get_entry_type()
> > > 
> > > Secondly, we also need to understand the memremap_should_map_decrypted(), 
> > > which is illustrated by the fake code.
> > > static bool memremap_should_map_decrypted(resource_size_t phys_addr,
> > >   unsigned long size)
> > > {
> > > 
> > > /* code ... */
> > > 
> > > switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
> > > case E820_TYPE_RESERVED:
> > > case E820_TYPE_ACPI:
> > > case E820_TYPE_NVS:
> > > case E820_TYPE_UNUSABLE:
> > > /* For SEV, these areas are encrypted */
> > > if (sev_active())
> > > break;
> > > /* Fallthrough */
> > > 
> > > case E820_TYPE_PRAM:
> > > /* For SME, these areas are decrypted */
> > > return true;
> > > default:
> > > /* these areas are encrypted by default*/
> > > break;
> > > }
> > > 
> > > return false;
> > > }
> > > 
> > > For the dmi case, the dmi base address is 0x6286b000 in my test machine.
> > > 
> > > In the 1st kernel, the e820__get_entry_type() can get a valid entry and 
> > > type by the dmi address, and we can also find the dmi base address from 
> > > e820.
> > > (see the 1st kernel log)
> > > 0x6286b000 ∈ [mem 0x6286b000-0x6286efff]
> > > So, these areas are decrypted according to the 
> > > memremap_should_map_decrypted().
> > > 
> > > In kdump kernel, the dmi base address is still 0x6286b000, but we can not 
> > > find the dmi base address 

Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask

2018-09-03 Thread Dave Young
On 09/04/18 at 08:44am, Dave Young wrote:
> On 09/03/18 at 10:06pm, lijiang wrote:
> > 在 2018年09月03日 10:45, Dave Young 写道:
> > > On 08/31/18 at 04:19pm, Lianbo Jiang wrote:
> > >> For kdump kernel, when SME is enabled, the acpi table and dmi table will need
> > >> to be remapped without the memory encryption mask. So we have to strengthen
> > >> the logic in early_memremap_pgprot_adjust(), which makes us have an opportunity
> > >> to adjust the memory encryption mask.
> > >>
> > >> Signed-off-by: Lianbo Jiang 
> > >> ---
> > >>  arch/x86/mm/ioremap.c | 9 -
> > >>  1 file changed, 8 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> > >> index e01e6c695add..f9d9a39955f3 100644
> > >> --- a/arch/x86/mm/ioremap.c
> > >> +++ b/arch/x86/mm/ioremap.c
> > >> @@ -689,8 +689,15 @@ pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
> > >>  encrypted_prot = true;
> > >>  
> > >>  if (sme_active()) {
> > >> +/*
> > >> + * In kdump kernel, the acpi table and dmi table will need
> > >> + * to be remapped without the memory encryption mask. Here
> > >> + * we have to strengthen the logic to adjust the memory
> > >> + * encryption mask.
> > > 
> > > Assume the acpi/dmi tables are identical for both 1st kernel and kdump
> > > kernel, I'm not sure what is the difference, why need special handling
> > > for kdump. Can you add more explanations?
> > > 
> > 
> > Ok, i will use a dmi example to explain this issue.
> > 
> > There are significant differences about E820 between the 1st kernel and 
> > kdump kernel. I pasted them at bottom.
> > 
> > Firstly, we need to know how they are called.
> > __acpi_map_table()\                          / early_memremap_is_setup_data()
> >                    |-> early_memremap() -> early_memremap_pgprot_adjust() -> | memremap_is_efi_data()
> >  dmi_early_remap()/                          \ memremap_should_map_decrypted() -> e820__get_entry_type()
> > 
> > Secondly, we also need to understand the memremap_should_map_decrypted(), 
> > which is illustrated by the fake code.
> > static bool memremap_should_map_decrypted(resource_size_t phys_addr,
> >   unsigned long size)
> > {
> > 
> > /* code ... */
> > 
> > switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
> > case E820_TYPE_RESERVED:
> > case E820_TYPE_ACPI:
> > case E820_TYPE_NVS:
> > case E820_TYPE_UNUSABLE:
> > /* For SEV, these areas are encrypted */
> > if (sev_active())
> > break;
> > /* Fallthrough */
> > 
> > case E820_TYPE_PRAM:
> > /* For SME, these areas are decrypted */
> > return true;
> > default:
> > /* these areas are encrypted by default*/
> > break;
> > }
> > 
> > return false;
> > }
> > 
> > For the dmi case, the dmi base address is 0x6286b000 in my test machine.
> > 
> > In the 1st kernel, e820__get_entry_type() can get a valid entry and type
> > for the dmi address, and we can also find the dmi base address in e820
> > (see the 1st kernel log):
> > 0x6286b000 ∈ [mem 0x6286b000-0x6286efff]
> > So these areas are decrypted, according to
> > memremap_should_map_decrypted().
> > 
> > In the kdump kernel, the dmi base address is still 0x6286b000, but we can
> > no longer find it in e820. e820__get_entry_type() cannot get a valid entry
> > and type for the dmi base address, so it goes into the default branch.
> > That is to say, these areas are treated as encrypted. In fact, these areas
> > are decrypted, so we have to strengthen the logic that adjusts the memory
> > encryption mask.
> > 
> > 
> > The 1st kernel log:
> > 
> > [0.00] BIOS-provided physical RAM map:
> > [0.00] BIOS-e820: [mem 0x-0x0

Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask

2018-09-03 Thread Dave Young
On 09/03/18 at 10:06pm, lijiang wrote:
> 在 2018年09月03日 10:45, Dave Young 写道:
> > On 08/31/18 at 04:19pm, Lianbo Jiang wrote:
> >> For kdump kernel, when SME is enabled, the acpi table and dmi table will 
> >> need
> >> to be remapped without the memory encryption mask. So we have to strengthen
> >> the logic in early_memremap_pgprot_adjust(), which makes us have an 
> >> opportunity
> >> to adjust the memory encryption mask.
> >>
> >> Signed-off-by: Lianbo Jiang 
> >> ---
> >>  arch/x86/mm/ioremap.c | 9 -
> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> >> index e01e6c695add..f9d9a39955f3 100644
> >> --- a/arch/x86/mm/ioremap.c
> >> +++ b/arch/x86/mm/ioremap.c
> >> @@ -689,8 +689,15 @@ pgprot_t __init 
> >> early_memremap_pgprot_adjust(resource_size_t phys_addr,
> >>encrypted_prot = true;
> >>  
> >>if (sme_active()) {
> >> +/*
> >> + * In kdump kernel, the acpi table and dmi table will need
> >> + * to be remapped without the memory encryption mask. Here
> >> + * we have to strengthen the logic to adjust the memory
> >> + * encryption mask.
> > 
> > Assuming the acpi/dmi tables are identical for both the 1st kernel and the
> > kdump kernel, I'm not sure what the difference is, or why kdump needs
> > special handling. Can you add more explanation?
> > 
> 
> OK, I will use a dmi example to explain this issue.
> 
> There are significant differences in the E820 map between the 1st kernel
> and the kdump kernel. I pasted them at the bottom.
> 
> Firstly, we need to know how they are called.
> __acpi_map_table()\                          / early_memremap_is_setup_data()
>                    |-> early_memremap() -> early_memremap_pgprot_adjust() -> | memremap_is_efi_data()
>  dmi_early_remap()/                          \ memremap_should_map_decrypted() -> e820__get_entry_type()
> 
> Secondly, we also need to understand the memremap_should_map_decrypted(), 
> which is illustrated by the fake code.
> static bool memremap_should_map_decrypted(resource_size_t phys_addr,
>   unsigned long size)
> {
> 
> /* code ... */
> 
> switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
> case E820_TYPE_RESERVED:
> case E820_TYPE_ACPI:
> case E820_TYPE_NVS:
> case E820_TYPE_UNUSABLE:
> /* For SEV, these areas are encrypted */
> if (sev_active())
> break;
> /* Fallthrough */
> 
> case E820_TYPE_PRAM:
> /* For SME, these areas are decrypted */
> return true;
> default:
> /* these areas are encrypted by default*/
> break;
> }
> 
> return false;
> }
> 
> For the dmi case, the dmi base address is 0x6286b000 in my test machine.
> 
> In the 1st kernel, e820__get_entry_type() can get a valid entry and type
> for the dmi address, and we can also find the dmi base address in e820
> (see the 1st kernel log):
> 0x6286b000 ∈ [mem 0x6286b000-0x6286efff]
> So these areas are decrypted, according to
> memremap_should_map_decrypted().
> 
> In the kdump kernel, the dmi base address is still 0x6286b000, but we can
> no longer find it in e820. e820__get_entry_type() cannot get a valid entry
> and type for the dmi base address, so it goes into the default branch.
> That is to say, these areas are treated as encrypted. In fact, these areas
> are decrypted, so we have to strengthen the logic that adjusts the memory
> encryption mask.
> 
> 
> The 1st kernel log:
> 
> [0.00] BIOS-provided physical RAM map:
> [0.00] BIOS-e820: [mem 0x-0x0008bfff] usable
> [0.00] BIOS-e820: [mem 0x0008c000-0x0009] reserved
> [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
> [0.00] BIOS-e820: [mem 0x0010-0x29920fff] usable
> [0.00] BIOS-e820: [mem 0x29921000-0x29921fff] reserved
> [0.00] BIOS-e820: [mem 0x29922000-0x62256fff] usable
> [0.00] BIOS-e820: [mem 0x62257000-0x62356fff] reserved
> [0.00] BIOS-e820: [mem 0x62357000-0x6235cfff] ACPI 
> data
>

Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask

2018-09-02 Thread Dave Young
On 08/31/18 at 04:19pm, Lianbo Jiang wrote:
> For kdump kernel, when SME is enabled, the acpi table and dmi table will need
> to be remapped without the memory encryption mask. So we have to strengthen
> the logic in early_memremap_pgprot_adjust(), which makes us have an 
> opportunity
> to adjust the memory encryption mask.
> 
> Signed-off-by: Lianbo Jiang 
> ---
>  arch/x86/mm/ioremap.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index e01e6c695add..f9d9a39955f3 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -689,8 +689,15 @@ pgprot_t __init 
> early_memremap_pgprot_adjust(resource_size_t phys_addr,
>   encrypted_prot = true;
>  
>   if (sme_active()) {
> +/*
> + * In kdump kernel, the acpi table and dmi table will need
> + * to be remapped without the memory encryption mask. Here
> + * we have to strengthen the logic to adjust the memory
> + * encryption mask.

Assuming the acpi/dmi tables are identical for both the 1st kernel and the
kdump kernel, I'm not sure what the difference is, or why kdump needs
special handling. Can you add more explanation?

> + */
>   if (early_memremap_is_setup_data(phys_addr, size) ||
> - memremap_is_efi_data(phys_addr, size))
> + memremap_is_efi_data(phys_addr, size) ||
> + is_kdump_kernel())
>   encrypted_prot = false;
>   }
>  
> -- 
> 2.17.1
> 

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/5 V5] Add a function(ioremap_encrypted) for kdump when AMD sme enabled

2018-07-19 Thread Dave Young
On 07/20/18 at 01:06pm, lijiang wrote:
> 在 2018年07月14日 01:08, Borislav Petkov 写道:
> > On Mon, Jul 09, 2018 at 09:55:35PM +0800, lijiang wrote:
> >> About this issue, i want to use an example to describe it.
> >> /* drivers/iommu/amd_iommu_init.c */
> >> static u8 __iomem * __init iommu_map_mmio_space(u64 address, u64 end)
> > 
> > Those addresses come from the IVHD header which is an ACPI table. So the
> > dump kernel can find that out too.
> Sure. I think I understand what you mean: we would have to find all such
> addresses in order to cover every case in the kdump kernel; those addresses
> might include MMIO space, HPET, ACPI device table, ERST, and so on...
> 
> >> Obviously, the iommu mmio space is not encrypted, and the device
> >> mmio space is outside kdump kernel. We know that the old memory is
> >> encrypted, and the old memory is also outside kdump kernel. For the
> >> current case, e820__get_entry_type() and walk_iomem_res_desc() can't
> >> get the desired result, so we can't also decide whether encryption
> >> or not according to this result(rules). If we want to know whether
> >> encryption or not by deducing the address, we will need to read the
> >> content of memory and have a reference value for comparison, then
> >> what's a reference value? Sometimes we don't know that.
> > 
> > Again, if we don't know that how is the *caller* supposed to know
> > whether the memory is encrypted or not? Because
> > 
> > "we" == "caller"
> > 
> > in the kdump kernel.
> > 
> > And the more important question is, why are we dumping MMIO space of the
> > previous kernel *at* *all*? That doesn't make any sense to me.
> > 
> Sorry for my late reply.
> Here, we don't need to dump the MMIO space of the previous kernel. When the
> kdump kernel boots, the MMIO addresses will be remapped decrypted, but the
> MMIO addresses don't belong to the crash reserved memory range; for the
> kdump kernel, both the MMIO space and the IOMMU device table are outside
> addresses. However, the IOMMU device table is encrypted in the first
> kernel, and the kdump kernel needs to copy its contents from the first
> kernel at boot, so the IOMMU device table will be remapped encrypted.
> So some of these addresses need to be remapped encrypted, and some need to
> be remapped decrypted.
> 

There could be some misunderstanding here.  From the code of
copy_device_table() in amd_iommu_init.c, the iommu table entry is retrieved
by reading an mmio address, and then memremap is used to map the entry
address for copying the device table, so the thing related to your patch is
the dev table entry address, not the mmio address.

As for why we need to copy the old dev table, I think it is for addressing
the in-flight DMA issue, just like the git log of the commit below; the
commit is for the Intel IOMMU, but I think the AMD IOMMU solution is
similar:

commit 091d42e43d21b6ca7ec39bf5f9e17bc0bd8d4312
Author: Joerg Roedel 
Date:   Fri Jun 12 11:56:10 2015 +0200

iommu/vt-d: Copy translation tables from old kernel

If we are in a kdump kernel and find translation enabled in
the iommu, try to copy the translation tables from the old
kernel to preserve the mappings until the device driver
takes over.

Baoquan knows more about this I think he can correct if I'm wrong.

Thanks
Dave

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 4/4 V3] Help to dump the old memory encrypted into vmcore file

2018-06-18 Thread Dave Young
On 06/16/18 at 04:27pm, Lianbo Jiang wrote:
> In kdump mode, we need to dump the old memory into vmcore file,
> if SME is enabled in the first kernel, we must remap the old
> memory in encrypted manner, which will be automatically decrypted
> when we read from DRAM. It helps to parse the vmcore for some tools.
> 
> Signed-off-by: Lianbo Jiang 
> ---
> Some changes:
> 1. add a new file and modify Makefile.
> 2. remove some code in sev_active().
> 
>  arch/x86/kernel/Makefile |  1 +
>  arch/x86/kernel/crash_dump_encrypt.c | 53 
> 
>  fs/proc/vmcore.c | 20 ++
>  include/linux/crash_dump.h   | 11 
>  4 files changed, 79 insertions(+), 6 deletions(-)
>  create mode 100644 arch/x86/kernel/crash_dump_encrypt.c
> 
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 02d6f5c..afb5bad 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -96,6 +96,7 @@ obj-$(CONFIG_KEXEC_CORE)+= machine_kexec_$(BITS).o
>  obj-$(CONFIG_KEXEC_CORE) += relocate_kernel_$(BITS).o crash.o
>  obj-$(CONFIG_KEXEC_FILE) += kexec-bzimage64.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o
> +obj-$(CONFIG_AMD_MEM_ENCRYPT)+= crash_dump_encrypt.o
>  obj-y+= kprobes/
>  obj-$(CONFIG_MODULES)+= module.o
>  obj-$(CONFIG_DOUBLEFAULT)+= doublefault.o
> diff --git a/arch/x86/kernel/crash_dump_encrypt.c 
> b/arch/x86/kernel/crash_dump_encrypt.c
> new file mode 100644
> index 000..e44ef33
> --- /dev/null
> +++ b/arch/x86/kernel/crash_dump_encrypt.c
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + *   Memory preserving reboot related code.
> + *
> + *   Created by: Lianbo Jiang (liji...@redhat.com)
> + *   Copyright (C) RedHat Corporation, 2018. All rights reserved
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * copy_oldmem_page_encrypted - copy one page from "oldmem encrypted"
> + * @pfn: page frame number to be copied
> + * @buf: target memory address for the copy; this can be in kernel address
> + *   space or user address space (see @userbuf)
> + * @csize: number of bytes to copy
> + * @offset: offset in bytes into the page (based on pfn) to begin the copy
> + * @userbuf: if set, @buf is in user address space, use copy_to_user(),
> + *   otherwise @buf is in kernel address space, use memcpy().
> + *
> + * Copy a page from "oldmem encrypted". For this page, there is no pte
> + * mapped in the current kernel. We stitch up a pte, similar to
> + * kmap_atomic.
> + */
> +
> +ssize_t copy_oldmem_page_encrypted(unsigned long pfn, char *buf,
> + size_t csize, unsigned long offset, int userbuf)
> +{
> + void  *vaddr;
> +
> + if (!csize)
> + return 0;
> +
> + vaddr = (__force void *)ioremap_encrypted(pfn << PAGE_SHIFT,
> +   PAGE_SIZE);
> + if (!vaddr)
> + return -ENOMEM;
> +
> + if (userbuf) {
> + if (copy_to_user((void __user *)buf, vaddr + offset, csize)) {
> + iounmap((void __iomem *)vaddr);
> + return -EFAULT;
> + }
> + } else
> + memcpy(buf, vaddr + offset, csize);
> +
> + set_iounmap_nonlazy();
> + iounmap((void __iomem *)vaddr);
> + return csize;
> +}
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index a45f0af..5200266 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -25,6 +25,8 @@
>  #include 
>  #include 
>  #include "internal.h"
> +#include 
> +#include 
>  
>  /* List representing chunks of contiguous memory areas and their offsets in
>   * vmcore file.
> @@ -86,7 +88,8 @@ static int pfn_is_ram(unsigned long pfn)
>  
>  /* Reads a page from the oldmem device from given offset. */
>  static ssize_t read_from_oldmem(char *buf, size_t count,
> - u64 *ppos, int userbuf)
> + u64 *ppos, int userbuf,
> + bool encrypted)
>  {
>   unsigned long pfn, offset;
>   size_t nr_bytes;
> @@ -108,8 +111,11 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
>   if (pfn_is_ram(pfn) == 0)
>   memset(buf, 0, nr_bytes);
>   else {
> - tmp = copy_oldmem_page(pfn, buf, nr_bytes,
> - offset, userbuf);
> + tmp = encrypted ? copy_oldmem_page_encrypted(pfn,
> + buf, nr_bytes, offset, userbuf)
> + : copy_oldmem_page(pfn, buf, nr_bytes,
> +offset, userbuf);
> +
>   if (tmp < 0)
>   return tmp;
>   }
> @@ -143,7 +149,7 @@ void __weak elfcorehdr_free(unsigned long long addr)
>   */
>  

Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-05-25 Thread Dave Young
Ccing Xunlei; he is reading the patches to see what needs to be done for
kdump. There should still be several places to handle to make kdump work.

On 05/18/17 at 07:01pm, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 04:22:12PM -0500, Tom Lendacky wrote:
> > Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> > determine if SME is active.
> 
> But why do user-space tools need to know that?
> 
> I mean, when we load the kdump kernel, we do it with the first kernel,
> with the kexec_load() syscall, AFAICT. And that code does a lot of
> things during that init, like machine_kexec_prepare()->init_pgtable() to
> prepare the ident mapping of the second kernel, for example.
> 
> What I'm aiming at is that the first kernel knows *exactly* whether SME
> is enabled or not and doesn't need to tell the second one through some
> sysfs entries - it can do that during loading.
> 
> So I don't think we need any userspace things at all...

If the kdump kernel can get the SME status from a hardware register then
this should not be necessary and this patch can be dropped.

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-04-27 Thread Dave Young
On 04/27/17 at 08:52am, Dave Hansen wrote:
> On 04/27/2017 12:25 AM, Dave Young wrote:
> > On 04/21/17 at 02:55pm, Dave Hansen wrote:
> >> On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> >>> Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> >>> determine if SME is active.
> >>>
> >>> A new directory will be created:
> >>>   /sys/kernel/mm/sme/
> >>>
> >>> And two entries within the new directory:
> >>>   /sys/kernel/mm/sme/active
> >>>   /sys/kernel/mm/sme/encryption_mask
> >>
> >> Why do they care, and what will they be doing with this information?
> > 
> > kdump will copy old memory, and it needs this to know whether the old
> > memory was encrypted. With this sysfs file we can learn the previous SME
> > status and pass it to the kdump kernel, e.g. as a kernel parameter.
> > 
> > Tom, have you got chance to try if it works or not?
> 
> What will the kdump kernel do with it though?  We kexec() into that
> kernel so the SME keys will all be the same, right?  So, will the kdump
> kernel be just setting the encryption bit in the PTE so it can copy the
> old plaintext out?

I assume it is for the active -> non-active case: the new boot needs to know
that the old memory is encrypted. But I have not read all the patches, so I
may be missing things.

> 
> Why do we need both 'active' and 'encryption_mask'?  How could it be
> that the hardware-enumerated 'encryption_mask' changes across a kexec()?

Leave this question to Tom..

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-04-27 Thread Dave Young
On 04/21/17 at 02:55pm, Dave Hansen wrote:
> On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> > Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> > determine if SME is active.
> > 
> > A new directory will be created:
> >   /sys/kernel/mm/sme/
> > 
> > And two entries within the new directory:
> >   /sys/kernel/mm/sme/active
> >   /sys/kernel/mm/sme/encryption_mask
> 
> Why do they care, and what will they be doing with this information?

kdump will copy old memory, and it needs this to know whether the old
memory was encrypted. With this sysfs file we can learn the previous SME
status and pass it to the kdump kernel, e.g. as a kernel parameter.

Tom, have you got chance to try if it works or not?

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 25/28] x86: Access the setup data through sysfs decrypted

2017-03-07 Thread Dave Young
On 02/16/17 at 09:47am, Tom Lendacky wrote:
> Use memremap() to map the setup data.  This will make the appropriate
> decision as to whether a RAM remapping can be done or if a fallback to
> ioremap_cache() is needed (similar to the setup data debugfs support).
> 
> Signed-off-by: Tom Lendacky <thomas.lenda...@amd.com>
> ---
>  arch/x86/kernel/ksysfs.c |   27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kernel/ksysfs.c b/arch/x86/kernel/ksysfs.c
> index 4afc67f..d653b3e 100644
> --- a/arch/x86/kernel/ksysfs.c
> +++ b/arch/x86/kernel/ksysfs.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -79,12 +80,12 @@ static int get_setup_data_paddr(int nr, u64 *paddr)
>   *paddr = pa_data;
>   return 0;
>   }
> - data = ioremap_cache(pa_data, sizeof(*data));
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>  
>   pa_data = data->next;
> - iounmap(data);
> + memunmap(data);
>   i++;
>   }
>   return -EINVAL;
> @@ -97,17 +98,17 @@ static int __init get_setup_data_size(int nr, size_t 
> *size)
>   u64 pa_data = boot_params.hdr.setup_data;
>  
>   while (pa_data) {
> - data = ioremap_cache(pa_data, sizeof(*data));
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>   if (nr == i) {
>   *size = data->len;
> - iounmap(data);
> + memunmap(data);
>   return 0;
>   }
>  
>   pa_data = data->next;
> - iounmap(data);
> + memunmap(data);
>   i++;
>   }
>   return -EINVAL;
> @@ -127,12 +128,12 @@ static ssize_t type_show(struct kobject *kobj,
>   ret = get_setup_data_paddr(nr, );
>   if (ret)
>   return ret;
> - data = ioremap_cache(paddr, sizeof(*data));
> + data = memremap(paddr, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>  
>   ret = sprintf(buf, "0x%x\n", data->type);
> - iounmap(data);
> + memunmap(data);
>   return ret;
>  }
>  
> @@ -154,7 +155,7 @@ static ssize_t setup_data_data_read(struct file *fp,
>   ret = get_setup_data_paddr(nr, );
>   if (ret)
>   return ret;
> - data = ioremap_cache(paddr, sizeof(*data));
> + data = memremap(paddr, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>  
> @@ -170,15 +171,15 @@ static ssize_t setup_data_data_read(struct file *fp,
>   goto out;
>  
>   ret = count;
> - p = ioremap_cache(paddr + sizeof(*data), data->len);
> + p = memremap(paddr + sizeof(*data), data->len, MEMREMAP_WB);
>   if (!p) {
>   ret = -ENOMEM;
>   goto out;
>   }
>   memcpy(buf, p + off, count);
> - iounmap(p);
> + memunmap(p);
>  out:
> - iounmap(data);
> + memunmap(data);
>   return ret;
>  }
>  
> @@ -250,13 +251,13 @@ static int __init get_setup_data_total_num(u64 pa_data, 
> int *nr)
>   *nr = 0;
>   while (pa_data) {
>   *nr += 1;
> - data = ioremap_cache(pa_data, sizeof(*data));
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
>   if (!data) {
>   ret = -ENOMEM;
>   goto out;
>   }
>   pa_data = data->next;
> - iounmap(data);
> + memunmap(data);
>   }
>  
>  out:
> 

It would be better if these cleanup patches were sent separately.

Acked-by: Dave Young <dyo...@redhat.com>

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 24/28] x86: Access the setup data through debugfs decrypted

2017-03-07 Thread Dave Young
On 02/16/17 at 09:47am, Tom Lendacky wrote:
> Use memremap() to map the setup data.  This simplifies the code and will
> make the appropriate decision as to whether a RAM remapping can be done
> or if a fallback to ioremap_cache() is needed (which includes checking
> PageHighMem).
> 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/kernel/kdebugfs.c |   30 +++---
>  1 file changed, 11 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kernel/kdebugfs.c b/arch/x86/kernel/kdebugfs.c
> index bdb83e4..c3d354d 100644
> --- a/arch/x86/kernel/kdebugfs.c
> +++ b/arch/x86/kernel/kdebugfs.c
> @@ -48,17 +48,13 @@ static ssize_t setup_data_read(struct file *file, char 
> __user *user_buf,
>  
>   pa = node->paddr + sizeof(struct setup_data) + pos;
>   pg = pfn_to_page((pa + count - 1) >> PAGE_SHIFT);
> - if (PageHighMem(pg)) {
> - p = ioremap_cache(pa, count);
> - if (!p)
> - return -ENXIO;
> - } else
> - p = __va(pa);
> + p = memremap(pa, count, MEMREMAP_WB);
> + if (!p)
> + return -ENXIO;

-ENOMEM looks better for memremap, ditto for other places..

>  
>   remain = copy_to_user(user_buf, p, count);
>  
> - if (PageHighMem(pg))
> - iounmap(p);
> + memunmap(p);
>  
>   if (remain)
>   return -EFAULT;
> @@ -127,15 +123,12 @@ static int __init create_setup_data_nodes(struct dentry 
> *parent)
>   }
>  
>   pg = pfn_to_page((pa_data+sizeof(*data)-1) >> PAGE_SHIFT);
> - if (PageHighMem(pg)) {
> - data = ioremap_cache(pa_data, sizeof(*data));
> - if (!data) {
> - kfree(node);
> - error = -ENXIO;
> - goto err_dir;
> - }
> - } else
> - data = __va(pa_data);
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
> + if (!data) {
> + kfree(node);
> + error = -ENXIO;
> + goto err_dir;
> + }
>  
>   node->paddr = pa_data;
>   node->type = data->type;
> @@ -143,8 +136,7 @@ static int __init create_setup_data_nodes(struct dentry 
> *parent)
>   error = create_setup_data_node(d, no, node);
>   pa_data = data->next;
>  
> - if (PageHighMem(pg))
> - iounmap(data);
> + memunmap(data);
>   if (error)
>   goto err_dir;
>   no++;
> 

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 14/28] Add support to access boot related data in the clear

2017-03-07 Thread Dave Young
On 02/16/17 at 09:45am, Tom Lendacky wrote:
[snip]
> + * This function determines if an address should be mapped encrypted.
> + * Boot setup data, EFI data and E820 areas are checked in making this
> + * determination.
> + */
> +static bool memremap_should_map_encrypted(resource_size_t phys_addr,
> +   unsigned long size)
> +{
> + /*
> +  * SME is not active, return true:
> +  *   - For early_memremap_pgprot_adjust(), returning true or false
> +  * results in the same protection value
> +  *   - For arch_memremap_do_ram_remap(), returning true will allow
> +  * the RAM remap to occur instead of falling back to ioremap()
> +  */
> + if (!sme_active())
> + return true;

From the function name, shouldn't the above be return false?

> +
> + /* Check if the address is part of the setup data */
> + if (memremap_is_setup_data(phys_addr, size))
> + return false;
> +
> + /* Check if the address is part of EFI boot/runtime data */
> + switch (efi_mem_type(phys_addr)) {
> + case EFI_BOOT_SERVICES_DATA:
> + case EFI_RUNTIME_SERVICES_DATA:

Are only these two types needed? I'm not sure about this, just bringing up
the question.

> + return false;
> + default:
> + break;
> + }
> +
> + /* Check if the address is outside kernel usable area */
> + switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
> + case E820_TYPE_RESERVED:
> + case E820_TYPE_ACPI:
> + case E820_TYPE_NVS:
> + case E820_TYPE_UNUSABLE:
> + return false;
> + default:
> + break;
> + }
> +
> + return true;
> +}
> +

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 26/28] x86: Allow kexec to be used with SME

2017-03-01 Thread Dave Young
Add kexec list..

On 03/01/17 at 05:25pm, Dave Young wrote:
> Hi Tom,
> 
> On 02/17/17 at 10:43am, Tom Lendacky wrote:
> > On 2/17/2017 9:57 AM, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Feb 16, 2017 at 09:47:55AM -0600, Tom Lendacky wrote:
> > > > Provide support so that kexec can be used to boot a kernel when SME is
> > > > enabled.
> > > 
> > > Is the point of kexec and kdump to ehh, dump memory ? But if the
> > > rest of the memory is encrypted you won't get much, will you?
> > 
> > Kexec can be used to reboot a system without going back through BIOS.
> > So you can use kexec without using kdump.
> > 
> > For kdump, just taking a quick look, the option to enable memory
> > encryption can be provided on the crash kernel command line and then
> 
> Is there a simple way to get the SME status? Probably add some sysfs
> file for this purpose.
> 
> > crash kernel can would be able to copy the memory decrypted if the
> > pagetable is set up properly. It looks like currently ioremap_cache()
> > is used to map the old memory page.  That might be able to be changed
> > to a memremap() so that the encryption bit is set in the mapping. That
> > will mean that memory that is not marked encrypted (EFI tables, swiotlb
> > memory, etc) would not be read correctly.
> 
> Could we store info about those ranges which are not encrypted, so that
> memremap can handle them?
> 
> > 
> > > 
> > > Would it make sense to include some printk to the user if they
> > > are setting up kdump that they won't get anything out of it?
> > 
> > Probably a good idea to add something like that.
> 
> It will break kdump functionality; it should be fixed instead of
> just adding a printk to warn the user.
> 
> Thanks
> Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 26/28] x86: Allow kexec to be used with SME

2017-03-01 Thread Dave Young
Hi Tom,

On 02/17/17 at 10:43am, Tom Lendacky wrote:
> On 2/17/2017 9:57 AM, Konrad Rzeszutek Wilk wrote:
> > On Thu, Feb 16, 2017 at 09:47:55AM -0600, Tom Lendacky wrote:
> > > Provide support so that kexec can be used to boot a kernel when SME is
> > > enabled.
> > 
> > Is the point of kexec and kdump to ehh, dump memory ? But if the
> > rest of the memory is encrypted you won't get much, will you?
> 
> Kexec can be used to reboot a system without going back through BIOS.
> So you can use kexec without using kdump.
> 
> For kdump, just taking a quick look, the option to enable memory
> encryption can be provided on the crash kernel command line and then

Is there a simple way to get the SME status? Probably add some sysfs
file for this purpose.

> crash kernel can would be able to copy the memory decrypted if the
> pagetable is set up properly. It looks like currently ioremap_cache()
> is used to map the old memory page.  That might be able to be changed
> to a memremap() so that the encryption bit is set in the mapping. That
> will mean that memory that is not marked encrypted (EFI tables, swiotlb
> memory, etc) would not be read correctly.

Could we store info about those ranges which are not encrypted, so that
memremap can handle them?

> 
> > 
> > Would it make sense to include some printk to the user if they
> > are setting up kdump that they won't get anything out of it?
> 
> Probably a good idea to add something like that.

It will break kdump functionality; it should be fixed instead of
just adding a printk to warn the user.

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v11 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-18 Thread Dave Young
Hi,

On 05/18/15 at 06:05pm, Li, ZhenHua wrote:
> Hi Joerg,
> 
> Testing from HP: passed.
> Testing from He Baoquan (Redhat): passed.
> 
> The problem that the dmar fault came again when running v10 with the
> latest kernel is fixed.
> And I think there is no need to update the code to a new version now.
> 
> So please tell me if there is still any code that needs to be changed.

Have you reviewed all my comments? Although these are only iommu driver
fixes, the patches were sent to lkml, no? If you do not want other people's
comments, why send them to us? I would say, do not assume this is a closed
circle for the iommu driver only.

It is not easy to review patches, actually; frankly, I feel kernel code
review is not as strict as in the early days. Maybe one reason is that
people are getting tired of arguing.

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v11 05/10] iommu/vt-d: Add functions to load and save old re

2015-05-13 Thread Dave Young
On 05/13/15 at 09:47am, Li, ZhenHua wrote:
> On 05/12/2015 04:37 PM, Dave Young wrote:
> > Seems the subject was truncated? Maybe "re" means root entry? Then please
> > fix it.
> > 
> > On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
> > > Add functions to load the root entry table from the old kernel, and to
> > > save the updated root entry table.
> > > Add two members in struct intel_iommu, to store the RTA in the old
> > > kernel, and the mapped virt address of it.
> > 
> > Please explain a bit what RTA is in the patch log; it is unclear to most
> > people who do not know iommu details.
> 
> If people want to understand this patchset, I assume they have read the
> vt-d specs. RTA is defined clearly in this spec.

I think explaining it a bit is better; it will make review easier. RTA is
defined in the spec, right, but if you refer to it in kernel source code,
describing its meaning will help people understand your code.

I would not insist on this 'RTA' point alone, but a descriptive subject and
patch log are still important; please check the logs, not only for this patch.

Thanks
Dave


Re: [PATCH v11 02/10] iommu/vt-d: Items required for kdump

2015-05-13 Thread Dave Young
On 05/13/15 at 09:45am, Li, ZhenHua wrote:
 On 05/12/2015 04:17 PM, Dave Young wrote:
 On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
 Add context entry functions needed for kdump.
 +/*
 + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
 + *
 + * Fixes the crashdump kernel to deal with an active iommu and legacy
 + * DMA from the (old) panicked kernel in a manner similar to how legacy
 + * DMA is handled when no hardware iommu was in use by the old kernel --
 + * allow the legacy DMA to continue into its current buffers.
 + *
 + * In the crashdump kernel, this code:
 + * 1. skips disabling the IOMMU's translating.
 + * 2. Do not re-enable IOMMU's translating.
 + * 3. In kdump kernel, use the old root entry table.
 + * 4. Allocate pages for new context entry, copy data from old context 
 entries
 + *in the old kernel to the new ones.
 + *
 + * In other kinds of kernel, for example, a kernel started by kexec,
 + * do the same thing as crashdump kernel.
 + */
 +
 +
 
 Above comments should come along with the code changes instead of putting it
 in this patch.
 
 BTW, there's one more blank line at the end..
 Code change is in 00/10, the cover letter.

I mean the real code, not the changelog.

 And the blank does not matter, I checked the patch with
 scripts/checkpatch.pl,  no warnings, no errors.

Why add two line breaks if one is enough? Adding such a check to checkpatch.pl
would actually make sense to me.

Thanks
Dave


Re: [PATCH v11 05/10] iommu/vt-d: Add functions to load and save old re

2015-05-12 Thread Dave Young
Seems the subject was truncated? Maybe 're' means root entry? Please fix it.

On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
 Add functions to load root entry table from old kernel, and to save updated
 root entry table.
 Add two member in struct intel_iommu, to store the RTA in old kernel, and
 the mapped virt address of it.

Please explain a bit what RTA is in the patch log; it is unclear to most
people who do not know iommu details.

 
 We use the old RTA in dump kernel, and when the iommu-root_entry is used as
 a cache in kdump kernel, its phys address will not be save to RTA register,
 but when its data is changed, we will save the new data to old root entry 
 table.
 
 Li, Zhen-hua:
 The functions and logics.
 
 Takao Indoh:
 Add __iommu_flush_cache.
 
 Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
 Signed-off-by: Takao Indoh indou.ta...@jp.fujitsu.com
 ---
  drivers/iommu/intel-iommu.c | 54 
 -
  include/linux/intel-iommu.h |  3 +++
  2 files changed, 56 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
 index 0b97c15..3a5d446 100644
 --- a/drivers/iommu/intel-iommu.c
 +++ b/drivers/iommu/intel-iommu.c
 @@ -371,6 +371,10 @@ static struct context_entry 
 *device_to_existing_context_entry(
   struct intel_iommu *iommu,
   u8 bus, u8 devfn);
  
 +static void __iommu_load_old_root_entry(struct intel_iommu *iommu);
 +
 +static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int 
 index);
 +
  /*
   * A structure used to store the address allocated by ioremap();
   * The we need to call iounmap() to free them out of 
 spin_lock_irqsave/unlock;
 @@ -382,7 +386,6 @@ struct iommu_remapped_entry {
  static LIST_HEAD(__iommu_remapped_mem);
  static DEFINE_MUTEX(__iommu_mem_list_lock);
  
 -
  /*
   * This domain is a statically identity mapping domain.
   *   1. This domain creats a static 1:1 mapping to all usable memory.
 @@ -4935,3 +4938,52 @@ int __iommu_free_mapped_mem(void)
   return 0;
  }
  
 +/*
 + * Load the old root entry table to new root entry table.
 + */
 +static void __iommu_load_old_root_entry(struct intel_iommu *iommu)
 +{
 + if ((!iommu)
 + || (!iommu->root_entry)
 + || (!iommu->root_entry_old_virt)
 + || (!iommu->root_entry_old_phys))
 + return;
 + memcpy(iommu->root_entry, iommu->root_entry_old_virt, PAGE_SIZE);
 +
 + __iommu_flush_cache(iommu, iommu->root_entry, PAGE_SIZE);
 +}
 +
 +/*
 + * When the data in new root entry table is changed, this function
 + * must be called to save the updated data to old root entry table.
 + */
 +static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int 
 index)
 +{
 + u8 start;
 + unsigned long size;
 + void __iomem *to;
 + void *from;
 +
 + if ((!iommu)
 + || (!iommu->root_entry)
 + || (!iommu->root_entry_old_virt)
 + || (!iommu->root_entry_old_phys))
 + return;
 +
 + if (index < -1 || index >= ROOT_ENTRY_NR)
 + return;
 +
 + if (index == -1) {
 + start = 0;
 + size = ROOT_ENTRY_NR * sizeof(struct root_entry);
 + } else {
 + start = index * sizeof(struct root_entry);
 + size = sizeof(struct root_entry);
 + }
 + to = iommu->root_entry_old_virt;
 + from = iommu->root_entry;
 + memcpy(to + start, from + start, size);
 +
 + __iommu_flush_cache(iommu, to + start, size);
 +}
 +
 diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
 index ced1fac..e7cac12 100644
 --- a/include/linux/intel-iommu.h
 +++ b/include/linux/intel-iommu.h
 @@ -340,6 +340,9 @@ struct intel_iommu {
   spinlock_t  lock; /* protect context, domain ids */
   struct root_entry *root_entry; /* virtual address */
  
 + void __iomem *root_entry_old_virt; /* mapped from old root entry */
 + unsigned long   root_entry_old_phys; /* root entry in old kernel */
 +
   struct iommu_flush flush;
  #endif
   struct q_inval  *qi;/* Queued invalidation info */
 -- 
 2.0.0-rc0
 


Re: [PATCH v11 02/10] iommu/vt-d: Items required for kdump

2015-05-12 Thread Dave Young
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
 Add context entry functions needed for kdump.
 
 Bill Sumner:
 Original version;
 
 Li, Zhenhua:
 Changed the name of new functions, make them consistent with current
 context get/set functions.
 Remove the structure dve which is not used in new version.
 
 Signed-off-by: Bill Sumner billsumnerli...@gmail.com
 Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
 ---
  drivers/iommu/intel-iommu.c | 72 
 +
  1 file changed, 72 insertions(+)
 
 diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
 index cb9d6cc..1e7ceb5 100644
 --- a/drivers/iommu/intel-iommu.c
 +++ b/drivers/iommu/intel-iommu.c
 @@ -190,6 +190,31 @@ struct root_entry {
  };
  #define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
  
 +static inline bool root_present(struct root_entry *root)
 +{
 + return (root->lo & 1);
 +}
 +
 +static inline void set_root_value(struct root_entry *root, unsigned long value)
 +{
 + root->lo &= ~VTD_PAGE_MASK;
 + root->lo |= value & VTD_PAGE_MASK;
 +}
 +
 +static inline struct context_entry *
 +get_context_addr_from_root(struct root_entry *root)
 +{
 + return (struct context_entry *)
 + (root_present(root) ? phys_to_virt(
 + root->lo & VTD_PAGE_MASK) :
 + NULL);
 +}
 +
 +static inline unsigned long
 +get_context_phys_from_root(struct root_entry *root)
 +{
 + return root_present(root) ? (root->lo & VTD_PAGE_MASK) : 0;
 +}
  
  /*
   * low 64 bits:
 @@ -211,6 +236,32 @@ static inline bool context_present(struct context_entry 
 *context)
  {
  return (context->lo & 1);
  }
 +
 +static inline int context_fault_enable(struct context_entry *c)
 +{
 + return((c->lo >> 1) & 0x1);
 +}
 +
 +static inline int context_translation_type(struct context_entry *c)
 +{
 + return((c->lo >> 2) & 0x3);
 +}
 +
 +static inline u64 context_address_root(struct context_entry *c)
 +{
 + return((c->lo >> VTD_PAGE_SHIFT));
 +}
 +
 +static inline int context_address_width(struct context_entry *c)
 +{
 + return((c->hi >> 0) & 0x7);
 +}
 +
 +static inline int context_domain_id(struct context_entry *c)
 +{
 + return((c->hi >> 8) & 0xffff);
 +}
 +
 +static inline void context_set_present(struct context_entry *context)
 +{
 + context->lo |= 1;
 @@ -296,6 +347,27 @@ static inline int first_pte_in_page(struct dma_pte *pte)
  return !((unsigned long)pte & ~VTD_PAGE_MASK);
  }
  
 +
 +/*
 + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
 + *
 + * Fixes the crashdump kernel to deal with an active iommu and legacy
 + * DMA from the (old) panicked kernel in a manner similar to how legacy
 + * DMA is handled when no hardware iommu was in use by the old kernel --
 + * allow the legacy DMA to continue into its current buffers.
 + *
 + * In the crashdump kernel, this code:
 + * 1. skips disabling the IOMMU's translating.
 + * 2. Do not re-enable IOMMU's translating.
 + * 3. In kdump kernel, use the old root entry table.
 + * 4. Allocate pages for new context entry, copy data from old context 
 entries
 + *in the old kernel to the new ones.
 + *
 + * In other kinds of kernel, for example, a kernel started by kexec,
 + * do the same thing as crashdump kernel.
 + */
 +
 +

The above comments should accompany the actual code changes instead of being
added in this patch.

BTW, there's one more blank line at the end..

  /*
   * This domain is a statically identity mapping domain.
   *   1. This domain creats a static 1:1 mapping to all usable memory.
 -- 
 2.0.0-rc0
 


Re: [PATCH v11 04/10] iommu/vt-d: functions to copy data from old mem

2015-05-12 Thread Dave Young
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
 Add some functions to copy the data from old kernel.
 These functions are used to copy context tables and page tables.
 
 To avoid calling iounmap between spin_lock_irqsave and spin_unlock_irqrestore,
 use a link here, store the pointers , and then use iounmap to free them in
 another place.
 
 Li, Zhen-hua:
 The functions and logics.
 
 Takao Indoh:
 Check if pfn is ram:
 if (page_is_ram(pfn))
 
 Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
 Signed-off-by: Takao Indoh indou.ta...@jp.fujitsu.com
 ---
  drivers/iommu/intel-iommu.c | 102 
 
  include/linux/intel-iommu.h |   6 +++
  2 files changed, 108 insertions(+)
 
 diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
 index 07e6118..0b97c15 100644
 --- a/drivers/iommu/intel-iommu.c
 +++ b/drivers/iommu/intel-iommu.c
 @@ -371,6 +371,17 @@ static struct context_entry 
 *device_to_existing_context_entry(
   struct intel_iommu *iommu,
   u8 bus, u8 devfn);
  
 +/*
 + * A structure used to store the address allocated by ioremap();
 + * The we need to call iounmap() to free them out of 
 spin_lock_irqsave/unlock;
 + */
 +struct iommu_remapped_entry {
 + struct list_head list;
 + void __iomem *mem;
 +};
 +static LIST_HEAD(__iommu_remapped_mem);
 +static DEFINE_MUTEX(__iommu_mem_list_lock);
 +
  
  /*
   * This domain is a statically identity mapping domain.
 @@ -4833,3 +4844,94 @@ static struct context_entry 
 *device_to_existing_context_entry(
   return ret;
  }
  
 +/*
 + * Copy memory from a physically-addressed area into a virtually-addressed 
 area
 + */
 +int __iommu_load_from_oldmem(void *to, unsigned long from, unsigned long 
 size)
 +{
 + unsigned long pfn;  /* Page Frame Number */
 + size_t csize = (size_t)size;/* Num(bytes to copy) */
 + unsigned long offset;   /* Lower 12 bits of to */

Comments for the above variables are unnecessary; they are clear enough.
BTW, using size_t for the size argument would be better.

 + void __iomem *virt_mem;
 + struct iommu_remapped_entry *mapped;
 +
 + pfn = from >> VTD_PAGE_SHIFT;
 + offset = from & (~VTD_PAGE_MASK);
 +
 + if (page_is_ram(pfn)) {
 + memcpy(to, pfn_to_kaddr(pfn) + offset, csize);
 + } else{
 +
 + mapped = kzalloc(sizeof(struct iommu_remapped_entry),
 + GFP_KERNEL);
 + if (!mapped)
 + return -ENOMEM;
 +
 + virt_mem = ioremap_cache((unsigned long)from, size);
 + if (!virt_mem) {
 + kfree(mapped);
 + return -ENOMEM;
 + }
 + memcpy(to, virt_mem, size);

csize or size? If, as in the previous comment, the argument becomes size_t,
it will be fine here.

 +
 + mutex_lock(&__iommu_mem_list_lock);
 + mapped->mem = virt_mem;
 + list_add_tail(&mapped->list, &__iommu_remapped_mem);
 + mutex_unlock(&__iommu_mem_list_lock);
 + }
 + return size;

type mismatch?

 +}
 +
 +/*
 + * Copy memory from a virtually-addressed area into a physically-addressed 
 area
 + */
 +int __iommu_save_to_oldmem(unsigned long to, void *from, unsigned long size)
 +{
 + unsigned long pfn;  /* Page Frame Number */
 + size_t csize = (size_t)size;/* Num(bytes to copy) */
 + unsigned long offset;   /* Lower 12 bits of to */

Ditto regarding the data types and comments.

 + void __iomem *virt_mem;
 + struct iommu_remapped_entry *mapped;
 +
 + pfn = to >> VTD_PAGE_SHIFT;
 + offset = to & (~VTD_PAGE_MASK);
 +
 + if (page_is_ram(pfn)) {
 + memcpy(pfn_to_kaddr(pfn) + offset, from, csize);
 + } else{
 + mapped = kzalloc(sizeof(struct iommu_remapped_entry),
 + GFP_KERNEL);
 + if (!mapped)
 + return -ENOMEM;
 +
 + virt_mem = ioremap_cache((unsigned long)to, size);
 + if (!virt_mem) {
 + kfree(mapped);
 + return -ENOMEM;
 + }
 + memcpy(virt_mem, from, size);
 + mutex_lock(&__iommu_mem_list_lock);
 + mapped->mem = virt_mem;
 + list_add_tail(&mapped->list, &__iommu_remapped_mem);
 + mutex_unlock(&__iommu_mem_list_lock);
 + }
 + return size;
 +}

The above two functions look duplicated; can they be merged into one
function with an argument for the direction?
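As a rough userspace sketch of what that merge could look like (function and
flag names are hypothetical, and a plain memcpy stands in for the
page_is_ram()/ioremap_cache() handling and remap bookkeeping of the real patch):

```c
#include <string.h>

/* Direction of the copy relative to the old kernel's memory. */
enum oldmem_dir { OLDMEM_LOAD, OLDMEM_SAVE };

/*
 * Hypothetical merged helper: one body plus a direction flag instead of
 * two near-identical functions.  'old' is simply a pointer standing in
 * for the (possibly ioremap'ed) old-kernel memory.
 */
static void oldmem_copy(void *buf, void *old, size_t size, enum oldmem_dir dir)
{
	if (dir == OLDMEM_LOAD)
		memcpy(buf, old, size);	/* old kernel -> new buffer */
	else
		memcpy(old, buf, size);	/* new buffer -> old kernel */
}
```

The callers would then read `oldmem_copy(buf, old, size, OLDMEM_LOAD)` and
`oldmem_copy(buf, old, size, OLDMEM_SAVE)` with a single body to maintain.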

 +
 +/*
 + * Free the mapped memory for ioremap;
 + */
 +int __iommu_free_mapped_mem(void)
 +{
 + struct iommu_remapped_entry *mem_entry, *tmp;
 +
 + mutex_lock(&__iommu_mem_list_lock);
 + list_for_each_entry_safe(mem_entry, tmp, &__iommu_remapped_mem, list) {
 + iounmap(mem_entry->mem);
 + list_del(&mem_entry->list);
 + kfree(mem_entry);
 + }
 + 

Re: [PATCH v11 06/10] iommu/vt-d: datatypes and functions used for kdump

2015-05-12 Thread Dave Young
The patch subject is bad: the previous patch uses "Items required for kdump",
and this patch uses "datatypes and functions used for kdump"; what is the
difference between these two patches?

Please think of better subjects for these patches.

On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
 Populate it with support functions to copy iommu translation tables from
 from the panicked kernel into the kdump kernel in the event of a crash.
 
 Functions:
 Use old root entry table, and load the old data to root_entry as cache.
 Malloc new context table and copy old context table to the new one.
 
 Bill Sumner:
 Original version, the creation of the data types and functions.
 
 Li, Zhenhua:
 Create new function iommu_check_pre_te_status() to check status.
 Update the caller of context_get_* and context_put*, use context_*
 and context_set_* for replacement.
 Update the name of the function that loads root entry table.
 Use new function to copy old context entry tables and page tables.
 Use unsigned long for physical address.
 Remove the functions to copy page table in Bill's version.
 Remove usage of dve and ppap in Bill's version.
 
 Signed-off-by: Bill Sumner billsumnerli...@gmail.com
 Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
 ---
  drivers/iommu/intel-iommu.c | 121 
 
  include/linux/intel-iommu.h |   3 ++
  2 files changed, 124 insertions(+)
 
 diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
 index 3a5d446..28c3c64 100644
 --- a/drivers/iommu/intel-iommu.c
 +++ b/drivers/iommu/intel-iommu.c
 @@ -386,6 +386,18 @@ struct iommu_remapped_entry {
  static LIST_HEAD(__iommu_remapped_mem);
  static DEFINE_MUTEX(__iommu_mem_list_lock);
  
 +/* 

Please remove the ..

 + * Copy iommu translation tables from old kernel into new  kernel.

There is one space too many between "new" and "kernel".

 + * Entry to this set of functions is: intel_iommu_load_translation_tables()
 + * 

Dropping the above '---' line would be better.

 + */
 +
 +static int copy_root_entry_table(struct intel_iommu *iommu);
 +
 +static int intel_iommu_load_translation_tables(struct intel_iommu *iommu);
 +
 +static void iommu_check_pre_te_status(struct intel_iommu *iommu);
 +
  /*
   * This domain is a statically identity mapping domain.
   *   1. This domain creats a static 1:1 mapping to all usable memory.
 @@ -4987,3 +4999,112 @@ static void __iommu_update_old_root_entry(struct 
 intel_iommu *iommu, int index)
   __iommu_flush_cache(iommu, to + start, size);
  }
  
 +/*
 + * Load root entry tables from old kernel.
 + */
 +static int copy_root_entry_table(struct intel_iommu *iommu)
 +{
 + u32 bus;/* Index: root-entry-table */
 + struct root_entry  *re; /* Virt(iterator: new table) */
 + unsigned long context_old_phys; /* Phys(context table entry) */
 + struct context_entry *context_new_virt; /* Virt(new context_entry) */
 +
 + /*
 +  * A new root entry table has been allocated ,

There is an extra space before the ','.

 +  * we need copy re from old kernel to the new allocated one.
 +  */
 +
 + if (!iommu->root_entry_old_phys)
 + return -ENOMEM;
 +
 + for (bus = 0, re = iommu->root_entry; bus < 256; bus += 1, re += 1) {
 + if (!root_present(re))
 + continue;
 +
 + context_old_phys = get_context_phys_from_root(re);
 +
 + if (!context_old_phys)
 + continue;
 +
 + context_new_virt =
 + (struct context_entry *)alloc_pgtable_page(iommu->node);
 +
 + if (!context_new_virt)
 + return -ENOMEM;
 +
 + __iommu_load_from_oldmem(context_new_virt,
 + context_old_phys,
 + VTD_PAGE_SIZE);
 +
 + __iommu_flush_cache(iommu, context_new_virt, VTD_PAGE_SIZE);
 +
 + set_root_value(re, virt_to_phys(context_new_virt));
 + }
 +
 + return 0;
 +}
 +
 +/*
 + * Interface to the load translation tables set of functions
 + * from mainline code.
 + */
 +static int intel_iommu_load_translation_tables(struct intel_iommu *iommu)
 +{
 + unsigned long long q;   /* quadword scratch */
 + int ret = 0;/* Integer return code */

The comment for 'ret' is not necessary, and 'quadword' is also unnecessary;
just explaining the purpose of 'q' is enough.

 + unsigned long flags;
 +
 + q = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
 + if (!q)
 + return -1;
 +
 + spin_lock_irqsave(&iommu->lock, flags);
 +
 + /* Load the root-entry table from the old kernel
 +  * foreach context_entry_table in root_entry
 +  *   Copy each entry table from old kernel
 +  */
 + if 

Re: [PATCH v11 07/10] iommu/vt-d: enable kdump support in iommu module

2015-05-12 Thread Dave Young
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
 Modify the operation of the following functions when called during crash dump:
 iommu_context_addr
 free_context_table
 get_domain_for_dev
 init_dmars
 intel_iommu_init
 
 Bill Sumner:
 Original version.
 
 Zhenhua:
 The name of new calling functions.
 Do not disable and re-enable TE in kdump kernel.
 Use the did and gaw from old context entry;
 
 Signed-off-by: Bill Sumner billsumnerli...@gmail.com
 Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
 ---
  drivers/iommu/intel-iommu.c | 95 
 +++--
  1 file changed, 83 insertions(+), 12 deletions(-)
 
 diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
 index 28c3c64..91545bf 100644
 --- a/drivers/iommu/intel-iommu.c
 +++ b/drivers/iommu/intel-iommu.c
 @@ -397,6 +397,7 @@ static int copy_root_entry_table(struct intel_iommu 
 *iommu);
  static int intel_iommu_load_translation_tables(struct intel_iommu *iommu);
  
  static void iommu_check_pre_te_status(struct intel_iommu *iommu);
 +static u8 g_translation_pre_enabled;
  
  /*
   * This domain is a statically identity mapping domain.
 @@ -794,6 +795,9 @@ static inline struct context_entry 
 *iommu_context_addr(struct intel_iommu *iommu
   phy_addr = virt_to_phys((void *)context);
   *entry = phy_addr | 1;
   __iommu_flush_cache(iommu, entry, sizeof(*entry));
 +
 + if (iommu->pre_enabled_trans)
 + __iommu_update_old_root_entry(iommu, bus);
   }
  return &context[devfn];
  }
 @@ -887,13 +891,15 @@ static void clear_context_table(struct intel_iommu 
 *iommu, u8 bus, u8 devfn)
  
  static void free_context_table(struct intel_iommu *iommu)
  {
 + struct root_entry *root = NULL;
   int i;
   unsigned long flags;
   struct context_entry *context;
  
  spin_lock_irqsave(&iommu->lock, flags);
  if (!iommu->root_entry) {
 - goto out;
 + spin_unlock_irqrestore(&iommu->lock, flags);
 + return;
   }
   for (i = 0; i  ROOT_ENTRY_NR; i++) {
   context = iommu_context_addr(iommu, i, 0, 0);
 @@ -908,10 +914,23 @@ static void free_context_table(struct intel_iommu 
 *iommu)
   free_pgtable_page(context);
  
   }
 +
 + if (iommu->pre_enabled_trans) {
 + iommu->root_entry_old_phys = 0;
 + root = iommu->root_entry_old_virt;
 + iommu->root_entry_old_virt = NULL;
 + }
 +
  free_pgtable_page(iommu->root_entry);
  iommu->root_entry = NULL;
 -out:
 +
  spin_unlock_irqrestore(&iommu->lock, flags);
 +
 + /* We put this out of spin_unlock is because iounmap() may
 +  * cause error if surrounded by spin_lock and unlock;
 +  */
 + if (iommu->pre_enabled_trans)
 + iounmap(root);
  }
  
  static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
 @@ -2333,6 +2352,7 @@ static struct dmar_domain *get_domain_for_dev(struct 
 device *dev, int gaw)
   unsigned long flags;
   u8 bus, devfn;
   int did = -1;   /* Default to no domain_id supplied */
 + struct context_entry *ce = NULL;
  
   domain = find_domain(dev);
   if (domain)
 @@ -2366,6 +2386,20 @@ static struct dmar_domain *get_domain_for_dev(struct 
 device *dev, int gaw)
   domain = alloc_domain(0);
   if (!domain)
   return NULL;
 +
 + if (iommu->pre_enabled_trans) {
 + /*
 +  * if this device had a did in the old kernel
 +  * use its values instead of generating new ones
 +  */
 + ce = device_to_existing_context_entry(iommu, bus, devfn);
 +
 + if (ce) {
 + did = context_domain_id(ce);
 + gaw = agaw_to_width(context_address_width(ce));
 + }
 + }
 +
  domain->id = iommu_attach_domain_with_id(domain, iommu, did);
  if (domain->id < 0) {
   free_domain_mem(domain);
 @@ -2897,6 +2931,7 @@ static int __init init_dmars(void)
   goto free_g_iommus;
   }
  
 + g_translation_pre_enabled = 0; /* To know whether to skip RMRR */
   for_each_active_iommu(iommu, drhd) {
  g_iommus[iommu->seq_id] = iommu;
  
 @@ -2904,14 +2939,30 @@ static int __init init_dmars(void)
   if (ret)
   goto free_iommu;
  
 - /*
 -  * TBD:
 -  * we could share the same root  context tables
 -  * among all IOMMU's. Need to Split it later.
 -  */
 - ret = iommu_alloc_root_entry(iommu);
 - if (ret)
 - goto free_iommu;
 + iommu_check_pre_te_status(iommu);
 + if (iommu->pre_enabled_trans) {
 + pr_info("IOMMU Copying translate tables from panicked kernel\n");
 + ret = intel_iommu_load_translation_tables(iommu);
 +

Re: [PATCH v11 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-12 Thread Dave Young
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
 This patchset is an update of Bill Sumner's patchset, implements a fix for:
 If a kernel boots with intel_iommu=on on a system that supports intel vt-d, 
 when a panic happens, the kdump kernel will boot with these faults:
 
 dmar: DRHD: handling fault status reg 102
 dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
 DMAR:[fault reason 01] Present bit in root entry is clear
 
 dmar: DRHD: handling fault status reg 2
 dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
 INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
 
 On some system, the interrupt remapping fault will also happen even if the 
 intel_iommu is not set to on, because the interrupt remapping will be enabled 
 when x2apic is needed by the system.
 
 The cause of the DMA fault is described in Bill's original version, and the 
 INTR-Remap fault is caused by a similar reason. In short, the initialization 
 of vt-d drivers causes the in-flight DMA and interrupt requests get wrong 
 response.
 
 To fix this problem, we modifies the behaviors of the intel vt-d in the 
 crashdump kernel:
 
 For DMA Remapping:
 1. To accept the vt-d hardware in an active state,
 2. Do not disable and re-enable the translation, keep it enabled.
 3. Use the old root entry table, do not rewrite the RTA register.
 4. Malloc and use new context entry table, copy data from the old ones that
used by the old kernel.
 5. Keep using the old page tables before driver is loaded.
 6. After device driver is loaded, when it issues the first dma_map command, 
free the dmar_domain structure for this device, and generate a new one, so 
that the device can be assigned a new and empty page table. 
 7. When a new context entry table is generated, we also save its address to 
the old root entry table.
 
 For Interrupt Remapping:
 1. To accept the vt-d hardware in an active state,
 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
 3. Use the old interrupt remapping table, do not rewrite the IRTA register.
 4. When ioapic entry is setup, the interrupt remapping table is changed, and 
the updated data will be stored to the old interrupt remapping table.
 
 Advantages of this approach:
 1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
 2. This approach behaves in a manner very similar to operation without an
active iommu.
 3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
 4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump. 
 5. Minimal code-changes among the existing mainline intel vt-d code.
 
 Summary of changes in this patch set:
 1. Added some useful function for root entry table in code intel-iommu.c
 2. Added new members to struct root_entry and struct irte;
 3. Functions to load old root entry table to iommu-root_entry from the 
 memory 
of old kernel.
 4. Functions to malloc new context entry table and copy the data from the old
ones to the malloced new ones.
 5. Functions to enable support for DMA remapping in kdump kernel.
 6. Functions to load old irte data from the old kernel to the kdump kernel.
 7. Some code changes that support other behaviours that have been listed.
 8. In the new functions, use physical address as unsigned long type, not 
pointers.
 
 Original version by Bill Sumner:
 https://lkml.org/lkml/2014/1/10/518
 https://lkml.org/lkml/2014/4/15/716
 https://lkml.org/lkml/2014/4/24/836
 
 Zhenhua's updates:
 https://lkml.org/lkml/2014/10/21/134
 https://lkml.org/lkml/2014/12/15/121
 https://lkml.org/lkml/2014/12/22/53
 https://lkml.org/lkml/2015/1/6/1166
 https://lkml.org/lkml/2015/1/12/35
 https://lkml.org/lkml/2015/3/19/33
 https://lkml.org/lkml/2015/4/10/135
 
 Changelog[v11]:
 1. Fix some conflicts with the latest upstream kernel.
Add flush in iommu_context_addr
 
 Changelog[v10]:
 1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel().
Use one flag which stores the te and ir status in last kernel:
iommu-pre_enabled_trans
iommu-pre_enabled_ir
 
 Changelog[v9]:
 1. Add new function iommu_attach_domain_with_id.
 2. Do not copy old page tables, keep using the old ones.
 3. Remove functions:
intel_iommu_did_to_domain_values_entry
intel_iommu_get_dids_from_old_kernel
device_to_domain_id
copy_page_addr
copy_page_table
copy_context_entry
copy_context_entry_table
 4. Add new function device_to_existing_context_entry.
 
 Changelog[v8]:
 1. Add a missing __iommu_flush_cache in function copy_page_table.
 
 

Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-11 Thread Dave Young
On 05/11/15 at 12:11pm, Joerg Roedel wrote:
 On Thu, May 07, 2015 at 09:56:00PM +0800, Dave Young wrote:
  Joreg, I can not find the last reply from you, so just reply here about
  my worries here.
  
  I said that the patchset will cause more problems, let me explain about
  it more here:
  
   Suppose page table was corrupted, ie. original mapping iova1 -> page 1
   it was changed to iova1 -> page 2 accidently while crash happening,
  thus future dma will read/write page 2 instead page 1, right?
 
 When the page-table is corrupted then it is a left-over from the old
 kernel. When the kdump kernel boots the situation can at least not get
 worse. For the page tables it is also hard to detect wrong mappings (if
 this would be possible the hardware could already do it), so any checks
 we could do there are of limited use anyway.

Joerg, since neither of you thinks it is a problem, I will not object to it
any more, though I still do not like reusing the old page tables. So let's
leave it as a future issue.
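For the record, the verification idea raised earlier in the thread (a
kexec-style digest of the handed-over tables, checked before reuse) could
look roughly like the sketch below. All names are made up, and a real
implementation would use sha256 as kexec's purgatory does; a simple rolling
checksum keeps the sketch short.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Toy sketch: the old kernel stores a digest of a table at hand-off,
 * and the kdump kernel recomputes it before trusting the table.
 */
static uint32_t table_checksum(const uint8_t *tbl, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + tbl[i];	/* stand-in for sha256 */
	return sum;
}

static int table_intact(const uint8_t *tbl, size_t len, uint32_t saved)
{
	return table_checksum(tbl, len) == saved;
}
```

This only detects corruption that happened after the digest was stored; it
cannot prove the mappings themselves were ever correct, which is why the
thread treats it as a mitigation, not a fix.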

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-07 Thread Dave Young
On 04/07/15 at 10:12am, Don Dutile wrote:
 On 04/06/2015 11:46 PM, Dave Young wrote:
 On 04/05/15 at 09:54am, Baoquan He wrote:
 On 04/03/15 at 05:21pm, Dave Young wrote:
 On 04/03/15 at 05:01pm, Li, ZhenHua wrote:
 Hi Dave,
 
 There may be some possibilities that the old iommu data is corrupted by
 some other modules. Currently we do not have a better solution for the
 dmar faults.
 
 But I think when this happens, we need to fix the module that corrupted
 the old iommu data. I once met a similar problem in normal kernel, the
 queue used by the qi_* functions was written again by another module.
 The fix was in that module, not in iommu module.
 
 It is too late, there will be no chance to save vmcore then.
 
 Also if it is possible to continue corrupt other area of oldmem because
 of using old iommu tables then it will cause more problems.
 
 So I think the tables at least need some verifycation before being used.
 
 
 Yes, it's a good thinking anout this and verification is also an
 interesting idea. kexec/kdump do a sha256 calculation on loaded kernel
 and then verify this again when panic happens in purgatory. This checks
 whether any code stomps into region reserved for kexec/kernel and corrupt
 the loaded kernel.
 
 If this is decided to do it should be an enhancement to current
 patchset but not a approach change. Since this patchset is going very
 close to point as maintainers expected maybe this can be merged firstly,
 then think about enhancement. After all without this patchset vt-d often
 raised error message, hung.
 
 It does not convince me, we should do it right at the beginning instead of
 introduce something wrong.
 
 I wonder why the old dma can not be remap to a specific page in kdump kernel
 so that it will not corrupt more memory. But I may missed something, I will
 looking for old threads and catch up.
 
 Thanks
 Dave
 
 The (only) issue is not corruption, but once the iommu is re-configured, the 
 old,
 not-stopped-yet, dma engines will use iova's that will generate dmar faults, 
 which
 will be enabled when the iommu is re-configured (even to a single/simple 
 paging scheme)
 in the kexec kernel.
 

Don, so if the iommu is not reconfigured, then these faults will not happen?

Baoquan and I got confused today about iommu=off/intel_iommu=off:

intel_iommu_init()
{
...

   dmar_table_init();

   disable active iommu translations;

   if (no_iommu || dmar_disabled)
goto out_free_dmar;

...
}

Any reason not to move the no_iommu check to the beginning of the
intel_iommu_init() function?
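To illustrate the reordering being asked about, here is a toy mock, not the
real function; only the flag names mirror the kernel's no_iommu/dmar_disabled,
and whether dmar_table_init() work is still needed in the disabled case is
exactly the open question.

```c
/* Toy mock of intel_iommu_init() ordering: bail out on the disable
 * flags before doing any table setup. */
static int no_iommu = 1;
static int dmar_disabled;

static int intel_iommu_init_mock(void)
{
	if (no_iommu || dmar_disabled)	/* check moved to the very top */
		return -1;		/* skip table setup entirely */

	/* dmar_table_init() and disabling of active translations
	 * would only run past this point. */
	return 0;
}
```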

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-07 Thread Dave Young
On 05/06/15 at 10:16am, Joerg Roedel wrote:
 On Wed, May 06, 2015 at 09:46:49AM +0800, Dave Young wrote:
  For the original problem, the key issue is that dmar faults cause the kdump
  kernel to hang so that the vmcore can not be saved. I do not know the reason
  why it hangs; I think it is acceptable if the kdump kernel boots ok with some
  dma errors.
 
 It hangs because some devices can't handle the DMAR faults and the kdump
 kernel can't initialize them and will hang itself. For that it doesn't
 matter whether the fault was caused by a read or write request.

Ok, thanks for the explanation. So that explains why sometimes the kdump
kernel boots ok with faults and sometimes it hangs instead.

Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-07 Thread Dave Young
On 05/04/15 at 01:05pm, Joerg Roedel wrote:
 On Fri, Apr 03, 2015 at 04:40:31PM +0800, Dave Young wrote:
  Have not read all the patches, but I have a question; not sure this
  has been answered before. Old memory is not reliable; what if the old
  memory gets corrupted before panic? Is it safe to continue using it in
  the 2nd kernel? I worry that it will cause problems.
 
 Yes, the old memory could be corrupted, and there are more failure cases
 left which we have no way of handling yet (if iommu data structures are
 in kdump backup areas).
 
 The question is what to do if we find some of the old data structures
 corrupted, and how far should the tests go. Should we also check the
 page-tables, for example? I think if some of the data structures for a
 device are corrupted it probably already failed in the old kernel and
 things won't get worse in the new one.

Joerg, I can not find your last reply, so I am just replying here about
my worries.

I said that the patchset will cause more problems; let me explain about
it more here:

Suppose the page table was corrupted, i.e. the original mapping iova1 -> page 1
was accidentally changed to iova1 -> page 2 while the crash was happening;
thus future dma will read/write page 2 instead of page 1, right?

so the behavior changes like this:
originally, dmar faults happen, but the kdump kernel may boot ok with these
faults, and the vmcore can be saved.
with the patchset, dmar faults do not happen, dma translation will be
handled, but a dma write could corrupt irrelevant memory.

This might be a corner case, but who knows; because the kernel panicked we
can not assume the old page table is right.

But it seems you all think it is safe; let us understand each other
first, then go to a solution.

Today we talked with Zhenhua about the problem; I think both of us are clear
about the problems. He just thinks it can be left as future work.

Thanks
Dave


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-05 Thread Dave Young
On 05/05/15 at 05:31pm, Joerg Roedel wrote:
 On Tue, May 05, 2015 at 02:14:23PM +0800, Dave Young wrote:
  The failure is nothing different, but as I said in another reply the
  difference is we could use corrupted data to possibly cause more failure.
 
 I still fail to see how things can get worse than they already are
 by reusing the old data (we just reuse it, we do not modify anything

A DMA write will modify system ram; if the old data is corrupted, it is
possible that a DMA operation modifies the wrong ram regions because of a
wrong mapping. Am I missing something, or is that not possible?

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-05 Thread Dave Young
On 05/05/15 at 05:23pm, Joerg Roedel wrote:
 On Tue, May 05, 2015 at 02:09:31PM +0800, Dave Young wrote:
  I agree that we can do nothing with the old corrupted data, but I worry
  about future corruption after using the old corrupted data. I wonder
  if we can mark all the oldmem as readonly so that we can lower the risk.
  Is it reasonable?
 
 Do you mean marking it read-only for the devices? That will very likely
 cause DMAR faults, re-introducing the problem this patch-set tries to
 fix.

I mean to block all dma writes to oldmem; I believe it will cause DMA errors,
but all other DMA read requests will continue and work. This will avoid
possible future corruption. It would solve at least half of the problem?
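
A rough user-space sketch of that read-only idea: drop the write bit from old
page-table entries that point into oldmem, so reads keep working while writes
fault. The read/write bit positions follow the VT-d second-level PTE format,
but everything else here (the helper, the flat PTE array) is invented for
illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define SL_PTE_READ  (1ULL << 0)  /* VT-d second-level PTE: bit 0 = read  */
#define SL_PTE_WRITE (1ULL << 1)  /* VT-d second-level PTE: bit 1 = write */
#define PAGE_SHIFT   12

/* Clear write permission from every leaf entry whose target page lies in
 * the old kernel's memory [old_start, old_end); reads still translate,
 * writes would raise a fault instead of corrupting oldmem. */
static void protect_oldmem(uint64_t *pte, size_t n,
                           uint64_t old_start, uint64_t old_end)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t pa = pte[i] & ~((1ULL << PAGE_SHIFT) - 1);
        if (pa >= old_start && pa < old_end)
            pte[i] &= ~SL_PTE_WRITE;
    }
}
```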

For the original problem, the key issue is that dmar faults cause the kdump
kernel to hang so that the vmcore can not be saved. I do not know the reason
why it hangs; I think it is acceptable if the kdump kernel boots ok with some
dma errors.

Thanks
Dave


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-05 Thread Dave Young
On 05/04/15 at 06:23pm, Joerg Roedel wrote:
 On Fri, Apr 24, 2015 at 04:49:57PM +0800, Dave Young wrote:
  I'm more than happy to see this issue fixed in the patchset, but I do not
  agree to add the code there with such problems. OTOH, for now it seems
  there's no way to fix it.
 
 And that's the point. We discuss this issue and possible solutions for
 years by now, and what ZhenHua implemented is what we agreed to be the
 best-effort on what we can do in the kdump case with IOMMU enabled.
 
 Of course there are still failure scenarios left, but that is not
 different from systems without any IOMMU.

The failure is nothing different, but as I said in another reply the
difference is we could use corrupted data to possibly cause more failure.

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-05 Thread Dave Young
On 05/04/15 at 01:05pm, Joerg Roedel wrote:
 On Fri, Apr 03, 2015 at 04:40:31PM +0800, Dave Young wrote:
  Have not read all the patches, but I have a question; not sure this
  has been answered before. Old memory is not reliable; what if the old
  memory gets corrupted before panic? Is it safe to continue using it in
  the 2nd kernel? I worry that it will cause problems.
 
 Yes, the old memory could be corrupted, and there are more failure cases
 left which we have no way of handling yet (if iommu data structures are
 in kdump backup areas).
 
 The question is what to do if we find some of the old data structures
 corrupted, and how far should the tests go. Should we also check the
 page-tables, for example? I think if some of the data structures for a
 device are corrupted it probably already failed in the old kernel and
 things won't get worse in the new one.

I agree that we can do nothing with the old corrupted data, but I worry
about future corruption after using the old corrupted data. I wonder
if we can mark all the oldmem as readonly so that we can lower the risk.
Is it reasonable?

Thanks
Dave


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-24 Thread Dave Young
On 04/24/15 at 04:35pm, Baoquan He wrote:
 On 04/24/15 at 04:25pm, Dave Young wrote:
  Hi, Baoquan
  
   I support this patchset.
   
   We should not fear oldmem, since the reserved crashkernel region is similar.
   No one can guarantee that crazy code won't step into the crashkernel
   region just because the 1st kernel says it's reserved for the kdump kernel.
   Here the root table and context tables are also not built to allow legal
   code to damage them. Both of them run the risk of being corrupted; for
   trying our best to get a dumped vmcore, the risk is worth taking.
  
  Old mem is mapped in the 1st kernel, so compared with the reserved
  crashkernel region it is more likely to be corrupted. They are totally
  different.
 
 Could you tell how and why they are different? Will wrong code choose the
 root tables and context tables to damage when it totally loses control?

The iommu will map io addresses to system ram, right? Not to reserved ram.
And yes, I'm assuming the page table is right, but I was worrying that it
could be corrupted while the kernel panic is happening.

 
  
   
   And the pci resetting way has been NACKed by David Woodhouse, the
   maintainer of intel iommu, because the place calling the pci resetting
   code is ugly, whether before the kdump kernel or in the kdump kernel. And
   as he said, a certain device made mistakes, so why do we blame all
   devices? We should fix the device that made the mistakes.
  
  Resetting the pci bus is no uglier than fixing a problem that carries risk,
  and than having to fix the problems it introduces in the future.
 
 There's a problem, we fix the problem. If that's uglier, I need to redefine
 'ugly' in my personal dict. You mean the problem it could introduce is that
 wrong code will damage the root table and context tables; why don't we fix
 that wrong code instead of blaming the innocent context tables? Do you mean
 these tables deserve being damaged by wrong code?

I'm more than happy to see this issue fixed in the patchset, but I do not
agree to add the code there with such problems. OTOH, for now it seems
there's no way to fix it.

 
  
  I know it is late to speak out, but sorry, I still object and have to NACK
  this oldmem approach, from my point of view.
  
  Thanks
  Dave


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-20 Thread Dave Young
Hi,

On 04/21/15 at 09:39am, Li, ZhenHua wrote:
 Hi Dave,
 I found the old mail:
 http://lkml.iu.edu/hypermail/linux/kernel/1410.2/03584.html

I know and I have read it before.

==  quote  ===
   So with this in mind I would prefer initially taking over the
   page-tables from the old kernel before the device drivers re-initialize
   the devices.
 
  This makes the dump kernel more dependent on data from the old kernel,
  which we obviously want to avoid when possible.

 Sure, but this is not really possible here (unless we have a generic and
 reliable way to reset all PCI endpoint devices and cancel all in-flight
 DMA before we disable the IOMMU in the kdump kernel).
 Otherwise we always risk data corruption somewhere, in system memory or
 on disk.
==  end quote  ===

What I understand from the above is that it is not really possible to avoid
the problem.

But IMHO we should avoid it or we will have problems in the future; if we
really cannot avoid it I would say switching to the pci reset way is better.

 
 Please check this and you will find the discussion.
 
 Regards
 Zhenhua
 
 On 04/15/2015 02:48 PM, Dave Young wrote:
 On 04/15/15 at 01:47pm, Li, ZhenHua wrote:
 On 04/15/2015 08:57 AM, Dave Young wrote:
  Again, I think it is bad to use the old page table; the issues below need
  consideration:
  1) make sure the old page tables are reliable across the crash
  2) do not allow writing oldmem after the crash
  
  Please correct me if I'm wrong; if the above is not doable I think I will
  vote for resetting the pci bus.
 
 Thanks
 Dave
 
 Hi Dave,
 
  When updating the context tables, we have to write their address to the
  root tables; this will cause writing to old mem.
 
 Resetting the pci bus has been discussed, please check this:
 http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
 https://lkml.org/lkml/2014/10/21/890
 
  I know one reason to use the old pgtable is that it looks better because it
  fixes the real problem, but it is not a good way if it introduces more
  problems because it has to use oldmem. I will be glad if this is not a
  problem, but I have not been convinced.
 
  OTOH, there are many types of iommu: intel, amd, and a lot of other types.
  They need their own fixes, so it looks not that elegant.
 
  For pci reset, it is not perfect, but it has another advantage: the patch
  is simpler. The problem I see from the old discussion is that resetting the
  bus in the 2nd kernel is acceptable, but it does not fix things on the
  sparc platform. AFAIK the currently reported problems are intel and amd
  iommu; at least the pci reset stuff does not make it worse.
 
 Thanks
 Dave
 
 


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-15 Thread Dave Young
On 04/15/15 at 01:47pm, Li, ZhenHua wrote:
 On 04/15/2015 08:57 AM, Dave Young wrote:
 Again, I think it is bad to use old page table, below issues need consider:
 1) make sure old page table are reliable across crash
 2) do not allow writing oldmem after crash
 
 Please correct me if I'm wrong, or if above is not doable I think I will 
 vote for
 resetting pci bus.
 
 Thanks
 Dave
 
 Hi Dave,
 
 When updating the context tables, we have to write their address to root
 tables, this will cause writing to old mem.
 
 Resetting the pci bus has been discussed, please check this:
 http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
 https://lkml.org/lkml/2014/10/21/890

I know one reason to use the old pgtable is that it looks better because it
fixes the real problem, but it is not a good way if it introduces more
problems because it has to use oldmem. I will be glad if this is not a
problem, but I have not been convinced.

OTOH, there are many types of iommu: intel, amd, and a lot of other types.
They need their own fixes, so it looks not that elegant.

For pci reset, it is not perfect, but it has another advantage: the patch is
simpler. The problem I see from the old discussion is that resetting the bus
in the 2nd kernel is acceptable, but it does not fix things on the sparc
platform. AFAIK the currently reported problems are intel and amd iommu; at
least the pci reset stuff does not make it worse.

Thanks
Dave


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-14 Thread Dave Young
On 04/10/15 at 04:42pm, Li, Zhen-Hua wrote:
 This patchset is an update of Bill Sumner's patchset; it implements a fix
 for the following: if a kernel boots with intel_iommu=on on a system that
 supports intel vt-d, when a panic happens, the kdump kernel will boot with
 these faults:
 
 dmar: DRHD: handling fault status reg 102
 dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
 DMAR:[fault reason 01] Present bit in root entry is clear
 
 dmar: DRHD: handling fault status reg 2
 dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
 INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
 
 On some systems, the interrupt remapping fault will also happen even if
 intel_iommu is not set to on, because interrupt remapping will be enabled
 when x2apic is needed by the system.
 
 The cause of the DMA fault is described in Bill's original version, and the
 INTR-Remap fault is caused by a similar reason. In short, the initialization
 of the vt-d drivers causes the in-flight DMA and interrupt requests to get a
 wrong response.
 
 To fix this problem, we modify the behavior of the intel vt-d in the
 crashdump kernel:
 
 For DMA Remapping:
 1. To accept the vt-d hardware in an active state,
 2. Do not disable and re-enable the translation, keep it enabled.
 3. Use the old root entry table, do not rewrite the RTA register.
 4. Malloc and use a new context entry table, copy data from the old ones
    that were used by the old kernel.
 5. Keep using the old page tables before driver is loaded.
 6. After device driver is loaded, when it issues the first dma_map command, 
free the dmar_domain structure for this device, and generate a new one, so 
that the device can be assigned a new and empty page table. 
 7. When a new context entry table is generated, we also save its address to 
the old root entry table.
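
A rough user-space model of steps 4 and 7 above (allocate a new context table, copy the old entries, then store the new table's address back into the old root entry) might look like this; the struct layouts and the helper name are simplified stand-ins, not the actual intel-iommu code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CTX_ENTRIES 256

/* Simplified models: a real context entry is two 64-bit words; a real
 * root entry keeps a present bit in bit 0 and the context-table address
 * in the upper bits. */
struct context_entry { uint64_t lo, hi; };
struct root_entry    { uint64_t val; };

/* Step 4: malloc a new context table and copy the old kernel's entries.
 * Step 7: save the new table's address into the old root entry, keeping
 * the present bit. */
static struct context_entry *
copy_context_table(struct root_entry *re,
                   const struct context_entry *old_tbl)
{
    struct context_entry *new_tbl = calloc(CTX_ENTRIES, sizeof(*new_tbl));
    if (!new_tbl)
        return NULL;
    memcpy(new_tbl, old_tbl, CTX_ENTRIES * sizeof(*new_tbl));
    re->val = ((uint64_t)(uintptr_t)new_tbl) | (re->val & 1);
    return new_tbl;
}
```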
 
 For Interrupt Remapping:
 1. To accept the vt-d hardware in an active state,
 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
 3. Use the old interrupt remapping table, do not rewrite the IRTA register.
 4. When ioapic entry is setup, the interrupt remapping table is changed, and 
the updated data will be stored to the old interrupt remapping table.
 
 Advantages of this approach:
 1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
 2. This approach behaves in a manner very similar to operation without an
active iommu.
 3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
 4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump. 
 5. Minimal code-changes among the existing mainline intel vt-d code.
 
 Summary of changes in this patch set:
 1. Added some useful function for root entry table in code intel-iommu.c
 2. Added new members to struct root_entry and struct irte;
 3. Functions to load the old root entry table to iommu->root_entry from the
    memory of the old kernel.
 4. Functions to malloc new context entry table and copy the data from the old
ones to the malloced new ones.
 5. Functions to enable support for DMA remapping in kdump kernel.
 6. Functions to load old irte data from the old kernel to the kdump kernel.
 7. Some code changes that support other behaviours that have been listed.
 8. In the new functions, use physical address as unsigned long type, not 
pointers.
 
 Original version by Bill Sumner:
 https://lkml.org/lkml/2014/1/10/518
 https://lkml.org/lkml/2014/4/15/716
 https://lkml.org/lkml/2014/4/24/836
 
 Zhenhua's updates:
 https://lkml.org/lkml/2014/10/21/134
 https://lkml.org/lkml/2014/12/15/121
 https://lkml.org/lkml/2014/12/22/53
 https://lkml.org/lkml/2015/1/6/1166
 https://lkml.org/lkml/2015/1/12/35
 https://lkml.org/lkml/2015/3/19/33
 
 Changelog[v10]:
 1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel().
    Use one flag which stores the TE and IR status in the last kernel:
    iommu->pre_enabled_trans
    iommu->pre_enabled_ir
 
 Changelog[v9]:
 1. Add new function iommu_attach_domain_with_id.
 2. Do not copy old page tables, keep using the old ones.
 3. Remove functions:
intel_iommu_did_to_domain_values_entry
intel_iommu_get_dids_from_old_kernel
device_to_domain_id
copy_page_addr
copy_page_table
copy_context_entry
copy_context_entry_table
 4. Add new function device_to_existing_context_entry.
 
 Changelog[v8]:
 1. Add a missing __iommu_flush_cache in function copy_page_table.
 
 Changelog[v7]:
 1. Use __iommu_flush_cache to flush the data to hardware.
 
 Changelog[v6]:
 1. Use unsigned long as type of physical address.
 2. Use 

Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-07 Thread Dave Young
On 04/07/15 at 11:46am, Dave Young wrote:
 On 04/05/15 at 09:54am, Baoquan He wrote:
  On 04/03/15 at 05:21pm, Dave Young wrote:
   On 04/03/15 at 05:01pm, Li, ZhenHua wrote:
Hi Dave,

There may be some possibilities that the old iommu data is corrupted by
some other modules. Currently we do not have a better solution for the
dmar faults.

But I think when this happens, we need to fix the module that corrupted
the old iommu data. I once met a similar problem in normal kernel, the
queue used by the qi_* functions was written again by another module.
The fix was in that module, not in iommu module.
   
   It is too late, there will be no chance to save vmcore then.
   
   Also if it is possible to continue corrupting other areas of oldmem
   because of using old iommu tables, then it will cause more problems.
   
   So I think the tables at least need some verification before being used.
   
  
  Yes, it's good thinking about this, and verification is also an
  interesting idea. kexec/kdump do a sha256 calculation on the loaded kernel
  and then verify it again when panic happens, in purgatory. This checks
  whether any code stomped into the region reserved for the kexec kernel and
  corrupted the loaded kernel.
  
  If we decide to do this, it should be an enhancement to the current
  patchset, not an approach change. Since this patchset is very close to the
  point the maintainers expected, maybe it can be merged first, then think
  about the enhancement. After all, without this patchset vt-d often
  raised error messages and hung.
 
 It does not convince me; we should do it right at the beginning instead of
 introducing something wrong.
 
 I wonder why the old dma can not be remapped to a specific page in the kdump
 kernel so that it will not corrupt more memory. But I may have missed
 something; I will look through the old threads and catch up.

I have read the old discussion; the above way was dropped because it could
corrupt the filesystem. Apologies for commenting late.

But the current solution sounds bad to me because it uses old memory, which
is not reliable.

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-07 Thread Dave Young
On 04/07/15 at 05:55pm, Li, ZhenHua wrote:
 On 04/07/2015 05:08 PM, Dave Young wrote:
 On 04/07/15 at 11:46am, Dave Young wrote:
 On 04/05/15 at 09:54am, Baoquan He wrote:
 On 04/03/15 at 05:21pm, Dave Young wrote:
 On 04/03/15 at 05:01pm, Li, ZhenHua wrote:
 Hi Dave,
 
 There may be some possibilities that the old iommu data is corrupted by
 some other modules. Currently we do not have a better solution for the
 dmar faults.
 
 But I think when this happens, we need to fix the module that corrupted
 the old iommu data. I once met a similar problem in normal kernel, the
 queue used by the qi_* functions was written again by another module.
 The fix was in that module, not in iommu module.
 
 It is too late, there will be no chance to save vmcore then.
 
 Also if it is possible to continue corrupting other areas of oldmem because
 of using old iommu tables, then it will cause more problems.
 
 So I think the tables at least need some verification before being used.
 
 
 Yes, it's good thinking about this, and verification is also an
 interesting idea. kexec/kdump do a sha256 calculation on the loaded kernel
 and then verify it again when panic happens, in purgatory. This checks
 whether any code stomped into the region reserved for the kexec kernel and
 corrupted the loaded kernel.
 
 If we decide to do this, it should be an enhancement to the current
 patchset, not an approach change. Since this patchset is very close to the
 point the maintainers expected, maybe it can be merged first, then think
 about the enhancement. After all, without this patchset vt-d often
 raised error messages and hung.
 
 It does not convince me; we should do it right at the beginning instead of
 introducing something wrong.
 
 I wonder why the old dma can not be remapped to a specific page in the kdump
 kernel so that it will not corrupt more memory. But I may have missed
 something; I will look through the old threads and catch up.
 
 I have read the old discussion; the above way was dropped because it could
 corrupt the filesystem. Apologies for commenting late.
 
 But the current solution sounds bad to me because it uses old memory, which
 is not reliable.
 
 Thanks
 Dave
 
 It seems we do not have a better solution for the dmar faults. But I believe
 we can find out how to verify the iommu data which is located in old memory.

That will be great, thanks.

So there are two things:
1) make sure the old page tables are right; this is what we were talking about.
2) avoid writing old memory. I suppose only a dma read could corrupt the
filesystem, right? So how about, for any dma writes, just create a scratch
page in 2nd kernel memory, and use the old page table only for dma reads.
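
As a user-space sketch of that idea, reads follow the old mapping so in-flight DMA still finds its buffers, while every write is redirected to one sacrificial page owned by the 2nd kernel. The lookup function and all addresses here are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* One scratch page owned by the kdump (2nd) kernel; address is made up. */
static uint64_t scratch_page = 0x1000;

/* Hypothetical lookup in the old kernel's page table: a toy map that
 * places iova pages at a fixed offset. */
static uint64_t old_pgtable_lookup(uint64_t iova)
{
    return 0x200000 + (iova & ~(PAGE_SIZE - 1));
}

/* Reads use the old mapping; writes all land in the scratch page, so a
 * corrupted old mapping can no longer damage arbitrary memory. */
static uint64_t translate(uint64_t iova, int is_write)
{
    return is_write ? scratch_page : old_pgtable_lookup(iova);
}
```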

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-06 Thread Dave Young
On 04/05/15 at 09:54am, Baoquan He wrote:
 On 04/03/15 at 05:21pm, Dave Young wrote:
  On 04/03/15 at 05:01pm, Li, ZhenHua wrote:
   Hi Dave,
   
   There may be some possibilities that the old iommu data is corrupted by
   some other modules. Currently we do not have a better solution for the
   dmar faults.
   
   But I think when this happens, we need to fix the module that corrupted
   the old iommu data. I once met a similar problem in normal kernel, the
   queue used by the qi_* functions was written again by another module.
   The fix was in that module, not in iommu module.
  
  It is too late, there will be no chance to save vmcore then.
  
  Also if it is possible to continue corrupting other areas of oldmem because
  of using old iommu tables, then it will cause more problems.
  
  So I think the tables at least need some verification before being used.
  
 
 Yes, it's good thinking about this, and verification is also an
 interesting idea. kexec/kdump do a sha256 calculation on the loaded kernel
 and then verify it again when panic happens, in purgatory. This checks
 whether any code stomped into the region reserved for the kexec kernel and
 corrupted the loaded kernel.
 
 If we decide to do this, it should be an enhancement to the current
 patchset, not an approach change. Since this patchset is very close to the
 point the maintainers expected, maybe it can be merged first, then think
 about the enhancement. After all, without this patchset vt-d often
 raised error messages and hung.

It does not convince me; we should do it right at the beginning instead of
introducing something wrong.

I wonder why the old dma can not be remapped to a specific page in the kdump
kernel so that it will not corrupt more memory. But I may have missed
something; I will look through the old threads and catch up.

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-06 Thread Dave Young
On 04/03/15 at 02:05pm, Li, Zhen-Hua wrote:
 The hardware will do some verification, but not completely.  If people think 
 the OS should also do this, then it should be another patchset, I think.

If there is a chance of corrupting more memory, I think it is not the right
way. We should think about a better solution instead of fixing it later.

Thanks
Dave


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-03 Thread Dave Young
On 04/03/15 at 05:01pm, Li, ZhenHua wrote:
 Hi Dave,
 
 There may be some possibilities that the old iommu data is corrupted by
 some other modules. Currently we do not have a better solution for the
 dmar faults.
 
 But I think when this happens, we need to fix the module that corrupted
 the old iommu data. I once met a similar problem in normal kernel, the
 queue used by the qi_* functions was written again by another module.
 The fix was in that module, not in iommu module.

It is too late, there will be no chance to save the vmcore then.

Also if it is possible to continue corrupting other areas of oldmem because
of using old iommu tables, then it will cause more problems.

So I think the tables at least need some verification before being used.
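
As a sketch of that verification idea: the old kernel would record a digest of the tables at setup time, and the kdump kernel would recompute and compare it before trusting them, much like the sha256 check kexec's purgatory performs. FNV-1a stands in for sha256 here, and the helper names are made up:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: FNV-1a standing in for the sha256 that kexec's
 * purgatory uses for its integrity check. */
static uint64_t fnv1a(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;
    while (len--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* The old kernel records recorded_digest when the tables are set up; the
 * kdump kernel recomputes it before reusing the old iommu tables. */
static int table_is_trustworthy(const void *table, size_t len,
                                uint64_t recorded_digest)
{
    return fnv1a(table, len) == recorded_digest;
}
```

Note this only detects corruption that happened after the digest was recorded; it cannot prove the tables were correct to begin with.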

 
 
 Thanks
 Zhenhua
 
 On 04/03/2015 04:40 PM, Dave Young wrote:
  To fix this problem, we modify the behavior of the intel vt-d in the
  crashdump kernel:
 
 For DMA Remapping:
 1. To accept the vt-d hardware in an active state,
 2. Do not disable and re-enable the translation, keep it enabled.
 3. Use the old root entry table, do not rewrite the RTA register.
  4. Malloc and use a new context entry table, copy data from the old ones
     that were used by the old kernel.
 
  Have not read all the patches, but I have a question; not sure this has
  been answered before. Old memory is not reliable; what if the old memory
  gets corrupted before panic? Is it safe to continue using it in the 2nd
  kernel? I worry that it will cause problems.
 
 Hope I'm wrong though.
 
 Thanks
 Dave
 
 
 


Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-03 Thread Dave Young
On 03/19/15 at 01:36pm, Li, Zhen-Hua wrote:
 This patchset is an update of Bill Sumner's patchset; it implements a fix
 for the following: if a kernel boots with intel_iommu=on on a system that
 supports intel vt-d, when a panic happens, the kdump kernel will boot with
 these faults:
 
 dmar: DRHD: handling fault status reg 102
 dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
 DMAR:[fault reason 01] Present bit in root entry is clear
 
 dmar: DRHD: handling fault status reg 2
 dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
 INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
 
 On some systems, the interrupt remapping fault will also happen even if
 intel_iommu is not set to on, because interrupt remapping will be enabled
 when x2apic is needed by the system.
 
 The cause of the DMA fault is described in Bill's original version, and the
 INTR-Remap fault is caused by a similar reason. In short, the initialization
 of the vt-d drivers causes the in-flight DMA and interrupt requests to get a
 wrong response.
 
 To fix this problem, we modify the behavior of the intel vt-d in the
 crashdump kernel:
 
 For DMA Remapping:
 1. To accept the vt-d hardware in an active state,
 2. Do not disable and re-enable the translation, keep it enabled.
 3. Use the old root entry table, do not rewrite the RTA register.
 4. Malloc and use a new context entry table, copy data from the old ones
    that were used by the old kernel.
 5. Keep using the old page tables before driver is loaded.
 6. After device driver is loaded, when it issues the first dma_map command, 
free the dmar_domain structure for this device, and generate a new one, so 
that the device can be assigned a new and empty page table. 
 7. When a new context entry table is generated, we also save its address to 
the old root entry table.
 
 For Interrupt Remapping:
 1. To accept the vt-d hardware in an active state,
 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
 3. Use the old interrupt remapping table, do not rewrite the IRTA register.
 4. When ioapic entry is setup, the interrupt remapping table is changed, and 
the updated data will be stored to the old interrupt remapping table.
 
 Advantages of this approach:
 1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
 2. This approach behaves in a manner very similar to operation without an
active iommu.
 3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
 4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump. 
 5. Minimal code-changes among the existing mainline intel vt-d code.
 
 Summary of changes in this patch set:
 1. Added some useful function for root entry table in code intel-iommu.c
 2. Added new members to struct root_entry and struct irte;
 3. Functions to load the old root entry table to iommu->root_entry from the
    memory of the old kernel.
 4. Functions to malloc new context entry table and copy the data from the old
ones to the malloced new ones.
 5. Functions to enable support for DMA remapping in kdump kernel.
 6. Functions to load old irte data from the old kernel to the kdump kernel.
 7. Some code changes that support other behaviours that have been listed.
 8. In the new functions, use physical address as unsigned long type, not 
pointers.
 
 Original version by Bill Sumner:
 https://lkml.org/lkml/2014/1/10/518
 https://lkml.org/lkml/2014/4/15/716
 https://lkml.org/lkml/2014/4/24/836
 
 Zhenhua's updates:
 https://lkml.org/lkml/2014/10/21/134
 https://lkml.org/lkml/2014/12/15/121
 https://lkml.org/lkml/2014/12/22/53
 https://lkml.org/lkml/2015/1/6/1166
 https://lkml.org/lkml/2015/1/12/35
 
 Changelog[v9]:
 1. Add new function iommu_attach_domain_with_id.
 2. Do not copy old page tables, keep using the old ones.
 3. Remove functions:
intel_iommu_did_to_domain_values_entry
intel_iommu_get_dids_from_old_kernel
device_to_domain_id
copy_page_addr
copy_page_table
copy_context_entry
copy_context_entry_table
 4. Add new function device_to_existing_context_entry.
 
 Changelog[v8]:
 1. Add a missing __iommu_flush_cache in function copy_page_table.
 
 Changelog[v7]:
 1. Use __iommu_flush_cache to flush the data to hardware.
 
 Changelog[v6]:
 1. Use unsigned long as the type for physical addresses.
 2. Use the new function unmap_device_dma to unmap the old dma.
 3. Fixed some small incorrect bit ordering for the aw shift.
 
 Changelog[v5]:
 1. Do not disable and re-enable translation and interrupt remapping.
 2. Use the old root entry table.
 3. Use the old interrupt remapping table.

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-27 Thread Dave Young
On Fri, Jun 14, 2013 at 10:11 AM, Takao Indoh indou.ta...@jp.fujitsu.com wrote:

 (2013/06/13 12:41), Bjorn Helgaas wrote:
  On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh indou.ta...@jp.fujitsu.com
 wrote:
  (2013/06/12 13:45), Bjorn Helgaas wrote:
  [+cc Vivek, Haren; sorry I didn't think to add you earlier]
 
  On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
  indou.ta...@jp.fujitsu.com wrote:
  (2013/06/11 11:20), Bjorn Helgaas wrote:
 
  I'm not sure you need to reset legacy devices (or non-PCI devices)
  yet, but the current hook isn't anchored anywhere -- it's just an
  fs_initcall() that doesn't give the reader any clue about the
  connection between the reset and the problem it's solving.
 
  If we do something like this patch, I think it needs to be done at
 the
  point where we enable or disable the IOMMU.  That way, it's connected
  to the important event, and there's a clue about how to make
  corresponding fixes for other IOMMUs.
 
  Ok. pci_iommu_init() is appropriate place to add this hook?
 
  I looked at various IOMMU init places today, and it's far more
  complicated and varied than I had hoped.
 
  This reset scheme depends on enumerating PCI devices before we
  initialize the IOMMU used by those devices.  x86 works that way today,
  but not all architectures do (see the sparc pci_fire_pbm_init(), for
 
  Sorry, could you tell me which part depends on architecture?
 
  Your patch works if PCIe devices are reset before the kdump kernel
  enables the IOMMU.  On x86, this is possible because PCI enumeration
  happens before the IOMMU initialization.  On sparc, the IOMMU is
  initialized before PCI devices are enumerated, so there would still be
  a window where ongoing DMA could cause an IOMMU error.

 Ok, understood, thanks.

 Hmmm, it seems difficult to find a method which is common to
 all architectures. So, what I can do for now is introduce a reset scheme
 which works only on x86.

 1) Change this patch so that it works only on the x86 platform. For example
call this reset code from x86_init.iommu.iommu_init() instead of
fs_initcall.

 Or another idea is:

 2) Enumerate PCI devices in IOMMU layer. That is:
PCI layer
  Just provide an interface to reset a given struct pci_dev. Maybe
  pci_reset_function() looks good for this purpose.
IOMMU layer
  Determine which devices should be reset. On kernel boot, check if
  IOMMU is already active or not, and if active, check IOMMU page
  table and reset devices whose entry exists there.

  Of course, it might be possible to reorganize the sparc code to do the
  IOMMU init *after* it enumerates PCI devices.  But I think that change
  would be hard to justify.
 
  And I think even on x86, it would be better if we did the IOMMU init
  before PCI enumeration -- the PCI devices depend on the IOMMU, so
  logically the IOMMU should be initialized first so the PCI devices can
  be associated with it as they are enumerated.

 So third idea is:

 3) Do the reset before PCI enumeration (arch_initcall_sync or somewhere). We
need to implement new code to enumerate PCI devices and reset them
for this purpose.

 Idea 2 is not difficult to implement, but one problem is that this
 method may be dangerous. We need to scan the IOMMU page table which was
 used by the previous kernel, but it may be broken. Idea 3 seems to be
 difficult to implement...


 
  example).  And I think conceptually, the IOMMU should be enumerated
  and initialized *before* the devices that use it.
 
  So I'm uncomfortable with that aspect of this scheme.
 
  It would be at least conceivable to reset the devices in the system
  kernel, before the kexec.  I know we want to do as little as possible
  in the crashing kernel, but it's at least a possibility, and it might
  be cleaner.
 
  I bet this will not be accepted by the kdump maintainer. Everything in a panic
  kernel is unreliable.
 
  kdump is inherently unreliable.  The kdump kernel doesn't start from
  an arbitrary machine state.  We don't expect it to tolerate all CPUs
  running, for example.  Maybe it shouldn't be expected to tolerate PCI
  devices running, either.

 What I wanted to say is that any resources of the first kernel are
 unreliable. In a panic situation, the struct pci_dev tree may be broken, or
 pci_lock may already be held by someone, etc. So, if we do this in the first
 kernel, maybe kdump needs its own code to enumerate PCI devices and
 reset them.

 Vivek?


Ping Vivek



 Thanks,
 Takao Indoh





-- 
Regards
Dave