Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address
Hi,

[snip]

> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump

We should only care about kdump.

> I see little direct relation to SEV here. The only reason SEV makes it more
> relevant, is that you need to have an SWIOTLB region available with SEV
> while without you could live with a disabled IOMMU.

There is a comment in arch/x86/kernel/pci-swiotlb.c; it is enforced for some reason:

/*
 * If SME is active then swiotlb will be set to 1 so that bounce
 * buffers are allocated and used for devices that do not support
 * the addressing range required for the encryption mask.
 */
if (sme_active())
	swiotlb = 1;

> However, I can definitely understand how you would want to have a way to
> tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its
> memory for its own SWIOTLB. That way, you don't have to reserve another 64MB
> of RAM for kdump.
>
> What I'm curious on is whether we need to be as elaborate. Can't we just
> pass the old SWIOTLB as free memory to the new kexec'ed kernel and
> everything else will fall into place? All that would take is a bit of
> shuffling on the e820 table pass-through to the kexec'ed kernel, no?

Maybe either of the two is fine. But we may need to ensure these swiotlb areas are reused explicitly in some way. Take the crashkernel=X,high case: the major part is in the above-4G region, with a small piece in low memory. Then, while the kernel is booting, kernel/driver initialization could use up the low memory, so the remaining part left for swiotlb may not be big enough, and the swiotlb allocation finally fails.

Thanks
Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
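The low-memory exhaustion concern above can be illustrated with a small sketch. The 256M figure and the helper name are illustrative assumptions, not kernel code; the point is only that early allocations and the swiotlb compete for the same small low reservation:

```python
# Hypothetical illustration of the crashkernel=X,high concern: a small
# low-memory chunk is reserved alongside the main high reservation, and if
# early kernel/driver allocations eat into it, the 64M swiotlb no longer fits.
DEFAULT_LOW_MB = 256   # assumed default low reservation, for illustration
SWIOTLB_MB = 64        # default swiotlb size

def swiotlb_fits(low_reserved_mb, early_alloc_mb, swiotlb_mb=SWIOTLB_MB):
    """Return True if the remaining low memory can still hold the swiotlb."""
    return low_reserved_mb - early_alloc_mb >= swiotlb_mb

# plenty of headroom left after early allocations
assert swiotlb_fits(DEFAULT_LOW_MB, 100)
# early allocations used too much low memory: swiotlb allocation fails
assert not swiotlb_fits(DEFAULT_LOW_MB, 200)
```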
Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address
On 03/30/20 at 09:40am, Konrad Rzeszutek Wilk wrote: > On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote: > > On Sat, Mar 28, 2020 at 7:57 PM Dave Young wrote: > > > > > > On 03/26/20 at 05:29pm, Alexander Graf wrote: > > > > The swiotlb is a very convenient fallback mechanism for bounce > > > > buffering of > > > > DMAable data. It is usually used for the compatibility case where > > > > devices > > > > can only DMA to a "low region". > > > > > > > > However, in some scenarios this "low region" may be bound even more > > > > heavily. For example, there are embedded system where only an SRAM > > > > region > > > > is shared between device and CPU. There are also heterogeneous computing > > > > scenarios where only a subset of RAM is cache coherent between the > > > > components of the system. There are partitioning hypervisors, where > > > > a "control VM" that implements device emulation has limited view into a > > > > partition's memory for DMA capabilities due to safety concerns. > > > > > > > > This patch adds a command line driven mechanism to move all DMA memory > > > > into > > > > a predefined shared memory region which may or may not be part of the > > > > physical address layout of the Operating System. > > > > > > > > Ideally, the typical path to set this configuration would be through > > > > Device > > > > Tree or ACPI, but neither of the two mechanisms is standardized yet. > > > > Also, > > > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but > > > > instead configure the system purely through kernel command line options. > > > > > > > > I'm sure other people will find the functionality useful going forward > > > > though and extend it to be triggered by DT/ACPI in the future. > > > > > > Hmm, we have a use case for kdump, this maybe useful. 
> > > For example swiotlb is enabled by default if AMD SME/SEV is active,
> > > and in kdump kernel we have to increase the crashkernel reserved size
> > > for the extra swiotlb requirement. I wonder if we can just reuse the
> > > old kernel's swiotlb region and pass the addr to kdump kernel.
> >
> > Yes, definitely helpful for kdump kernel. This can help reduce the
> > crashkernel value.
> >
> > Previously I was thinking about something similar, play around the
> > e820 entry passed to kdump and let it place swiotlb in wanted region.
> > Simply remap it like in this patch looks much cleaner.
> >
> > If this patch is acceptable, one more patch is needed to expose the
> > swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> > second kernel.
>
> We seem to be passing a lot of data to kexec.. Perhaps something
> of a unified way since we seem to have a lot of things to pass - disabling
> IOMMU, ACPI RSDT address, and then this.

The acpi_rsdp kernel cmdline option is not useful anymore. Its initial purpose was for kexec-rebooting on EFI systems, but now we support EFI boot across kexec reboot, so it is useless now.

Thanks
Dave
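A sketch of the kexec-tools side suggested above: scan a /proc/iomem-style listing for the swiotlb region and build the command line option for the second kernel. The resource name "Software IO TLB" is an assumption here (the patch under discussion had not defined an iomem entry yet); the swiotlb=addr= syntax is the one the patch adds:

```python
import re

# Hypothetical sketch, not actual kexec-tools code: find a swiotlb entry in
# a /proc/iomem-style listing and emit the swiotlb=addr= option for the
# kdump/kexec'ed kernel's command line.
def swiotlb_cmdline(iomem_text):
    for line in iomem_text.splitlines():
        m = re.match(r"\s*([0-9a-f]+)-([0-9a-f]+)\s*:\s*Software IO TLB", line)
        if m:
            return "swiotlb=addr=0x%s" % m.group(1)
    return None

iomem = """\
00001000-0009ffff : System RAM
7a000000-7dffffff : Software IO TLB
7e000000-7fffffff : System RAM
"""
assert swiotlb_cmdline(iomem) == "swiotlb=addr=0x7a000000"
```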
Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address
On 03/26/20 at 05:29pm, Alexander Graf wrote: > The swiotlb is a very convenient fallback mechanism for bounce buffering of > DMAable data. It is usually used for the compatibility case where devices > can only DMA to a "low region". > > However, in some scenarios this "low region" may be bound even more > heavily. For example, there are embedded system where only an SRAM region > is shared between device and CPU. There are also heterogeneous computing > scenarios where only a subset of RAM is cache coherent between the > components of the system. There are partitioning hypervisors, where > a "control VM" that implements device emulation has limited view into a > partition's memory for DMA capabilities due to safety concerns. > > This patch adds a command line driven mechanism to move all DMA memory into > a predefined shared memory region which may or may not be part of the > physical address layout of the Operating System. > > Ideally, the typical path to set this configuration would be through Device > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also, > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but > instead configure the system purely through kernel command line options. > > I'm sure other people will find the functionality useful going forward > though and extend it to be triggered by DT/ACPI in the future. Hmm, we have a use case for kdump, this maybe useful. For example swiotlb is enabled by default if AMD SME/SEV is active, and in kdump kernel we have to increase the crashkernel reserved size for the extra swiotlb requirement. I wonder if we can just reuse the old kernel's swiotlb region and pass the addr to kdump kernel. 
> > Signed-off-by: Alexander Graf > --- > Documentation/admin-guide/kernel-parameters.txt | 3 +- > Documentation/x86/x86_64/boot-options.rst | 4 ++- > kernel/dma/swiotlb.c| 46 > +++-- > 3 files changed, 49 insertions(+), 4 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index c07815d230bc..d085d55c3cbe 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -4785,11 +4785,12 @@ > it if 0 is given (See > Documentation/admin-guide/cgroup-v1/memory.rst) > > swiotlb=[ARM,IA-64,PPC,MIPS,X86] > - Format: { | force | noforce } > + Format: { | force | noforce | addr= } >-- Number of I/O TLB slabs > force -- force using of bounce buffers even if they >wouldn't be automatically used by the kernel > noforce -- Never use bounce buffers (for debugging) > + addr= -- Try to allocate SWIOTLB at defined > address > > switches= [HW,M68k] > > diff --git a/Documentation/x86/x86_64/boot-options.rst > b/Documentation/x86/x86_64/boot-options.rst > index 2b98efb5ba7f..ca46c57b68c9 100644 > --- a/Documentation/x86/x86_64/boot-options.rst > +++ b/Documentation/x86/x86_64/boot-options.rst > @@ -297,11 +297,13 @@ iommu options only relevant to the AMD GART hardware > IOMMU: > iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU > implementation: > > -swiotlb=[,force] > +swiotlb=[,force][,addr=] > > Prereserve that many 128K pages for the software IO bounce buffering. >force > Force all IO through the software TLB. 
> + addr= > +Try to allocate SWIOTLB at defined address > > Settings for the IBM Calgary hardware IOMMU currently found in IBM > pSeries and xSeries machines > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index c19379fabd20..83da0caa2f93 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -46,6 +46,7 @@ > #include > #include > #include > +#include > > #define CREATE_TRACE_POINTS > #include > @@ -102,6 +103,12 @@ unsigned int max_segment; > #define INVALID_PHYS_ADDR (~(phys_addr_t)0) > static phys_addr_t *io_tlb_orig_addr; > > +/* > + * The TLB phys addr may be defined on the command line. Store it here if it > is. > + */ > +static phys_addr_t io_tlb_addr = INVALID_PHYS_ADDR; > + > + > /* > * Protect the above data structures in the map and unmap calls > */ > @@ -119,11 +126,23 @@ setup_io_tlb_npages(char *str) > } > if (*str == ',') > ++str; > - if (!strcmp(str, "force")) { > + if (!strncmp(str, "force", 5)) { > swiotlb_force = SWIOTLB_FORCE; > - } else if (!strcmp(str, "noforce")) { > + str += 5; > + } else if (!strncmp(str, "noforce", 7)) { > swiotlb_force = SWIOTLB_NO_FORCE; > io_tlb_nslabs = 1; > + str += 7; > + } > + > + if (*str == ',') > +
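The option grammar the patch introduces, [nslabs][,force|,noforce][,addr=&lt;phys&gt;], can be mirrored in a small parser sketch (Python here, purely illustrative of the strncmp-based parsing in the hunk above; note that noforce also pins nslabs to 1, as in the kernel code):

```python
def parse_swiotlb(arg):
    """Parse a 'swiotlb=' value: [nslabs][,force|,noforce][,addr=<phys>]."""
    opts = {"nslabs": None, "force": None, "addr": None}
    # leading slab count, if any (simple_strtoul equivalent)
    num = ""
    while arg and arg[0].isdigit():
        num += arg[0]
        arg = arg[1:]
    if num:
        opts["nslabs"] = int(num)
    if arg.startswith(","):
        arg = arg[1:]
    # force/noforce, matched by prefix like the strncmp in the patch
    if arg.startswith("force"):
        opts["force"] = True
        arg = arg[5:]
    elif arg.startswith("noforce"):
        opts["force"] = False
        opts["nslabs"] = 1   # noforce sets io_tlb_nslabs = 1
        arg = arg[7:]
    if arg.startswith(","):
        arg = arg[1:]
    # the new addr= suffix added by the patch
    if arg.startswith("addr="):
        opts["addr"] = int(arg[5:], 0)
    return opts

assert parse_swiotlb("force,addr=0x10000000") == \
    {"nslabs": None, "force": True, "addr": 0x10000000}
assert parse_swiotlb("32768,noforce") == \
    {"nslabs": 1, "force": False, "addr": None}
```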
Re: Crash kernel with 256 MB reserved memory runs into OOM condition
On 08/13/19 at 10:46am, Dave Young wrote: > Add more cc. > On 08/13/19 at 10:43am, Dave Young wrote: > > Hi, > > > > On 08/12/19 at 11:50am, Michal Hocko wrote: > > > On Mon 12-08-19 11:42:33, Paul Menzel wrote: > > > > Dear Linux folks, > > > > > > > > > > > > On a Dell PowerEdge R7425 with two AMD EPYC 7601 (total 128 threads) and > > > > 1 TB RAM, the crash kernel with 256 MB of space reserved crashes. > > > > > > > > Please find the messages of the normal and the crash kernel attached. > > > > > > You will need more memory to reserve for the crash kernel because ... > > > > > > > [4.548703] Node 0 DMA free:484kB min:4kB low:4kB high:4kB > > > > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB > > > > unevictable:0kB writepending:0kB present:568kB managed:484kB > > > > mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB > > > > local_pcp:0kB free_cma:0kB > > > > [4.573612] lowmem_reserve[]: 0 125 125 125 > > > > [4.577799] Node 0 DMA32 free:1404kB min:1428kB low:1784kB > > > > high:2140kB active_anon:0kB inactive_anon:0kB active_file:0kB > > > > inactive_file:0kB unevictable:15720kB writepending:0kB present:261560kB > > > > managed:133752kB mlocked:0kB kernel_stack:2496kB pagetables:0kB > > > > bounce:0kB free_pcp:212kB local_pcp:212kB free_cma:0kB > > > > > > ... the memory is really depleted and nothing to be reclaimed (no anon. > > > file pages) Look how tht free memory is below min watermark (node zone > > > DMA has > > > lowmem protection for GFP_KERNEL allocation). > > > > We found similar issue on our side while working on kdump on SME enabled > > systemd. Kairui is working on some patches. > > > > Actually on those SME/SEV enabled machines, swiotlb is enabled > > automatically so at least we need extra 64M+ memory for kdump other > > than the normal expectation. > > > > Can you check if this is also your case? > > The question is to Paul, also it would be always good to cc kexec mail > list for kexec and kdump issues. 
It looks like the hardware iommu is used; maybe you did not enable SME? Also, replacing maxcpus=1 with nr_cpus=1 can save some memory; you can give it a try.
Re: Crash kernel with 256 MB reserved memory runs into OOM condition
Add more cc.

On 08/13/19 at 10:43am, Dave Young wrote:
> Hi,
>
> On 08/12/19 at 11:50am, Michal Hocko wrote:
> > On Mon 12-08-19 11:42:33, Paul Menzel wrote:
> > > Dear Linux folks,
> > >
> > > On a Dell PowerEdge R7425 with two AMD EPYC 7601 (total 128 threads) and
> > > 1 TB RAM, the crash kernel with 256 MB of space reserved crashes.
> > >
> > > Please find the messages of the normal and the crash kernel attached.
> >
> > You will need more memory to reserve for the crash kernel because ...
> >
> > > [4.548703] Node 0 DMA free:484kB min:4kB low:4kB high:4kB
> > > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> > > unevictable:0kB writepending:0kB present:568kB managed:484kB mlocked:0kB
> > > kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> > > free_cma:0kB
> > > [4.573612] lowmem_reserve[]: 0 125 125 125
> > > [4.577799] Node 0 DMA32 free:1404kB min:1428kB low:1784kB high:2140kB
> > > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> > > unevictable:15720kB writepending:0kB present:261560kB managed:133752kB
> > > mlocked:0kB kernel_stack:2496kB pagetables:0kB bounce:0kB free_pcp:212kB
> > > local_pcp:212kB free_cma:0kB
> >
> > ... the memory is really depleted and nothing to be reclaimed (no anon/file
> > pages). Look how the free memory is below the min watermark (node zone DMA
> > has lowmem protection for GFP_KERNEL allocation).
>
> We found a similar issue on our side while working on kdump on SME enabled
> systems. Kairui is working on some patches.
>
> Actually, on those SME/SEV enabled machines swiotlb is enabled
> automatically, so we need at least an extra 64M+ of memory for kdump beyond
> the normal expectation.
>
> Can you check if this is also your case?

The question is for Paul; also, it would always be good to cc the kexec mailing list for kexec and kdump issues.
Re: Crash kernel with 256 MB reserved memory runs into OOM condition
Hi,

On 08/12/19 at 11:50am, Michal Hocko wrote:
> On Mon 12-08-19 11:42:33, Paul Menzel wrote:
> > Dear Linux folks,
> >
> > On a Dell PowerEdge R7425 with two AMD EPYC 7601 (total 128 threads) and
> > 1 TB RAM, the crash kernel with 256 MB of space reserved crashes.
> >
> > Please find the messages of the normal and the crash kernel attached.
>
> You will need more memory to reserve for the crash kernel because ...
>
> > [4.548703] Node 0 DMA free:484kB min:4kB low:4kB high:4kB
> > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> > unevictable:0kB writepending:0kB present:568kB managed:484kB mlocked:0kB
> > kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> > free_cma:0kB
> > [4.573612] lowmem_reserve[]: 0 125 125 125
> > [4.577799] Node 0 DMA32 free:1404kB min:1428kB low:1784kB high:2140kB
> > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> > unevictable:15720kB writepending:0kB present:261560kB managed:133752kB
> > mlocked:0kB kernel_stack:2496kB pagetables:0kB bounce:0kB free_pcp:212kB
> > local_pcp:212kB free_cma:0kB
>
> ... the memory is really depleted and nothing to be reclaimed (no anon/file
> pages). Look how the free memory is below the min watermark (node zone DMA
> has lowmem protection for GFP_KERNEL allocation).

We found a similar issue on our side while working on kdump on SME enabled systems. Kairui is working on some patches.

Actually, on those SME/SEV enabled machines swiotlb is enabled automatically, so we need at least an extra 64M+ of memory for kdump beyond the normal expectation.

Can you check if this is also your case?

Thanks
Dave
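The 64M figure above is simply the swiotlb default size: 32768 slabs of 2KB each. A quick check of the arithmetic (constants as in kernel/dma/swiotlb.c, IO_TLB_SHIFT = 11):

```python
# Why "an extra 64M+" is needed for kdump on SME/SEV machines: the default
# swiotlb bounce buffer is 32768 slabs of 2KB = 64MB.
IO_TLB_SHIFT = 11
default_nslabs = 1 << 15          # 32768 slabs by default
slab_bytes = 1 << IO_TLB_SHIFT    # 2KB per slab

assert default_nslabs * slab_bytes == 64 << 20   # 64MB total
```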
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
Hi Pingfan,

Thanks for the effort.

On 03/01/19 at 11:19am, Pingfan Liu wrote:
> On Fri, Mar 1, 2019 at 11:04 AM Pingfan Liu wrote:
> >
> > Hi Borislav,
> >
> > Do you think the following patch is good at present?
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 81f9d23..9213073 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -460,7 +460,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
> > # define CRASH_ADDR_LOW_MAX	(512 << 20)
> > # define CRASH_ADDR_HIGH_MAX	(512 << 20)
> > #else
> > -# define CRASH_ADDR_LOW_MAX	(896UL << 20)
> > +# define CRASH_ADDR_LOW_MAX	(1 << 32)
> > # define CRASH_ADDR_HIGH_MAX	MAXMEM
> > #endif
> >
> Or the patch looks like:
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3d872a5..ed0def5 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -459,7 +459,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
> # define CRASH_ADDR_LOW_MAX	(512 << 20)
> # define CRASH_ADDR_HIGH_MAX	(512 << 20)
> #else
> -# define CRASH_ADDR_LOW_MAX	(896UL << 20)
> +# define CRASH_ADDR_LOW_MAX	(1 << 32)
> # define CRASH_ADDR_HIGH_MAX	MAXMEM
> #endif
>
> @@ -551,6 +551,15 @@ static void __init reserve_crashkernel(void)
> 					   high ? CRASH_ADDR_HIGH_MAX
> 						: CRASH_ADDR_LOW_MAX,
> 					   crash_size, CRASH_ALIGN);
> +#ifdef CONFIG_X86_64
> +	/*
> +	 * crashkernel=X reserve below 4G fails? Try MAXMEM
> +	 */
> +	if (!high && !crash_base)
> +		crash_base = memblock_find_in_range(CRASH_ALIGN,
> +				CRASH_ADDR_HIGH_MAX,
> +				crash_size, CRASH_ALIGN);
> +#endif
>
> which tries 0-4G, then falls back to 4G and above

This way looks good to me. I will do some testing with old kexec-tools; once the testing is done I can take this up again and repost later with some documentation updates. I will also split it into 2 patches: one to drop the old limitation, another for the fallback.
Thanks
Dave
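The proposed fallback above (try below 4G first, then retry up to MAXMEM) can be modeled with a toy allocator. This is illustrative Python, not kernel code: memblock semantics are simplified to a first-fit scan, and MAXMEM here is just a large bound:

```python
FOUR_GB = 1 << 32
MAXMEM = 1 << 46   # illustrative upper bound, not the real x86_64 MAXMEM

def find_free(regions, limit, size):
    """Toy memblock_find_in_range(): first free region below `limit` that fits."""
    for start, end in regions:
        if end <= limit and end - start >= size:
            return start
    return None

def reserve_crashkernel(free_regions, size, high=False):
    """Mirror the patch's logic: try below 4G, then fall back to MAXMEM."""
    base = find_free(free_regions, MAXMEM if high else FOUR_GB, size)
    if base is None and not high:
        # the fallback the patch adds for crashkernel=X on x86_64
        base = find_free(free_regions, MAXMEM, size)
    return base

# a free 1G region below 4G: the low attempt succeeds
assert reserve_crashkernel([(0x1000000, 0x40000000)], 256 << 20) == 0x1000000
# only free memory above 4G: the low attempt fails, fallback reserves high
assert reserve_crashkernel([(1 << 33, (1 << 33) + (1 << 30))], 256 << 20) == 1 << 33
```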
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
On 02/25/19 at 12:00pm, Joerg Roedel wrote:
> On Fri, Feb 22, 2019 at 02:00:26PM +0100, Borislav Petkov wrote:
> > On Fri, Feb 22, 2019 at 09:42:41AM +0100, Joerg Roedel wrote:
> > > The current default of 256MB was found by experiments on a bigger
> > > number of machines, to create a reasonable default that is at least
> > > likely to be sufficient of an average machine.
> >
> > Exactly, and this is what makes sense.
> >
> > The code should try the requested reservation and if it fails, it should
> > try high allocation with default swiotlb size because we need to reserve
> > *some* range.
>
> Right, makes sense. While at it, maybe it is time to move the default
> allocation policy to 'high' again. The change was reverted six years ago
> because it broke old kexec tools, but those are probably out-of-service
> now. I think this change would make the whole crashdump allocation
> process less fragile.

One concern about this: in the average case, one does not need so much memory for kdump. For example, in RHEL we use crashkernel=auto to automatically reserve kdump kernel memory, and for x86 the reserved sizes are currently:

1G-64G:160M,64G-1T:256M,1T-:512M

That means for a machine with less than 64G of memory we only allocate 160M, and that works for most machines in our lab. If we move to high as the default, it will allocate 160M high + 256M low. That is too much for people who are fine with the default 160M, especially for virtual machines with less memory (but > 4G).

To make the process less fragile, maybe we can remove the 896M limitation and only try <4G, then go to high.

Thanks
Dave
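The range:size table above follows the crashkernel= range syntax, and picking a reservation size for a given amount of RAM is a simple lookup. A sketch of that selection (the helper names are illustrative, not the kernel's parser):

```python
def to_bytes(s):
    """Convert '160M', '64G', '1T' style sizes to bytes."""
    units = {"M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
    return int(s[:-1]) * units[s[-1]]

def parse_crashkernel_ranges(spec):
    """Parse 'range1:size1,range2:size2,...'; an empty end means 'no limit'."""
    out = []
    for part in spec.split(","):
        rng, size = part.split(":")
        start, end = rng.split("-")
        out.append((to_bytes(start),
                    to_bytes(end) if end else float("inf"),
                    to_bytes(size)))
    return out

def pick_size(spec, ram_bytes):
    """Pick the reservation size whose range contains the machine's RAM."""
    for start, end, size in parse_crashkernel_ranges(spec):
        if start <= ram_bytes < end:
            return size
    return 0

SPEC = "1G-64G:160M,64G-1T:256M,1T-:512M"
assert pick_size(SPEC, 32 << 30) == 160 << 20   # 32G machine: 160M reserved
assert pick_size(SPEC, 2 << 40) == 512 << 20    # 2T machine: 512M reserved
```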
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
On 02/24/19 at 09:25pm, Pingfan Liu wrote:
> On Fri, Feb 22, 2019 at 9:00 PM Borislav Petkov wrote:
> >
> > On Fri, Feb 22, 2019 at 09:42:41AM +0100, Joerg Roedel wrote:
> > > The current default of 256MB was found by experiments on a bigger
> > > number of machines, to create a reasonable default that is at least
> > > likely to be sufficient of an average machine.
> >
> > Exactly, and this is what makes sense.
> >
> > The code should try the requested reservation and if it fails, it should
> > try high allocation with default swiotlb size because we need to reserve
> > *some* range.
> >
> > If that reservation succeeds, we should say something along the lines of
> >
> > "... requested range failed, reserved range instead."
>
> Maybe I misunderstood you, but does "requested range failed" mean that
> user specify the range? If yes, then it should be the duty of user as
> you said later, not the duty of kernel"

If you go with the changes in your current patch, it needs to say something like:

"crashkernel: can not find free memory under 4G, reserve XM@.. instead"

We also need to print the reserved low memory area in case ,high is being used. But for the 896M -> 4G change, the 896M failure does not need to show up in dmesg; it is internal kernel logic.

Thanks
Dave
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
On 02/21/19 at 06:13pm, Borislav Petkov wrote:
> On Wed, Feb 20, 2019 at 05:41:46PM +0800, Dave Young wrote:
> > Previously Joerg posted below patch, maybe he has some idea. Joerg?
>
> Isn't it clear from the commit message?

Then, did it answer your question? 256M is set as the default value in the patch, but it is not predicted to satisfy all use cases; from the description it is also possible that some people run out of the 256M, so the ,low and ,high format still needs to exist even if we make crashkernel=X do the allocation automatically in high when it fails in the low area.

crashkernel=X: allocate in low first; if not possible, then allocate in high.

In case people have a lot of devices that need more swiotlb, they can manually set ,high together with ,low.

What's your suggestion then? Remove ,low and ,high and increase the default 256M in case we get failure bug reports?
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
On 02/20/19 at 09:32am, Borislav Petkov wrote:
> On Mon, Feb 18, 2019 at 09:48:20AM +0800, Dave Young wrote:
> > It is ideal if kernel can do it automatically, but I'm not sure if
> > kernel can predict the swiotlb reserved size automatically.
>
> Do you see how even more absurd this gets?
>
> If the kernel cannot know the swiotlb reserved size automatically, how
> is the normal user even supposed to know?!
>
> I see swiotlb_size_or_default() so we have a sane default which we fall
> back to. Now where's the problem with that?

Good question; I expect an answer from people who know more about the background. It would be good to have some actual test results, and Pingfan is trying to do some tests.

Previously Joerg posted the patch below, maybe he has some idea. Joerg?

commit 94fb9334182284e8e7e4bcb9125c25dc33af19d4
Author: Joerg Roedel
Date:   Wed Jun 10 17:49:42 2015 +0200

    x86/crash: Allocate enough low memory when crashkernel=high

Thanks
Dave
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
On 02/15/19 at 11:24am, Borislav Petkov wrote:
> On Tue, Feb 12, 2019 at 04:48:16AM +0800, Dave Young wrote:
> > Even we make it automatic in kernel, but we have to have some default
> > value for swiotlb in case crashkernel can not find a free region under 4G.
> > So this default value can not work for every use cases, people need
> > manually use crashkernel=,low and crashkernel=,high in case
> > crashkernel=X does not work.
>
> Why would the user need to find swiotlb range? The kernel has all the
> information it requires at its finger tips in order to decide properly.
>
> The user wants a crashkernel range, the kernel tries the low range =>
> no workie, then it tries the next range => workie but needs to allocate
> swiotlb range so that DMA can happen too. Doh, then the kernel does
> allocate that too.

It is ideal if the kernel can do it automatically, but I'm not sure the kernel can predict the swiotlb reserved size automatically. Let's add more people to seek comments.

> Why would the user need to do anything here?!
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask
[snip]

> As previously mentioned, there are also many differences between kexec and
> kdump. In general, kexec needs to look at all of available physical memory,
> but kdump doesn't need to.
>
> For kexec, kexec-tools will read /sys/firmware/memmap and recreate the e820
> ranges for the 2nd kernel. If it fails, it will use /proc/iomem.
>
> For kdump, kexec-tools will read /proc/iomem and recreate the e820 ranges
> for the kdump kernel.
> BTW: we can not get the range of persistent memory from /proc/iomem. So e820
> ranges don't contain the persistent memory in kdump kernel; this is the real
> reason why I need to strengthen the logic of adjusting the memory encryption
> mask.

"Persistent memory" is a different thing; I think you meant some reserved memory instead.

> If kexec-tools also use /sys/firmware/memmap for kdump (like kexec), the
> kdump kernel can also work without a fix, but kexec-tools will have to be
> modified. Are you sure that you want me to fix kexec-tools instead of the
> kernel?

Yes, please fix kexec-tools to pass the reserved ranges in e820; you will not need this patch then.

Thanks
Dave
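The suggested kexec-tools fix amounts to collecting the reserved ranges from /proc/iomem and re-creating them as e820 reserved entries for the kdump kernel. A sketch of the collection step (not actual kexec-tools code; the listing format mirrors /proc/iomem):

```python
import re

# Illustrative sketch: pick out top-level "Reserved" ranges from a
# /proc/iomem-style listing, so they could be passed through as
# E820_TYPE_RESERVED entries to the kdump kernel.
def reserved_ranges(iomem_text):
    ranges = []
    for line in iomem_text.splitlines():
        m = re.match(r"\s*([0-9a-f]+)-([0-9a-f]+)\s*:\s*Reserved\s*$", line)
        if m:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges

iomem = """\
00000000-00000fff : Reserved
00001000-0009ffff : System RAM
6286b000-6286efff : Reserved
"""
# 0x6286b000 is the dmi table range from the example later in the thread
assert reserved_ranges(iomem) == [(0x0, 0xfff), (0x6286b000, 0x6286efff)]
```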
Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask
On 09/04/18 at 09:29am, Dave Young wrote: > On 09/04/18 at 08:44am, Dave Young wrote: > > On 09/03/18 at 10:06pm, lijiang wrote: > > > 在 2018年09月03日 10:45, Dave Young 写道: > > > > On 08/31/18 at 04:19pm, Lianbo Jiang wrote: > > > >> For kdump kernel, when SME is enabled, the acpi table and dmi table > > > >> will need > > > >> to be remapped without the memory encryption mask. So we have to > > > >> strengthen > > > >> the logic in early_memremap_pgprot_adjust(), which makes us have an > > > >> opportunity > > > >> to adjust the memory encryption mask. > > > >> > > > >> Signed-off-by: Lianbo Jiang > > > >> --- > > > >> arch/x86/mm/ioremap.c | 9 - > > > >> 1 file changed, 8 insertions(+), 1 deletion(-) > > > >> > > > >> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c > > > >> index e01e6c695add..f9d9a39955f3 100644 > > > >> --- a/arch/x86/mm/ioremap.c > > > >> +++ b/arch/x86/mm/ioremap.c > > > >> @@ -689,8 +689,15 @@ pgprot_t __init > > > >> early_memremap_pgprot_adjust(resource_size_t phys_addr, > > > >>encrypted_prot = true; > > > >> > > > >>if (sme_active()) { > > > >> +/* > > > >> + * In kdump kernel, the acpi table and dmi table will > > > >> need > > > >> + * to be remapped without the memory encryption mask. > > > >> Here > > > >> + * we have to strengthen the logic to adjust the > > > >> memory > > > >> + * encryption mask. > > > > > > > > Assume the acpi/dmi tables are identical for both 1st kernel and kdump > > > > kernel, I'm not sure what is the difference, why need special handling > > > > for kdump. Can you add more explanations? > > > > > > > > > > Ok, i will use a dmi example to explain this issue. > > > > > > There are significant differences about E820 between the 1st kernel and > > > kdump kernel. I pasted them at bottom. > > > > > > Firstly, we need to know how they are called. 
> > > __acpi_map_table()\ > > > / early_memremap_is_setup_data() > > >|-> early_memremap()-> > > > early_memremap_pgprot_adjust()-> | memremap_is_efi_data() > > > dmi_early_remap()/ > > > \ memremap_should_map_decrypted()-> e820__get_entry_type() > > > > > > Secondly, we also need to understand the memremap_should_map_decrypted(), > > > which is illustrated by the fake code. > > > static bool memremap_should_map_decrypted(resource_size_t phys_addr, > > > unsigned long size) > > > { > > > > > > /* code ... */ > > > > > > switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) { > > > case E820_TYPE_RESERVED: > > > case E820_TYPE_ACPI: > > > case E820_TYPE_NVS: > > > case E820_TYPE_UNUSABLE: > > > /* For SEV, these areas are encrypted */ > > > if (sev_active()) > > > break; > > > /* Fallthrough */ > > > > > > case E820_TYPE_PRAM: > > > /* For SME, these areas are decrypted */ > > > return true; > > > default: > > > /* these areas are encrypted by default*/ > > > break; > > > } > > > > > > return false; > > > } > > > > > > For the dmi case, the dmi base address is 0x6286b000 in my test machine. > > > > > > In the 1st kernel, the e820__get_entry_type() can get a valid entry and > > > type by the dmi address, and we can also find the dmi base address from > > > e820. > > > (see the 1st kernel log) > > > 0x6286b000 ∈ [mem 0x6286b000-0x6286efff] > > > So, these areas are decrypted according to the > > > memremap_should_map_decrypted(). > > > > > > In kdump kernel, the dmi base address is still 0x6286b000, but we can not > > > find the dmi base address
Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask
On 09/04/18 at 08:44am, Dave Young wrote: > On 09/03/18 at 10:06pm, lijiang wrote: > > 在 2018年09月03日 10:45, Dave Young 写道: > > > On 08/31/18 at 04:19pm, Lianbo Jiang wrote: > > >> For kdump kernel, when SME is enabled, the acpi table and dmi table will > > >> need > > >> to be remapped without the memory encryption mask. So we have to > > >> strengthen > > >> the logic in early_memremap_pgprot_adjust(), which makes us have an > > >> opportunity > > >> to adjust the memory encryption mask. > > >> > > >> Signed-off-by: Lianbo Jiang > > >> --- > > >> arch/x86/mm/ioremap.c | 9 - > > >> 1 file changed, 8 insertions(+), 1 deletion(-) > > >> > > >> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c > > >> index e01e6c695add..f9d9a39955f3 100644 > > >> --- a/arch/x86/mm/ioremap.c > > >> +++ b/arch/x86/mm/ioremap.c > > >> @@ -689,8 +689,15 @@ pgprot_t __init > > >> early_memremap_pgprot_adjust(resource_size_t phys_addr, > > >> encrypted_prot = true; > > >> > > >> if (sme_active()) { > > >> +/* > > >> + * In kdump kernel, the acpi table and dmi table will > > >> need > > >> + * to be remapped without the memory encryption mask. > > >> Here > > >> + * we have to strengthen the logic to adjust the memory > > >> + * encryption mask. > > > > > > Assume the acpi/dmi tables are identical for both 1st kernel and kdump > > > kernel, I'm not sure what is the difference, why need special handling > > > for kdump. Can you add more explanations? > > > > > > > Ok, i will use a dmi example to explain this issue. > > > > There are significant differences about E820 between the 1st kernel and > > kdump kernel. I pasted them at bottom. > > > > Firstly, we need to know how they are called. 
> > __acpi_map_table()\ > > / early_memremap_is_setup_data() > >|-> early_memremap()-> early_memremap_pgprot_adjust()-> > > | memremap_is_efi_data() > > dmi_early_remap()/ > > \ memremap_should_map_decrypted()-> e820__get_entry_type() > > > > Secondly, we also need to understand the memremap_should_map_decrypted(), > > which is illustrated by the fake code. > > static bool memremap_should_map_decrypted(resource_size_t phys_addr, > > unsigned long size) > > { > > > > /* code ... */ > > > > switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) { > > case E820_TYPE_RESERVED: > > case E820_TYPE_ACPI: > > case E820_TYPE_NVS: > > case E820_TYPE_UNUSABLE: > > /* For SEV, these areas are encrypted */ > > if (sev_active()) > > break; > > /* Fallthrough */ > > > > case E820_TYPE_PRAM: > > /* For SME, these areas are decrypted */ > > return true; > > default: > > /* these areas are encrypted by default*/ > > break; > > } > > > > return false; > > } > > > > For the dmi case, the dmi base address is 0x6286b000 in my test machine. > > > > In the 1st kernel, the e820__get_entry_type() can get a valid entry and > > type by the dmi address, and we can also find the dmi base address from > > e820. > > (see the 1st kernel log) > > 0x6286b000 ∈ [mem 0x6286b000-0x6286efff] > > So, these areas are decrypted according to the > > memremap_should_map_decrypted(). > > > > In kdump kernel, the dmi base address is still 0x6286b000, but we can not > > find the dmi base address from e820 any more. The e820__get_entry_type() can > > not get a valid entry and type by the dmi base address, it will go into the > > default branch. That is to say, these areas become encrypted. In fact, these > > areas are also decrypted, so we have to strengthen the logic of adjusting > > the memory encryption mask. > > > > > > The 1st kernel log: > > > > [0.00] BIOS-provided physical RAM map: > > [0.00] BIOS-e820: [mem 0x-0x0
Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask
On 09/03/18 at 10:06pm, lijiang wrote: > 在 2018年09月03日 10:45, Dave Young 写道: > > On 08/31/18 at 04:19pm, Lianbo Jiang wrote: > >> For kdump kernel, when SME is enabled, the acpi table and dmi table will > >> need > >> to be remapped without the memory encryption mask. So we have to strengthen > >> the logic in early_memremap_pgprot_adjust(), which makes us have an > >> opportunity > >> to adjust the memory encryption mask. > >> > >> Signed-off-by: Lianbo Jiang > >> --- > >> arch/x86/mm/ioremap.c | 9 - > >> 1 file changed, 8 insertions(+), 1 deletion(-) > >> > >> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c > >> index e01e6c695add..f9d9a39955f3 100644 > >> --- a/arch/x86/mm/ioremap.c > >> +++ b/arch/x86/mm/ioremap.c > >> @@ -689,8 +689,15 @@ pgprot_t __init > >> early_memremap_pgprot_adjust(resource_size_t phys_addr, > >>encrypted_prot = true; > >> > >>if (sme_active()) { > >> +/* > >> + * In kdump kernel, the acpi table and dmi table will need > >> + * to be remapped without the memory encryption mask. Here > >> + * we have to strengthen the logic to adjust the memory > >> + * encryption mask. > > > > Assume the acpi/dmi tables are identical for both 1st kernel and kdump > > kernel, I'm not sure what is the difference, why need special handling > > for kdump. Can you add more explanations? > > > > Ok, i will use a dmi example to explain this issue. > > There are significant differences about E820 between the 1st kernel and kdump > kernel. I pasted them at bottom. > > Firstly, we need to know how they are called. > __acpi_map_table()\/ > early_memremap_is_setup_data() >|-> early_memremap()-> early_memremap_pgprot_adjust()-> | > memremap_is_efi_data() > dmi_early_remap()/\ > memremap_should_map_decrypted()-> e820__get_entry_type() > > Secondly, we also need to understand the memremap_should_map_decrypted(), > which is illustrated by the fake code. 
> static bool memremap_should_map_decrypted(resource_size_t phys_addr, > unsigned long size) > { > > /* code ... */ > > switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) { > case E820_TYPE_RESERVED: > case E820_TYPE_ACPI: > case E820_TYPE_NVS: > case E820_TYPE_UNUSABLE: > /* For SEV, these areas are encrypted */ > if (sev_active()) > break; > /* Fallthrough */ > > case E820_TYPE_PRAM: > /* For SME, these areas are decrypted */ > return true; > default: > /* these areas are encrypted by default*/ > break; > } > > return false; > } > > For the dmi case, the dmi base address is 0x6286b000 in my test machine. > > In the 1st kernel, the e820__get_entry_type() can get a valid entry and type > by the dmi address, and we can also find the dmi base address from e820. > (see the 1st kernel log) > 0x6286b000 ∈ [mem 0x6286b000-0x6286efff] > So, these areas are decrypted according to the > memremap_should_map_decrypted(). > > In kdump kernel, the dmi base address is still 0x6286b000, but we can not > find the dmi base address from e820 any more. The e820__get_entry_type() can > not get a valid entry and type by the dmi base address, it will go into the > default branch. That is to say, these areas become encrypted. In fact, these > areas are also decrypted, so we have to strengthen the logic of adjusting the > memory encryption mask. > > > The 1st kernel log: > > [0.00] BIOS-provided physical RAM map: > [0.00] BIOS-e820: [mem 0x-0x0008bfff] usable > [0.00] BIOS-e820: [mem 0x0008c000-0x0009] reserved > [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved > [0.00] BIOS-e820: [mem 0x0010-0x29920fff] usable > [0.00] BIOS-e820: [mem 0x29921000-0x29921fff] reserved > [0.00] BIOS-e820: [mem 0x29922000-0x62256fff] usable > [0.00] BIOS-e820: [mem 0x62257000-0x62356fff] reserved > [0.00] BIOS-e820: [mem 0x62357000-0x6235cfff] ACPI > data >
Re: [PATCH 2/5 V6] x86/ioremap: strengthen the logic in early_memremap_pgprot_adjust() to adjust encryption mask
On 08/31/18 at 04:19pm, Lianbo Jiang wrote: > For kdump kernel, when SME is enabled, the acpi table and dmi table will need > to be remapped without the memory encryption mask. So we have to strengthen > the logic in early_memremap_pgprot_adjust(), which makes us have an > opportunity > to adjust the memory encryption mask. > > Signed-off-by: Lianbo Jiang > --- > arch/x86/mm/ioremap.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c > index e01e6c695add..f9d9a39955f3 100644 > --- a/arch/x86/mm/ioremap.c > +++ b/arch/x86/mm/ioremap.c > @@ -689,8 +689,15 @@ pgprot_t __init > early_memremap_pgprot_adjust(resource_size_t phys_addr, > encrypted_prot = true; > > if (sme_active()) { > +/* > + * In kdump kernel, the acpi table and dmi table will need > + * to be remapped without the memory encryption mask. Here > + * we have to strengthen the logic to adjust the memory > + * encryption mask. Assume the acpi/dmi tables are identical for both 1st kernel and kdump kernel, I'm not sure what is the difference, why need special handling for kdump. Can you add more explanations? > + */ > if (early_memremap_is_setup_data(phys_addr, size) || > - memremap_is_efi_data(phys_addr, size)) > + memremap_is_efi_data(phys_addr, size) || > + is_kdump_kernel()) > encrypted_prot = false; > } > > -- > 2.17.1 > Thanks Dave ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/5 V5] Add a function(ioremap_encrypted) for kdump when AMD sme enabled
On 07/20/18 at 01:06pm, lijiang wrote: > 在 2018年07月14日 01:08, Borislav Petkov 写道: > > On Mon, Jul 09, 2018 at 09:55:35PM +0800, lijiang wrote: > >> About this issue, i want to use an example to describe it. > >> /* drivers/iommu/amd_iommu_init.c */ > >> static u8 __iomem * __init iommu_map_mmio_space(u64 address, u64 end) > > > > Those addresses come from the IVHD header which is an ACPI table. So the > > dump kernel can find that out too. > > Sure. I might understand your means, that will have to find all address out > > in > order to cover any cases in kdump kernel, those address might include MMIO > space, HPET, ACPI device table, ERST, and so on... > > >> Obviously, the iommu mmio space is not encrypted, and the device > >> mmio space is outside kdump kernel. We know that the old memory is > >> encrypted, and the old memory is also outside kdump kernel. For the > >> current case, e820__get_entry_type() and walk_iomem_res_desc() can't > >> get the desired result, so we can't also decide whether encryption > >> or not according to this result(rules). If we want to know whether > >> encryption or not by deducing the address, we will need to read the > >> content of memory and have a reference value for comparison, then > >> what's a reference value? Sometimes we don't know that. > > > > Again, if we don't know that how is the *caller* supposed to know > > whether the memory is encrypted or not? Because > > > > "we" == "caller" > > > > in the kdump kernel. > > > > And the more important question is, why are we dumping MMIO space of the > > previous kernel *at* *all*? That doesn't make any sense to me. > > > Sorry for my late reply. 
> Here, it doesn't need to dump MMIO space of the previous kernel; when the
> kdump kernel boots, the MMIO address will be remapped in decryption manner,
> but the MMIO address doesn't belong to the range of the crash reserved
> memory. For the kdump kernel, the MMIO space (address) and IOMMU device
> table (address) are outside addresses, whereas the IOMMU device table is
> encrypted in the first kernel. The kdump kernel will need to copy the
> content of the IOMMU device table from the first kernel when it boots, so
> the IOMMU device table will be remapped in encryption manner.
> So some of them require to be remapped in encryption manner, and some
> (addresses) require to be remapped in decryption manner.

There could be some misunderstanding here. In the code of
copy_device_table() in amd_iommu_init.c, the iommu table entry is retrieved
by reading an MMIO address, and then memremap() is used to map the entry
address for copying the device table; so the thing related to your patch is
the device table entry address, not the MMIO address.

As for why the old device table needs to be copied, I think it is for
addressing the in-flight DMA issue, just like the git log of the commit
below; the commit is for the Intel IOMMU, but I think the AMD IOMMU
solution is similar:

commit 091d42e43d21b6ca7ec39bf5f9e17bc0bd8d4312
Author: Joerg Roedel
Date:   Fri Jun 12 11:56:10 2015 +0200

    iommu/vt-d: Copy translation tables from old kernel

    If we are in a kdump kernel and find translation enabled in the iommu,
    try to copy the translation tables from the old kernel to preserve the
    mappings until the device driver takes over.

Baoquan knows more about this; I think he can correct me if I'm wrong.

Thanks
Dave
Re: [PATCH 4/4 V3] Help to dump the old memory encrypted into vmcore file
On 06/16/18 at 04:27pm, Lianbo Jiang wrote: > In kdump mode, we need to dump the old memory into vmcore file, > if SME is enabled in the first kernel, we must remap the old > memory in encrypted manner, which will be automatically decrypted > when we read from DRAM. It helps to parse the vmcore for some tools. > > Signed-off-by: Lianbo Jiang > --- > Some changes: > 1. add a new file and modify Makefile. > 2. remove some code in sev_active(). > > arch/x86/kernel/Makefile | 1 + > arch/x86/kernel/crash_dump_encrypt.c | 53 > > fs/proc/vmcore.c | 20 ++ > include/linux/crash_dump.h | 11 > 4 files changed, 79 insertions(+), 6 deletions(-) > create mode 100644 arch/x86/kernel/crash_dump_encrypt.c > > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile > index 02d6f5c..afb5bad 100644 > --- a/arch/x86/kernel/Makefile > +++ b/arch/x86/kernel/Makefile > @@ -96,6 +96,7 @@ obj-$(CONFIG_KEXEC_CORE)+= machine_kexec_$(BITS).o > obj-$(CONFIG_KEXEC_CORE) += relocate_kernel_$(BITS).o crash.o > obj-$(CONFIG_KEXEC_FILE) += kexec-bzimage64.o > obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o > +obj-$(CONFIG_AMD_MEM_ENCRYPT)+= crash_dump_encrypt.o > obj-y+= kprobes/ > obj-$(CONFIG_MODULES)+= module.o > obj-$(CONFIG_DOUBLEFAULT)+= doublefault.o > diff --git a/arch/x86/kernel/crash_dump_encrypt.c > b/arch/x86/kernel/crash_dump_encrypt.c > new file mode 100644 > index 000..e44ef33 > --- /dev/null > +++ b/arch/x86/kernel/crash_dump_encrypt.c > @@ -0,0 +1,53 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Memory preserving reboot related code. > + * > + * Created by: Lianbo Jiang (liji...@redhat.com) > + * Copyright (C) RedHat Corporation, 2018. 
All rights reserved > + */ > + > +#include > +#include > +#include > +#include > + > +/** > + * copy_oldmem_page_encrypted - copy one page from "oldmem encrypted" > + * @pfn: page frame number to be copied > + * @buf: target memory address for the copy; this can be in kernel address > + * space or user address space (see @userbuf) > + * @csize: number of bytes to copy > + * @offset: offset in bytes into the page (based on pfn) to begin the copy > + * @userbuf: if set, @buf is in user address space, use copy_to_user(), > + * otherwise @buf is in kernel address space, use memcpy(). > + * > + * Copy a page from "oldmem encrypted". For this page, there is no pte > + * mapped in the current kernel. We stitch up a pte, similar to > + * kmap_atomic. > + */ > + > +ssize_t copy_oldmem_page_encrypted(unsigned long pfn, char *buf, > + size_t csize, unsigned long offset, int userbuf) > +{ > + void *vaddr; > + > + if (!csize) > + return 0; > + > + vaddr = (__force void *)ioremap_encrypted(pfn << PAGE_SHIFT, > + PAGE_SIZE); > + if (!vaddr) > + return -ENOMEM; > + > + if (userbuf) { > + if (copy_to_user((void __user *)buf, vaddr + offset, csize)) { > + iounmap((void __iomem *)vaddr); > + return -EFAULT; > + } > + } else > + memcpy(buf, vaddr + offset, csize); > + > + set_iounmap_nonlazy(); > + iounmap((void __iomem *)vaddr); > + return csize; > +} > diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c > index a45f0af..5200266 100644 > --- a/fs/proc/vmcore.c > +++ b/fs/proc/vmcore.c > @@ -25,6 +25,8 @@ > #include > #include > #include "internal.h" > +#include > +#include > > /* List representing chunks of contiguous memory areas and their offsets in > * vmcore file. > @@ -86,7 +88,8 @@ static int pfn_is_ram(unsigned long pfn) > > /* Reads a page from the oldmem device from given offset. 
*/ > static ssize_t read_from_oldmem(char *buf, size_t count, > - u64 *ppos, int userbuf) > + u64 *ppos, int userbuf, > + bool encrypted) > { > unsigned long pfn, offset; > size_t nr_bytes; > @@ -108,8 +111,11 @@ static ssize_t read_from_oldmem(char *buf, size_t count, > if (pfn_is_ram(pfn) == 0) > memset(buf, 0, nr_bytes); > else { > - tmp = copy_oldmem_page(pfn, buf, nr_bytes, > - offset, userbuf); > + tmp = encrypted ? copy_oldmem_page_encrypted(pfn, > + buf, nr_bytes, offset, userbuf) > + : copy_oldmem_page(pfn, buf, nr_bytes, > +offset, userbuf); > + > if (tmp < 0) > return tmp; > } > @@ -143,7 +149,7 @@ void __weak elfcorehdr_free(unsigned long long addr) > */ >
Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption
Ccing Xunlei; he is reading the patches to see what needs to be done for
kdump. There should still be several places to handle to make kdump work.

On 05/18/17 at 07:01pm, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 04:22:12PM -0500, Tom Lendacky wrote:
> > Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> > determine if SME is active.
>
> But why do user-space tools need to know that?
>
> I mean, when we load the kdump kernel, we do it with the first kernel,
> with the kexec_load() syscall, AFAICT. And that code does a lot of
> things during that init, like machine_kexec_prepare()->init_pgtable() to
> prepare the ident mapping of the second kernel, for example.
>
> What I'm aiming at is that the first kernel knows *exactly* whether SME
> is enabled or not and doesn't need to tell the second one through some
> sysfs entries - it can do that during loading.
>
> So I don't think we need any userspace things at all...

If the kdump kernel can get the SME status from a hardware register, then
this should not be necessary and this patch can be dropped.

Thanks
Dave
Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption
On 04/27/17 at 08:52am, Dave Hansen wrote:
> On 04/27/2017 12:25 AM, Dave Young wrote:
> > On 04/21/17 at 02:55pm, Dave Hansen wrote:
> >> On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> >>> Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> >>> determine if SME is active.
> >>>
> >>> A new directory will be created:
> >>>   /sys/kernel/mm/sme/
> >>>
> >>> And two entries within the new directory:
> >>>   /sys/kernel/mm/sme/active
> >>>   /sys/kernel/mm/sme/encryption_mask
> >>
> >> Why do they care, and what will they be doing with this information?
> >
> > Since kdump will copy old memory, it needs this to know whether the old
> > memory was encrypted or not. With this sysfs file we can know the
> > previous SME status and pass it to the kdump kernel, e.g. as a kernel
> > parameter.
> >
> > Tom, have you got a chance to try whether it works or not?
>
> What will the kdump kernel do with it though? We kexec() into that
> kernel so the SME keys will all be the same, right? So, will the kdump
> kernel be just setting the encryption bit in the PTE so it can copy the
> old plaintext out?

I assume it is for the active -> non-active case: the new boot needs to
know that the old memory is encrypted. But I have not read all the
patches, so I may be missing things.

> Why do we need both 'active' and 'encryption_mask'? How could it be
> that the hardware-enumerated 'encryption_mask' changes across a kexec()?

I'll leave this question to Tom..

Thanks
Dave
Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption
On 04/21/17 at 02:55pm, Dave Hansen wrote:
> On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> > Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> > determine if SME is active.
> >
> > A new directory will be created:
> >   /sys/kernel/mm/sme/
> >
> > And two entries within the new directory:
> >   /sys/kernel/mm/sme/active
> >   /sys/kernel/mm/sme/encryption_mask
>
> Why do they care, and what will they be doing with this information?

Since kdump will copy old memory, it needs this to know whether the old
memory was encrypted or not. With this sysfs file we can know the previous
SME status and pass it to the kdump kernel, e.g. as a kernel parameter.

Tom, have you got a chance to try whether it works or not?

Thanks
Dave
Re: [RFC PATCH v4 25/28] x86: Access the setup data through sysfs decrypted
On 02/16/17 at 09:47am, Tom Lendacky wrote: > Use memremap() to map the setup data. This will make the appropriate > decision as to whether a RAM remapping can be done or if a fallback to > ioremap_cache() is needed (similar to the setup data debugfs support). > > Signed-off-by: Tom Lendacky <thomas.lenda...@amd.com> > --- > arch/x86/kernel/ksysfs.c | 27 ++- > 1 file changed, 14 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/kernel/ksysfs.c b/arch/x86/kernel/ksysfs.c > index 4afc67f..d653b3e 100644 > --- a/arch/x86/kernel/ksysfs.c > +++ b/arch/x86/kernel/ksysfs.c > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -79,12 +80,12 @@ static int get_setup_data_paddr(int nr, u64 *paddr) > *paddr = pa_data; > return 0; > } > - data = ioremap_cache(pa_data, sizeof(*data)); > + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB); > if (!data) > return -ENOMEM; > > pa_data = data->next; > - iounmap(data); > + memunmap(data); > i++; > } > return -EINVAL; > @@ -97,17 +98,17 @@ static int __init get_setup_data_size(int nr, size_t > *size) > u64 pa_data = boot_params.hdr.setup_data; > > while (pa_data) { > - data = ioremap_cache(pa_data, sizeof(*data)); > + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB); > if (!data) > return -ENOMEM; > if (nr == i) { > *size = data->len; > - iounmap(data); > + memunmap(data); > return 0; > } > > pa_data = data->next; > - iounmap(data); > + memunmap(data); > i++; > } > return -EINVAL; > @@ -127,12 +128,12 @@ static ssize_t type_show(struct kobject *kobj, > ret = get_setup_data_paddr(nr, ); > if (ret) > return ret; > - data = ioremap_cache(paddr, sizeof(*data)); > + data = memremap(paddr, sizeof(*data), MEMREMAP_WB); > if (!data) > return -ENOMEM; > > ret = sprintf(buf, "0x%x\n", data->type); > - iounmap(data); > + memunmap(data); > return ret; > } > > @@ -154,7 +155,7 @@ static ssize_t setup_data_data_read(struct file *fp, > ret = get_setup_data_paddr(nr, ); > if (ret) > return ret; > - 
data = ioremap_cache(paddr, sizeof(*data)); > + data = memremap(paddr, sizeof(*data), MEMREMAP_WB); > if (!data) > return -ENOMEM; > > @@ -170,15 +171,15 @@ static ssize_t setup_data_data_read(struct file *fp, > goto out; > > ret = count; > - p = ioremap_cache(paddr + sizeof(*data), data->len); > + p = memremap(paddr + sizeof(*data), data->len, MEMREMAP_WB); > if (!p) { > ret = -ENOMEM; > goto out; > } > memcpy(buf, p + off, count); > - iounmap(p); > + memunmap(p); > out: > - iounmap(data); > + memunmap(data); > return ret; > } > > @@ -250,13 +251,13 @@ static int __init get_setup_data_total_num(u64 pa_data, > int *nr) > *nr = 0; > while (pa_data) { > *nr += 1; > - data = ioremap_cache(pa_data, sizeof(*data)); > + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB); > if (!data) { > ret = -ENOMEM; > goto out; > } > pa_data = data->next; > - iounmap(data); > + memunmap(data); > } > > out: > It would be better that these cleanup patches are sent separately. Acked-by: Dave Young <dyo...@redhat.com> Thanks Dave ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v4 24/28] x86: Access the setup data through debugfs decrypted
On 02/16/17 at 09:47am, Tom Lendacky wrote: > Use memremap() to map the setup data. This simplifies the code and will > make the appropriate decision as to whether a RAM remapping can be done > or if a fallback to ioremap_cache() is needed (which includes checking > PageHighMem). > > Signed-off-by: Tom Lendacky> --- > arch/x86/kernel/kdebugfs.c | 30 +++--- > 1 file changed, 11 insertions(+), 19 deletions(-) > > diff --git a/arch/x86/kernel/kdebugfs.c b/arch/x86/kernel/kdebugfs.c > index bdb83e4..c3d354d 100644 > --- a/arch/x86/kernel/kdebugfs.c > +++ b/arch/x86/kernel/kdebugfs.c > @@ -48,17 +48,13 @@ static ssize_t setup_data_read(struct file *file, char > __user *user_buf, > > pa = node->paddr + sizeof(struct setup_data) + pos; > pg = pfn_to_page((pa + count - 1) >> PAGE_SHIFT); > - if (PageHighMem(pg)) { > - p = ioremap_cache(pa, count); > - if (!p) > - return -ENXIO; > - } else > - p = __va(pa); > + p = memremap(pa, count, MEMREMAP_WB); > + if (!p) > + return -ENXIO; -ENOMEM looks better for memremap, ditto for other places.. 
> > remain = copy_to_user(user_buf, p, count); > > - if (PageHighMem(pg)) > - iounmap(p); > + memunmap(p); > > if (remain) > return -EFAULT; > @@ -127,15 +123,12 @@ static int __init create_setup_data_nodes(struct dentry > *parent) > } > > pg = pfn_to_page((pa_data+sizeof(*data)-1) >> PAGE_SHIFT); > - if (PageHighMem(pg)) { > - data = ioremap_cache(pa_data, sizeof(*data)); > - if (!data) { > - kfree(node); > - error = -ENXIO; > - goto err_dir; > - } > - } else > - data = __va(pa_data); > + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB); > + if (!data) { > + kfree(node); > + error = -ENXIO; > + goto err_dir; > + } > > node->paddr = pa_data; > node->type = data->type; > @@ -143,8 +136,7 @@ static int __init create_setup_data_nodes(struct dentry > *parent) > error = create_setup_data_node(d, no, node); > pa_data = data->next; > > - if (PageHighMem(pg)) > - iounmap(data); > + memunmap(data); > if (error) > goto err_dir; > no++; > Thanks Dave ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v4 14/28] Add support to access boot related data in the clear
On 02/16/17 at 09:45am, Tom Lendacky wrote:
[snip]
> + * This function determines if an address should be mapped encrypted.
> + * Boot setup data, EFI data and E820 areas are checked in making this
> + * determination.
> + */
> +static bool memremap_should_map_encrypted(resource_size_t phys_addr,
> +					  unsigned long size)
> +{
> +	/*
> +	 * SME is not active, return true:
> +	 * - For early_memremap_pgprot_adjust(), returning true or false
> +	 *   results in the same protection value
> +	 * - For arch_memremap_do_ram_remap(), returning true will allow
> +	 *   the RAM remap to occur instead of falling back to ioremap()
> +	 */
> +	if (!sme_active())
> +		return true;

From the function name, shouldn't the above return false?

> +
> +	/* Check if the address is part of the setup data */
> +	if (memremap_is_setup_data(phys_addr, size))
> +		return false;
> +
> +	/* Check if the address is part of EFI boot/runtime data */
> +	switch (efi_mem_type(phys_addr)) {
> +	case EFI_BOOT_SERVICES_DATA:
> +	case EFI_RUNTIME_SERVICES_DATA:

Only these two types needed? I'm not sure about this, just bringing up the
question.

> +		return false;
> +	default:
> +		break;
> +	}
> +
> +	/* Check if the address is outside kernel usable area */
> +	switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
> +	case E820_TYPE_RESERVED:
> +	case E820_TYPE_ACPI:
> +	case E820_TYPE_NVS:
> +	case E820_TYPE_UNUSABLE:
> +		return false;
> +	default:
> +		break;
> +	}
> +
> +	return true;
> +}
> +

Thanks
Dave
Re: [RFC PATCH v4 26/28] x86: Allow kexec to be used with SME
Add kexec list.. On 03/01/17 at 05:25pm, Dave Young wrote: > Hi Tom, > > On 02/17/17 at 10:43am, Tom Lendacky wrote: > > On 2/17/2017 9:57 AM, Konrad Rzeszutek Wilk wrote: > > > On Thu, Feb 16, 2017 at 09:47:55AM -0600, Tom Lendacky wrote: > > > > Provide support so that kexec can be used to boot a kernel when SME is > > > > enabled. > > > > > > Is the point of kexec and kdump to ehh, dump memory ? But if the > > > rest of the memory is encrypted you won't get much, will you? > > > > Kexec can be used to reboot a system without going back through BIOS. > > So you can use kexec without using kdump. > > > > For kdump, just taking a quick look, the option to enable memory > > encryption can be provided on the crash kernel command line and then > > Is there a simple way to get the SME status? Probably add some sysfs > file for this purpose. > > > crash kernel can would be able to copy the memory decrypted if the > > pagetable is set up properly. It looks like currently ioremap_cache() > > is used to map the old memory page. That might be able to be changed > > to a memremap() so that the encryption bit is set in the mapping. That > > will mean that memory that is not marked encrypted (EFI tables, swiotlb > > memory, etc) would not be read correctly. > > Manage to store info about those ranges which are not encrypted so that > memremap can handle them? > > > > > > > > > Would it make sense to include some printk to the user if they > > > are setting up kdump that they won't get anything out of it? > > > > Probably a good idea to add something like that. > > It will break kdump functionality, it should be fixed instead of > just adding printk to warn user.. > > Thanks > Dave ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v4 26/28] x86: Allow kexec to be used with SME
Hi Tom, On 02/17/17 at 10:43am, Tom Lendacky wrote: > On 2/17/2017 9:57 AM, Konrad Rzeszutek Wilk wrote: > > On Thu, Feb 16, 2017 at 09:47:55AM -0600, Tom Lendacky wrote: > > > Provide support so that kexec can be used to boot a kernel when SME is > > > enabled. > > > > Is the point of kexec and kdump to ehh, dump memory ? But if the > > rest of the memory is encrypted you won't get much, will you? > > Kexec can be used to reboot a system without going back through BIOS. > So you can use kexec without using kdump. > > For kdump, just taking a quick look, the option to enable memory > encryption can be provided on the crash kernel command line and then Is there a simple way to get the SME status? Probably add some sysfs file for this purpose. > crash kernel can would be able to copy the memory decrypted if the > pagetable is set up properly. It looks like currently ioremap_cache() > is used to map the old memory page. That might be able to be changed > to a memremap() so that the encryption bit is set in the mapping. That > will mean that memory that is not marked encrypted (EFI tables, swiotlb > memory, etc) would not be read correctly. Manage to store info about those ranges which are not encrypted so that memremap can handle them? > > > > > Would it make sense to include some printk to the user if they > > are setting up kdump that they won't get anything out of it? > > Probably a good idea to add something like that. It will break kdump functionality, it should be fixed instead of just adding printk to warn user.. Thanks Dave ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v11 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
Hi,

On 05/18/15 at 06:05pm, Li, ZhenHua wrote:
> Hi Joerg,
>
> Testing from HP: passed. Testing from He Baoquan (Redhat): passed. The
> problem that the dmar fault came back when running v10 with the latest
> kernel is fixed. And I think there is no need to update the code to a
> new version now. So please tell me if there is still any code that
> needs to be changed.

Have you reviewed all my comments? Although these are only iommu driver
fixes, the patches were sent to lkml, no? If you do not want other
people's comments, why send them to us? I would say do not assume it is a
closed circle for only the iommu driver.

It is not easy to review patches actually; frankly I feel kernel code
review is not as strict as in the early days, and maybe one reason is that
people are getting tired of arguing.

Thanks
Dave
Re: [PATCH v11 05/10] iommu/vt-d: Add functions to load and save old re
On 05/13/15 at 09:47am, Li, ZhenHua wrote:
> On 05/12/2015 04:37 PM, Dave Young wrote:
> > Seems the subject was truncated? Maybe "re" means root entry? Then
> > please fix it.
> >
> > On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
> > > Add functions to load root entry table from old kernel, and to save
> > > updated root entry table.
> > > Add two members in struct intel_iommu, to store the RTA in the old
> > > kernel, and its mapped virt address.
> >
> > Please explain a bit what RTA is in the patch log; it is unclear to
> > most people who do not know iommu details.
>
> If people want to understand this patchset, I assume they have read the
> vt-d specs. RTA is defined clearly in this spec.

I think explaining it a bit is better; it will be easier for review. RTA
is defined in the spec, right, but if you refer to it in kernel source
code, describing the meaning will help people understand your code.

I would not stick on this 'RTA', but a descriptive subject and patch log
is still important; please check the logs, not only for this patch.

Thanks
Dave
Re: [PATCH v11 02/10] iommu/vt-d: Items required for kdump
On 05/13/15 at 09:45am, Li, ZhenHua wrote:
> On 05/12/2015 04:17 PM, Dave Young wrote:
> > On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
> > > Add context entry functions needed for kdump.
> > >
> > > +/*
> > > + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
> > > + *
> > > + * Fixes the crashdump kernel to deal with an active iommu and legacy
> > > + * DMA from the (old) panicked kernel in a manner similar to how legacy
> > > + * DMA is handled when no hardware iommu was in use by the old kernel --
> > > + * allow the legacy DMA to continue into its current buffers.
> > > + *
> > > + * In the crashdump kernel, this code:
> > > + * 1. skips disabling the IOMMU's translating.
> > > + * 2. Do not re-enable IOMMU's translating.
> > > + * 3. In kdump kernel, use the old root entry table.
> > > + * 4. Allocate pages for new context entry, copy data from old context entries
> > > + *    in the old kernel to the new ones.
> > > + *
> > > + * In other kinds of kernel, for example, a kernel started by kexec,
> > > + * do the same thing as crashdump kernel.
> > > + */
> > > +
> > > +
> >
> > Above comments should come along with the code changes instead of
> > putting it in this patch.
> >
> > BTW, there's one more blank line at the end..
>
> Code change is in 00/10, the cover letter.

I mean the real code, not the changelog.

> And the blank does not matter, I checked the patch with
> scripts/checkpatch.pl, no warnings, no errors.

Why add two line breaks if one is enough? Adding such a check to
checkpatch.pl makes sense to me, actually.

Thanks
Dave
Re: [PATCH v11 05/10] iommu/vt-d: Add functions to load and save old re
Seems the subject was truncated? Maybe re means root entry? Then please fix it On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote: Add functions to load root entry table from old kernel, and to save updated root entry table. Add two member in struct intel_iommu, to store the RTA in old kernel, and the mapped virt address of it. Please explain a bit what is RTA in patch log, it is unclear to most of people who do not know iommu details. We use the old RTA in dump kernel, and when the iommu-root_entry is used as a cache in kdump kernel, its phys address will not be save to RTA register, but when its data is changed, we will save the new data to old root entry table. Li, Zhen-hua: The functions and logics. Takao Indoh: Add __iommu_flush_cache. Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com Signed-off-by: Takao Indoh indou.ta...@jp.fujitsu.com --- drivers/iommu/intel-iommu.c | 54 - include/linux/intel-iommu.h | 3 +++ 2 files changed, 56 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 0b97c15..3a5d446 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -371,6 +371,10 @@ static struct context_entry *device_to_existing_context_entry( struct intel_iommu *iommu, u8 bus, u8 devfn); +static void __iommu_load_old_root_entry(struct intel_iommu *iommu); + +static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int index); + /* * A structure used to store the address allocated by ioremap(); * The we need to call iounmap() to free them out of spin_lock_irqsave/unlock; @@ -382,7 +386,6 @@ struct iommu_remapped_entry { static LIST_HEAD(__iommu_remapped_mem); static DEFINE_MUTEX(__iommu_mem_list_lock); - /* * This domain is a statically identity mapping domain. * 1. This domain creats a static 1:1 mapping to all usable memory. @@ -4935,3 +4938,52 @@ int __iommu_free_mapped_mem(void) return 0; } +/* + * Load the old root entry table to new root entry table. 
+ */
+static void __iommu_load_old_root_entry(struct intel_iommu *iommu)
+{
+	if ((!iommu)
+		|| (!iommu->root_entry)
+		|| (!iommu->root_entry_old_virt)
+		|| (!iommu->root_entry_old_phys))
+		return;
+	memcpy(iommu->root_entry, iommu->root_entry_old_virt, PAGE_SIZE);
+
+	__iommu_flush_cache(iommu, iommu->root_entry, PAGE_SIZE);
+}
+
+/*
+ * When the data in new root entry table is changed, this function
+ * must be called to save the updated data to old root entry table.
+ */
+static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int index)
+{
+	unsigned long start;
+	unsigned long size;
+	void __iomem *to;
+	void *from;
+
+	if ((!iommu)
+		|| (!iommu->root_entry)
+		|| (!iommu->root_entry_old_virt)
+		|| (!iommu->root_entry_old_phys))
+		return;
+
+	if (index < -1 || index >= ROOT_ENTRY_NR)
+		return;
+
+	if (index == -1) {
+		start = 0;
+		size = ROOT_ENTRY_NR * sizeof(struct root_entry);
+	} else {
+		start = index * sizeof(struct root_entry);
+		size = sizeof(struct root_entry);
+	}
+	to = iommu->root_entry_old_virt;
+	from = iommu->root_entry;
+	memcpy(to + start, from + start, size);
+
+	__iommu_flush_cache(iommu, to + start, size);
+}
+
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ced1fac..e7cac12 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -340,6 +340,9 @@ struct intel_iommu {
 	spinlock_t	lock; /* protect context, domain ids */
 	struct root_entry *root_entry; /* virtual address */
+	void __iomem	*root_entry_old_virt; /* mapped from old root entry */
+	unsigned long	root_entry_old_phys; /* root entry in old kernel */
+
 	struct iommu_flush flush;
 #endif
 	struct q_inval *qi;	/* Queued invalidation info */
--
2.0.0-rc0
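The update helper above computes a byte range inside the 4 KiB root-entry table before syncing it back to the old kernel's copy. A small userspace sketch of just that arithmetic (sizes hard-coded to the VT-d values of 256 entries of 16 bytes; the function name is illustrative, not from the patch). Note that for the last bus number the offset is 4080, which is why the start offset needs a type wider than u8:

```c
#include <assert.h>
#include <stddef.h>

#define ROOT_ENTRY_SIZE 16UL  /* sizeof(struct root_entry): two u64 words */
#define ROOT_ENTRY_NR   256UL /* VTD_PAGE_SIZE / ROOT_ENTRY_SIZE */

/* Mirror of the range computation in __iommu_update_old_root_entry():
 * index == -1 means "sync the whole table", otherwise one entry. */
static void update_range(int index, unsigned long *start, unsigned long *size)
{
    if (index == -1) {
        *start = 0;
        *size = ROOT_ENTRY_NR * ROOT_ENTRY_SIZE; /* full 4096-byte page */
    } else {
        *start = index * ROOT_ENTRY_SIZE; /* up to 255 * 16 = 4080 */
        *size = ROOT_ENTRY_SIZE;
    }
}
```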
Re: [PATCH v11 02/10] iommu/vt-d: Items required for kdump
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
Add context entry functions needed for kdump.

Bill Sumner: Original version;
Li, Zhenhua: Changed the names of the new functions, making them consistent with the current context get/set functions. Removed the structure dve which is not used in the new version.

Signed-off-by: Bill Sumner billsumnerli...@gmail.com
Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
---
 drivers/iommu/intel-iommu.c | 72 +
 1 file changed, 72 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index cb9d6cc..1e7ceb5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -190,6 +190,31 @@ struct root_entry {
 };
 #define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
+static inline bool root_present(struct root_entry *root)
+{
+	return (root->lo & 1);
+}
+
+static inline void set_root_value(struct root_entry *root, unsigned long value)
+{
+	root->lo &= ~VTD_PAGE_MASK;
+	root->lo |= value & VTD_PAGE_MASK;
+}
+
+static inline struct context_entry *
+get_context_addr_from_root(struct root_entry *root)
+{
+	return (struct context_entry *)
+		(root_present(root) ? phys_to_virt(
+		root->lo & VTD_PAGE_MASK) :
+		NULL);
+}
+
+static inline unsigned long
+get_context_phys_from_root(struct root_entry *root)
+{
+	return root_present(root) ?
+		(root->lo & VTD_PAGE_MASK) : 0;
+}
 /*
  * low 64 bits:
@@ -211,6 +236,32 @@ static inline bool context_present(struct context_entry *context)
 {
 	return (context->lo & 1);
 }
+
+static inline int context_fault_enable(struct context_entry *c)
+{
+	return ((c->lo >> 1) & 0x1);
+}
+
+static inline int context_translation_type(struct context_entry *c)
+{
+	return ((c->lo >> 2) & 0x3);
+}
+
+static inline u64 context_address_root(struct context_entry *c)
+{
+	return ((c->lo >> VTD_PAGE_SHIFT));
+}
+
+static inline int context_address_width(struct context_entry *c)
+{
+	return ((c->hi >> 0) & 0x7);
+}
+
+static inline int context_domain_id(struct context_entry *c)
+{
+	return ((c->hi >> 8) & 0xffff);
+}
+
 static inline void context_set_present(struct context_entry *context)
 {
 	context->lo |= 1;
@@ -296,6 +347,27 @@ static inline int first_pte_in_page(struct dma_pte *pte)
 	return !((unsigned long)pte & ~VTD_PAGE_MASK);
 }
+
+/*
+ * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
+ *
+ * Fixes the crashdump kernel to deal with an active iommu and legacy
+ * DMA from the (old) panicked kernel in a manner similar to how legacy
+ * DMA is handled when no hardware iommu was in use by the old kernel --
+ * allow the legacy DMA to continue into its current buffers.
+ *
+ * In the crashdump kernel, this code:
+ * 1. skips disabling the IOMMU's translating.
+ * 2. Do not re-enable IOMMU's translating.
+ * 3. In kdump kernel, use the old root entry table.
+ * 4. Allocate pages for new context entry, copy data from old context entries
+ *    in the old kernel to the new ones.
+ *
+ * In other kinds of kernel, for example, a kernel started by kexec,
+ * do the same thing as crashdump kernel.
+ */
+
+
The above comments should come along with the code changes instead of being put in this patch. BTW, there's one more blank line at the end..

 /*
  * This domain is a statically identity mapping domain.
  *	1. This domain creats a static 1:1 mapping to all usable memory.
--
2.0.0-rc0
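As a sanity check on the bit positions the accessors in this patch use (AW in hi[2:0], DID in hi[23:8], present in lo[0]), here is a self-contained userspace sketch; the struct is a plain stand-in for the kernel's context entry, and the helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for struct context_entry (two 64-bit words). */
struct ctx {
    uint64_t lo;
    uint64_t hi;
};

/* Same bit positions as the patch's accessors:
 * hi[2:0] = address width (AW), hi[23:8] = domain id (DID), lo[0] = present. */
static int ctx_address_width(const struct ctx *c) { return (c->hi >> 0) & 0x7; }
static int ctx_domain_id(const struct ctx *c)     { return (c->hi >> 8) & 0xffff; }
static int ctx_present(const struct ctx *c)       { return c->lo & 1; }
```

Packing a value and reading it back confirms the shifts and masks round-trip, which is the property the kdump copy path relies on when it reuses the old kernel's domain ids.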
Re: [PATCH v11 04/10] iommu/vt-d: functions to copy data from old mem
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
Add some functions to copy the data from the old kernel. These functions are used to copy context tables and page tables. To avoid calling iounmap between spin_lock_irqsave and spin_unlock_irqrestore, use a link here, store the pointers, and then use iounmap to free them in another place.

Li, Zhen-hua: The functions and logics.
Takao Indoh: Check if pfn is ram: if (page_is_ram(pfn))

Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
Signed-off-by: Takao Indoh indou.ta...@jp.fujitsu.com
---
 drivers/iommu/intel-iommu.c | 102
 include/linux/intel-iommu.h |   6 +++
 2 files changed, 108 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 07e6118..0b97c15 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -371,6 +371,17 @@ static struct context_entry *device_to_existing_context_entry(
 		struct intel_iommu *iommu,
 		u8 bus, u8 devfn);
+/*
+ * A structure used to store the address allocated by ioremap();
+ * Then we need to call iounmap() to free them out of spin_lock_irqsave/unlock;
+ */
+struct iommu_remapped_entry {
+	struct list_head list;
+	void __iomem *mem;
+};
+static LIST_HEAD(__iommu_remapped_mem);
+static DEFINE_MUTEX(__iommu_mem_list_lock);
+
 /*
  * This domain is a statically identity mapping domain.
@@ -4833,3 +4844,94 @@ static struct context_entry *device_to_existing_context_entry(
 	return ret;
 }
+/*
+ * Copy memory from a physically-addressed area into a virtually-addressed area
+ */
+int __iommu_load_from_oldmem(void *to, unsigned long from, unsigned long size)
+{
+	unsigned long pfn;		/* Page Frame Number */
+	size_t csize = (size_t)size;	/* Num(bytes to copy) */
+	unsigned long offset;		/* Lower 12 bits of to */

The comments for the above variables are unnecessary; they are clear enough. BTW, using size_t for the size argument is better.
+	void __iomem *virt_mem;
+	struct iommu_remapped_entry *mapped;
+
+	pfn = from >> VTD_PAGE_SHIFT;
+	offset = from & (~VTD_PAGE_MASK);
+
+	if (page_is_ram(pfn)) {
+		memcpy(to, pfn_to_kaddr(pfn) + offset, csize);
+	} else {
+
+		mapped = kzalloc(sizeof(struct iommu_remapped_entry),
+				GFP_KERNEL);
+		if (!mapped)
+			return -ENOMEM;
+
+		virt_mem = ioremap_cache((unsigned long)from, size);
+		if (!virt_mem) {
+			kfree(mapped);
+			return -ENOMEM;
+		}
+		memcpy(to, virt_mem, size);

csize or size? As the previous comment said, with size_t used in the argument it will be fine here.

+
+		mutex_lock(&__iommu_mem_list_lock);
+		mapped->mem = virt_mem;
+		list_add_tail(&mapped->list, &__iommu_remapped_mem);
+		mutex_unlock(&__iommu_mem_list_lock);
+	}
+	return size;

Type mismatch?

+}
+
+/*
+ * Copy memory from a virtually-addressed area into a physically-addressed area
+ */
+int __iommu_save_to_oldmem(unsigned long to, void *from, unsigned long size)
+{
+	unsigned long pfn;		/* Page Frame Number */
+	size_t csize = (size_t)size;	/* Num(bytes to copy) */
+	unsigned long offset;		/* Lower 12 bits of to */

Ditto about the data type and comments..

+	void __iomem *virt_mem;
+	struct iommu_remapped_entry *mapped;
+
+	pfn = to >> VTD_PAGE_SHIFT;
+	offset = to & (~VTD_PAGE_MASK);
+
+	if (page_is_ram(pfn)) {
+		memcpy(pfn_to_kaddr(pfn) + offset, from, csize);
+	} else {
+		mapped = kzalloc(sizeof(struct iommu_remapped_entry),
+				GFP_KERNEL);
+		if (!mapped)
+			return -ENOMEM;
+
+		virt_mem = ioremap_cache((unsigned long)to, size);
+		if (!virt_mem) {
+			kfree(mapped);
+			return -ENOMEM;
+		}
+		memcpy(virt_mem, from, size);
+		mutex_lock(&__iommu_mem_list_lock);
+		mapped->mem = virt_mem;
+		list_add_tail(&mapped->list, &__iommu_remapped_mem);
+		mutex_unlock(&__iommu_mem_list_lock);
+	}
+	return size;
+}

The above two functions look duplicated; can they be merged into one function with an argument for the direction?
+
+/*
+ * Free the mapped memory for ioremap;
+ */
+int __iommu_free_mapped_mem(void)
+{
+	struct iommu_remapped_entry *mem_entry, *tmp;
+
+	mutex_lock(&__iommu_mem_list_lock);
+	list_for_each_entry_safe(mem_entry, tmp, &__iommu_remapped_mem, list) {
+		iounmap(mem_entry->mem);
+		list_del(&mem_entry->list);
+		kfree(mem_entry);
+	}
+
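On the "merge the two functions" suggestion above: a minimal userspace sketch of what a single direction-parameterized helper could look like. The kernel details (page_is_ram(), ioremap_cache(), the remapped-entry list and its mutex) are elided; the function name and the enum are illustrative, not from the patch.

```c
#include <string.h>
#include <sys/types.h>

enum oldmem_dir {
    OLDMEM_LOAD, /* old kernel memory -> new buffer */
    OLDMEM_SAVE, /* new buffer -> old kernel memory */
};

/* One helper instead of __iommu_load_from_oldmem()/__iommu_save_to_oldmem().
 * 'oldmem' stands in for the mapping of the old kernel's physical range;
 * in the kernel it would come from pfn_to_kaddr() or ioremap_cache(). */
static ssize_t iommu_copy_oldmem(void *buf, void *oldmem, size_t size,
                                 enum oldmem_dir dir)
{
    if (dir == OLDMEM_LOAD)
        memcpy(buf, oldmem, size);
    else
        memcpy(oldmem, buf, size);
    return (ssize_t)size;
}
```

The direction flag keeps the pfn/offset computation, the ram check, and the remap bookkeeping in one place, which also answers the size/csize type-mismatch comments in a single spot.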
Re: [PATCH v11 06/10] iommu/vt-d: datatypes and functions used for kdump
The patch subject is bad: in the previous patch you used "Items required for kdump", and in this patch "datatypes and functions used for kdump"; so what's the difference between these two patches? Please think of better subjects for these patches..

On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
Populate it with support functions to copy iommu translation tables from the panicked kernel into the kdump kernel in the event of a crash.

Functions:
Use the old root entry table, and load the old data into root_entry as a cache.
Malloc a new context table and copy the old context table into the new one.

Bill Sumner: Original version, the creation of the data types and functions.
Li, Zhenhua: Create new function iommu_check_pre_te_status() to check status. Update the callers of context_get_* and context_put_*, using context_* and context_set_* as replacements. Update the name of the function that loads the root entry table. Use new functions to copy old context entry tables and page tables. Use unsigned long for physical addresses. Remove the functions that copy page tables in Bill's version. Remove usage of dve and ppap in Bill's version.

Signed-off-by: Bill Sumner billsumnerli...@gmail.com
Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com
---
 drivers/iommu/intel-iommu.c | 121
 include/linux/intel-iommu.h |   3 ++
 2 files changed, 124 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3a5d446..28c3c64 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -386,6 +386,18 @@ struct iommu_remapped_entry {
 static LIST_HEAD(__iommu_remapped_mem);
 static DEFINE_MUTEX(__iommu_mem_list_lock);
+/* ---------------------------------------------------------------

Please remove the "---" line..

+ * Copy iommu translation tables from old kernel into new  kernel.
One more space between "new" and "kernel".

+ * Entry to this set of functions is: intel_iommu_load_translation_tables()
+ * ---------------------------------------------------------------

Dropping the above "---" line is better.

+ */
+
+static int copy_root_entry_table(struct intel_iommu *iommu);
+
+static int intel_iommu_load_translation_tables(struct intel_iommu *iommu);
+
+static void iommu_check_pre_te_status(struct intel_iommu *iommu);
+
 /*
  * This domain is a statically identity mapping domain.
  *	1. This domain creats a static 1:1 mapping to all usable memory.
@@ -4987,3 +4999,112 @@ static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int index)
 	__iommu_flush_cache(iommu, to + start, size);
 }
+/*
+ * Load root entry tables from old kernel.
+ */
+static int copy_root_entry_table(struct intel_iommu *iommu)
+{
+	u32 bus;				/* Index: root-entry-table */
+	struct root_entry *re;			/* Virt(iterator: new table) */
+	unsigned long context_old_phys;		/* Phys(context table entry) */
+	struct context_entry *context_new_virt;	/* Virt(new context_entry) */
+
+	/*
+	 * A new root entry table has been allocated ,

One more space before ','

+	 * we need copy re from old kernel to the new allocated one.
+	 */
+
+	if (!iommu->root_entry_old_phys)
+		return -ENOMEM;
+
+	for (bus = 0, re = iommu->root_entry; bus < 256; bus += 1, re += 1) {
+		if (!root_present(re))
+			continue;
+
+		context_old_phys = get_context_phys_from_root(re);
+
+		if (!context_old_phys)
+			continue;
+
+		context_new_virt =
+			(struct context_entry *)alloc_pgtable_page(iommu->node);
+
+		if (!context_new_virt)
+			return -ENOMEM;
+
+		__iommu_load_from_oldmem(context_new_virt,
+					context_old_phys,
+					VTD_PAGE_SIZE);
+
+		__iommu_flush_cache(iommu, context_new_virt, VTD_PAGE_SIZE);
+
+		set_root_value(re, virt_to_phys(context_new_virt));
+	}
+
+	return 0;
+}
+
+/*
+ * Interface to the load translation tables set of functions
+ * from mainline code.
+ */
+static int intel_iommu_load_translation_tables(struct intel_iommu *iommu)
+{
+	unsigned long long q;	/* quadword scratch */
+	int ret = 0;		/* Integer return code */

The comment for ret is not necessary, and "quadword" is also unnecessary; just explaining the purpose of 'q' is enough.

+	unsigned long flags;
+
+	q = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
+	if (!q)
+		return -1;
+
+	spin_lock_irqsave(&iommu->lock, flags);
+
+	/* Load the root-entry table from the old kernel
+	 * foreach context_entry_table in root_entry
+	 *	Copy each entry table from old kernel
+	 */
+	if
Re: [PATCH v11 07/10] iommu/vt-d: enable kdump support in iommu module
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote: Modify the operation of the following functions when called during crash dump: iommu_context_addr free_context_table get_domain_for_dev init_dmars intel_iommu_init Bill Sumner: Original version. Zhenhua: The name of new calling functions. Do not disable and re-enable TE in kdump kernel. Use the did and gaw from old context entry; Signed-off-by: Bill Sumner billsumnerli...@gmail.com Signed-off-by: Li, Zhen-Hua zhen-h...@hp.com --- drivers/iommu/intel-iommu.c | 95 +++-- 1 file changed, 83 insertions(+), 12 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 28c3c64..91545bf 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -397,6 +397,7 @@ static int copy_root_entry_table(struct intel_iommu *iommu); static int intel_iommu_load_translation_tables(struct intel_iommu *iommu); static void iommu_check_pre_te_status(struct intel_iommu *iommu); +static u8 g_translation_pre_enabled; /* * This domain is a statically identity mapping domain. 
@@ -794,6 +795,9 @@ static inline struct context_entry *iommu_context_addr(struct intel_iommu *iommu
 		phy_addr = virt_to_phys((void *)context);
 		*entry = phy_addr | 1;
 		__iommu_flush_cache(iommu, entry, sizeof(*entry));
+
+		if (iommu->pre_enabled_trans)
+			__iommu_update_old_root_entry(iommu, bus);
 	}
 	return &context[devfn];
 }
@@ -887,13 +891,15 @@ static void clear_context_table(struct intel_iommu *iommu, u8 bus, u8 devfn)
 static void free_context_table(struct intel_iommu *iommu)
 {
+	struct root_entry *root = NULL;
 	int i;
 	unsigned long flags;
 	struct context_entry *context;
 	spin_lock_irqsave(&iommu->lock, flags);
 	if (!iommu->root_entry) {
-		goto out;
+		spin_unlock_irqrestore(&iommu->lock, flags);
+		return;
 	}
 	for (i = 0; i < ROOT_ENTRY_NR; i++) {
 		context = iommu_context_addr(iommu, i, 0, 0);
@@ -908,10 +914,23 @@ static void free_context_table(struct intel_iommu *iommu)
 		free_pgtable_page(context);
 	}
+
+	if (iommu->pre_enabled_trans) {
+		iommu->root_entry_old_phys = 0;
+		root = iommu->root_entry_old_virt;
+		iommu->root_entry_old_virt = NULL;
+	}
+
 	free_pgtable_page(iommu->root_entry);
 	iommu->root_entry = NULL;
-out:
+
 	spin_unlock_irqrestore(&iommu->lock, flags);
+
+	/* We put this outside of spin_lock/unlock because iounmap() may
+	 * cause an error if surrounded by spin_lock and unlock;
+	 */
+	if (iommu->pre_enabled_trans)
+		iounmap(root);
 }
 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
@@ -2333,6 +2352,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
 	unsigned long flags;
 	u8 bus, devfn;
 	int did = -1;	/* Default to no domain_id supplied */
+	struct context_entry *ce = NULL;
 	domain = find_domain(dev);
 	if (domain)
@@ -2366,6 +2386,20 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
 	domain = alloc_domain(0);
 	if (!domain)
 		return NULL;
+
+	if (iommu->pre_enabled_trans) {
+		/*
+		 * if this device had a did in the old kernel
+		 * use its values instead of generating new ones
+		 */
+		ce = device_to_existing_context_entry(iommu, bus,
+							devfn);
+
+		if (ce) {
+			did = context_domain_id(ce);
+			gaw = agaw_to_width(context_address_width(ce));
+		}
+	}
+
 	domain->id = iommu_attach_domain_with_id(domain, iommu, did);
 	if (domain->id < 0) {
 		free_domain_mem(domain);
@@ -2897,6 +2931,7 @@ static int __init init_dmars(void)
 		goto free_g_iommus;
 	}
+	g_translation_pre_enabled = 0;	/* To know whether to skip RMRR */
 	for_each_active_iommu(iommu, drhd) {
 		g_iommus[iommu->seq_id] = iommu;
 		if (ret)
 			goto free_iommu;
-		/*
-		 * TBD:
-		 * we could share the same root context tables
-		 * among all IOMMU's. Need to Split it later.
-		 */
-		ret = iommu_alloc_root_entry(iommu);
-		if (ret)
-			goto free_iommu;
+		iommu_check_pre_te_status(iommu);
+		if (iommu->pre_enabled_trans) {
+			pr_info("IOMMU Copying translate tables from panicked kernel\n");
+			ret = intel_iommu_load_translation_tables(iommu);
+
Re: [PATCH v11 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/11/15 at 05:52pm, Li, Zhen-Hua wrote:
This patchset is an update of Bill Sumner's patchset, and implements a fix for the following problem: if a kernel boots with intel_iommu=on on a system that supports intel vt-d, when a panic happens, the kdump kernel will boot with these faults:

dmar: DRHD: handling fault status reg 102
dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8
DMAR:[fault reason 01] Present bit in root entry is clear
dmar: DRHD: handling fault status reg 2
dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear

On some systems, the interrupt remapping fault will also happen even if intel_iommu is not set to on, because interrupt remapping will be enabled when x2apic is needed by the system.

The cause of the DMA fault is described in Bill's original version, and the INTR-REMAP fault is caused by a similar reason. In short, the initialization of the vt-d drivers causes the in-flight DMA and interrupt requests to get a wrong response.

To fix this problem, we modify the behaviors of the intel vt-d driver in the crashdump kernel:

For DMA Remapping:
1. Accept the vt-d hardware in an active state,
2. Do not disable and re-enable the translation, keep it enabled.
3. Use the old root entry table, do not rewrite the RTA register.
4. Malloc and use new context entry tables, copying data from the old ones that were used by the old kernel.
5. Keep using the old page tables before the driver is loaded.
6. After the device driver is loaded, when it issues the first dma_map command, free the dmar_domain structure for this device, and generate a new one, so that the device can be assigned a new and empty page table.
7. When a new context entry table is generated, we also save its address to the old root entry table.

For Interrupt Remapping:
1. Accept the vt-d hardware in an active state,
2. Do not disable and re-enable the interrupt remapping, keep it enabled.
3.
Use the old interrupt remapping table, do not rewrite the IRTA register.
4. When an ioapic entry is set up, the interrupt remapping table is changed, and the updated data will be stored to the old interrupt remapping table.

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver for that device.
2. This approach behaves in a manner very similar to operation without an active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone. This supports the practice of creating a special kdump kernel without drivers for any devices that are not required for taking a crashdump.
5. Minimal code changes among the existing mainline intel vt-d code.

Summary of changes in this patch set:
1. Added some useful functions for the root entry table in intel-iommu.c
2. Added new members to struct root_entry and struct irte;
3. Functions to load the old root entry table into iommu->root_entry from the memory of the old kernel.
4. Functions to malloc a new context entry table and copy the data from the old ones to the malloced new ones.
5. Functions to enable support for DMA remapping in the kdump kernel.
6. Functions to load old irte data from the old kernel into the kdump kernel.
7. Some code changes that support other behaviours that have been listed.
8. In the new functions, use physical addresses as unsigned long type, not pointers.

Original version by Bill Sumner:
https://lkml.org/lkml/2014/1/10/518
https://lkml.org/lkml/2014/4/15/716
https://lkml.org/lkml/2014/4/24/836

Zhenhua's updates:
https://lkml.org/lkml/2014/10/21/134
https://lkml.org/lkml/2014/12/15/121
https://lkml.org/lkml/2014/12/22/53
https://lkml.org/lkml/2015/1/6/1166
https://lkml.org/lkml/2015/1/12/35
https://lkml.org/lkml/2015/3/19/33
https://lkml.org/lkml/2015/4/10/135

Changelog[v11]:
1.
Fix some conflicts with the latest upstream kernel. Add flush in iommu_context_addr.

Changelog[v10]:
1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel(). Use one flag which stores the te and ir status in the last kernel: iommu->pre_enabled_trans, iommu->pre_enabled_ir.

Changelog[v9]:
1. Add new function iommu_attach_domain_with_id.
2. Do not copy old page tables, keep using the old ones.
3. Remove functions: intel_iommu_did_to_domain_values_entry, intel_iommu_get_dids_from_old_kernel, device_to_domain_id, copy_page_addr, copy_page_table, copy_context_entry, copy_context_entry_table.
4. Add new function device_to_existing_context_entry.

Changelog[v8]:
1. Add a missing __iommu_flush_cache in function copy_page_table.
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/11/15 at 12:11pm, Joerg Roedel wrote:
On Thu, May 07, 2015 at 09:56:00PM +0800, Dave Young wrote:
Joerg, I can not find the last reply from you, so just reply here about my worries. I said that the patchset will cause more problems; let me explain more here: Suppose the page table was corrupted, i.e. the original mapping iova1 -> page 1 was accidentally changed to iova1 -> page 2 while the crash was happening; thus future dma will read/write page 2 instead of page 1, right?

When the page-table is corrupted then it is a left-over from the old kernel. When the kdump kernel boots the situation can at least not get worse. For the page tables it is also hard to detect wrong mappings (if this were possible the hardware could already do it), so any checks we could do there are of limited use anyway.

Joerg, since both of you do not think it is a problem, I will not object anymore, though I still do not like reusing the old page tables. So let's leave it as a future issue.

Thanks
Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/07/15 at 10:12am, Don Dutile wrote:
On 04/06/2015 11:46 PM, Dave Young wrote:
On 04/05/15 at 09:54am, Baoquan He wrote:
On 04/03/15 at 05:21pm, Dave Young wrote:
On 04/03/15 at 05:01pm, Li, ZhenHua wrote:
Hi Dave,

There may be some possibilities that the old iommu data is corrupted by some other modules. Currently we do not have a better solution for the dmar faults. But I think when this happens, we need to fix the module that corrupted the old iommu data. I once met a similar problem in a normal kernel: the queue used by the qi_* functions was written again by another module. The fix was in that module, not in the iommu module.

It is too late then, there will be no chance to save the vmcore. Also if it is possible to further corrupt other areas of oldmem because of using the old iommu tables, then it will cause more problems. So I think the tables at least need some verification before being used.

Yes, it's good thinking about this, and verification is also an interesting idea. kexec/kdump does a sha256 calculation on the loaded kernel and then verifies it again in purgatory when a panic happens. This checks whether any code stomps into the region reserved for the kexec kernel and corrupts the loaded kernel. If this is decided to be done, it should be an enhancement to the current patchset, not an approach change. Since this patchset is very close to the point the maintainers expected, maybe it can be merged first, and then think about enhancements. After all, without this patchset vt-d often raised error messages and hung.

It does not convince me; we should do it right at the beginning instead of introducing something wrong. I wonder why the old dma can not be remapped to a specific page in the kdump kernel so that it will not corrupt more memory. But I may have missed something; I will look for the old threads and catch up.
Thanks
Dave

The (only) issue is not corruption, but that once the iommu is re-configured, the old, not-yet-stopped dma engines will use iova's that will generate dmar faults, which will be enabled when the iommu is re-configured (even to a single/simple paging scheme) in the kexec kernel.

Don, so if the iommu is not reconfigured then these faults will not happen?

Baoquan and I had a question today about iommu=off/intel_iommu=off:

intel_iommu_init()
{
	...
	dmar_table_init();
	disable active iommu translations;
	if (no_iommu || dmar_disabled)
		goto out_free_dmar;
	...
}

Any reason not to move the no_iommu check to the beginning of the intel_iommu_init function?

Thanks
Dave
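The ordering question above can be illustrated with stubs: if the `no_iommu || dmar_disabled` check runs first, the DMAR table parsing and the translation-disable step are never touched at all. This is a hypothetical userspace sketch of the suggested flow, not the actual kernel function:

```c
#include <assert.h>

/* Stand-ins for kernel state (hypothetical). */
static int no_iommu, dmar_disabled;
static int dmar_table_parsed;

static void dmar_table_init_stub(void) { dmar_table_parsed = 1; }

/* Suggested ordering: bail out before doing any DMAR work. */
static int intel_iommu_init_sketch(void)
{
    if (no_iommu || dmar_disabled)
        return -1; /* nothing touched: no table parse, no disable step */
    dmar_table_init_stub();
    /* ... disable active translations, register drivers, etc. ... */
    return 0;
}
```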
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/06/15 at 10:16am, Joerg Roedel wrote:
On Wed, May 06, 2015 at 09:46:49AM +0800, Dave Young wrote:
For the original problem, the key issue is that dmar faults cause the kdump kernel to hang so that the vmcore can not be saved. I do not know the reason why it hangs; I think it is acceptable if the kdump kernel boots ok with some dma errors..

It hangs because some devices can't handle the DMAR faults, so the kdump kernel can't initialize them and will hang itself. For that it doesn't matter whether the fault was caused by a read or write request.

Ok, thanks for the explanation. That explains why sometimes the kdump kernel boots ok with faults, and sometimes it hangs instead.

Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/04/15 at 01:05pm, Joerg Roedel wrote:
On Fri, Apr 03, 2015 at 04:40:31PM +0800, Dave Young wrote:
Have not read all the patches, but I have a question, not sure this has been answered before. Old memory is not reliable; what if the old memory got corrupted before the panic? Is it safe to continue using it in the 2nd kernel? I worry that it will cause problems.

Yes, the old memory could be corrupted, and there are more failure cases left which we have no way of handling yet (if iommu data structures are in kdump backup areas). The question is what to do if we find some of the old data structures corrupted, and how far the tests should go. Should we also check the page-tables, for example? I think if some of the data structures for a device are corrupted, it probably already failed in the old kernel and things won't get worse in the new one.

Joerg, I can not find the last reply from you, so just reply here about my worries. I said that the patchset will cause more problems; let me explain more here:

Suppose the page table was corrupted, i.e. the original mapping iova1 -> page 1 was accidentally changed to iova1 -> page 2 while the crash was happening; thus future dma will read/write page 2 instead of page 1, right? So the behavior changes like this: originally, dmar faults happen, but the kdump kernel may boot ok with these faults, and the vmcore can be saved. With the patchset, dmar faults do not happen and dma translation will be handled, but dma writes could corrupt irrelevant memory. This might be a corner case, but who knows; because the kernel panicked we can not assume the old page table is right.

But it seems you all think it is safe; let us understand each other first, then go to a solution. Today we talked with Zhenhua about the problem; I think both of us are clear about the problems. He just thinks it can be left as future work.

Thanks
Dave
Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/05/15 at 05:31pm, Joerg Roedel wrote:
On Tue, May 05, 2015 at 02:14:23PM +0800, Dave Young wrote:
The failure is nothing different, but as I said in another reply, the difference is we could use corrupted data to possibly cause more failures.

I still fail to see how things can get worse than they already are by reusing the old data (we just reuse it, we do not modify anything).

DMA writes will modify system ram; if the old data is corrupted, it is possible that a DMA operation modifies wrong ram regions because of a wrong mapping. Am I missing something, or is it not possible?

Thanks
Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/05/15 at 05:23pm, Joerg Roedel wrote:
On Tue, May 05, 2015 at 02:09:31PM +0800, Dave Young wrote:
I agree that we can do nothing with the old corrupted data, but I worry about future corruption after using the old corrupted data. I wonder if we can mark all the oldmem as readonly so that we can lower the risk. Is it reasonable?

Do you mean marking it read-only for the devices? That will very likely cause DMAR faults, re-introducing the problem this patch-set tries to fix.

I mean to block all dma writes to oldmem; I believe it will cause DMA errors. But all other DMA read requests will continue and work. This would avoid possible future corruption. It would solve half of the problem at least?

For the original problem, the key issue is that dmar faults cause the kdump kernel to hang so that the vmcore can not be saved. I do not know the reason why it hangs; I think it is acceptable if the kdump kernel boots ok with some dma errors..

Thanks
Dave
Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/04/15 at 06:23pm, Joerg Roedel wrote:
On Fri, Apr 24, 2015 at 04:49:57PM +0800, Dave Young wrote:
I'm more than happy to see this issue fixed in the patchset; I do not agree with adding the code there with such problems. OTOH, for now it seems there's no way to fix it.

And that's the point. We have discussed this issue and possible solutions for years by now, and what ZhenHua implemented is what we agreed to be the best effort for what we can do in the kdump case with the IOMMU enabled. Of course there are still failure scenarios left, but that is not different from systems without any IOMMU.

The failure is nothing different, but as I said in another reply, the difference is we could use corrupted data to possibly cause more failures.

Thanks
Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 05/04/15 at 01:05pm, Joerg Roedel wrote:
On Fri, Apr 03, 2015 at 04:40:31PM +0800, Dave Young wrote:
Have not read all the patches, but I have a question, not sure this has been answered before. Old memory is not reliable; what if the old memory got corrupted before the panic? Is it safe to continue using it in the 2nd kernel? I worry that it will cause problems.

Yes, the old memory could be corrupted, and there are more failure cases left which we have no way of handling yet (if iommu data structures are in kdump backup areas). The question is what to do if we find some of the old data structures corrupted, and how far the tests should go. Should we also check the page-tables, for example? I think if some of the data structures for a device are corrupted, it probably already failed in the old kernel and things won't get worse in the new one.

I agree that we can do nothing with the old corrupted data, but I worry about future corruption after using the old corrupted data. I wonder if we can mark all the oldmem as readonly so that we can lower the risk. Is it reasonable?

Thanks
Dave
Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/24/15 at 04:35pm, Baoquan He wrote: On 04/24/15 at 04:25pm, Dave Young wrote: Hi, Baoquan I support this patchset. We should not fear oldmem since the reserved crashkernel region is similar. No one can guarantee that crazy code won't step into the crashkernel region just because the 1st kernel says it's reserved for the kdump kernel. Here the root table and context tables are also not built to allow legal code to damage. Both of them have the risk of being corrupted; to try our best to get a dumped vmcore, the risk is worth taking. old mem is mapped in the 1st kernel, so compared with the reserved crashkernel region it is more likely to be corrupted. they are totally different. Could you tell how and why they are different? Wrong code will choose root tables and context tables to damage when it totally loses control? iommu will map io addresses to system ram, right? not to reserved ram, but yes I'm assuming the page table is right, but I was worrying they could be corrupted while a kernel panic is happening. And the resetting pci way has been NACKed by David Woodhouse, the maintainer of intel iommu. Because the place calling the pci resetting code is ugly, whether before the kdump kernel or in the kdump kernel. And as he said, a certain device made mistakes, so why blame all devices? We should fix the device that made mistakes. Resetting the pci bus is not uglier than fixing a problem with risk and then fixing the problems it introduces in the future. There's a problem, we fix the problem. If that's uglier, I need to redefine 'ugly' in my personal dict. You mean the problem it could introduce is that wrong code will damage the root table and context tables; why don't we fix that wrong code, instead of blaming innocent context tables? So you mean these tables deserve being damaged by wrong code? I'm more than happy to see this issue fixed in the patchset; I do not agree to add the code there with such problems. OTOH, for now it seems there's no way to fix it. 
I know it is late to speak out, but sorry, I still object and have to NACK this oldmem approach from my point of view. Thanks Dave
Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
Hi, On 04/21/15 at 09:39am, Li, ZhenHua wrote: Hi Dave, I found the old mail: http://lkml.iu.edu/hypermail/linux/kernel/1410.2/03584.html I know and I have read it before. == quote === So with this in mind I would prefer initially taking over the page-tables from the old kernel before the device drivers re-initialize the devices. This makes the dump kernel more dependent on data from the old kernel, which we obviously want to avoid when possible. Sure, but this is not really possible here (unless we have a generic and reliable way to reset all PCI endpoint devices and cancel all in-flight DMA before we disable the IOMMU in the kdump kernel). Otherwise we always risk data corruption somewhere, in system memory or on disk. === quote end === What I understand from the above is that it is not really possible to avoid the problem. But IMHO we should avoid it or we will have problems in the future; if we really cannot avoid it I would say switching to the pci reset way is better. Please check this and you will find the discussion. Regards Zhenhua On 04/15/2015 02:48 PM, Dave Young wrote: On 04/15/15 at 01:47pm, Li, ZhenHua wrote: On 04/15/2015 08:57 AM, Dave Young wrote: Again, I think it is bad to use the old page table; the issues below need to be considered: 1) make sure the old page tables are reliable across the crash 2) do not allow writing oldmem after the crash Please correct me if I'm wrong, or if the above is not doable I think I will vote for resetting the pci bus. Thanks Dave Hi Dave, When updating the context tables, we have to write their address to the root tables; this will cause writing to old mem. Resetting the pci bus has been discussed, please check this: http://lists.infradead.org/pipermail/kexec/2014-October/012752.html https://lkml.org/lkml/2014/10/21/890 I know one reason to use the old pgtable is that it looks better because it fixes the real problem, but it is not a good way if it introduces more problems because it has to use oldmem. I will be glad if this is not a problem, but I have not been convinced. 
OTOH, there are many types of iommu: intel, amd, a lot of other types. They need their own fixes, so it does not look that elegant. The pci reset is not perfect, but it has another advantage: the patch is simpler. The problem I see from the old discussion is that resetting the bus in the 2nd kernel is acceptable but it does not fix things on the sparc platform. AFAIK the currently reported problems are on intel and amd iommu; at least the pci reset stuff does not make things worse. Thanks Dave
Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/15/15 at 01:47pm, Li, ZhenHua wrote: On 04/15/2015 08:57 AM, Dave Young wrote: Again, I think it is bad to use the old page table; the issues below need to be considered: 1) make sure the old page tables are reliable across the crash 2) do not allow writing oldmem after the crash Please correct me if I'm wrong, or if the above is not doable I think I will vote for resetting the pci bus. Thanks Dave Hi Dave, When updating the context tables, we have to write their address to the root tables; this will cause writing to old mem. Resetting the pci bus has been discussed, please check this: http://lists.infradead.org/pipermail/kexec/2014-October/012752.html https://lkml.org/lkml/2014/10/21/890 I know one reason to use the old pgtable is that it looks better because it fixes the real problem, but it is not a good way if it introduces more problems because it has to use oldmem. I will be glad if this is not a problem, but I have not been convinced. OTOH, there are many types of iommu: intel, amd, a lot of other types. They need their own fixes, so it does not look that elegant. The pci reset is not perfect, but it has another advantage: the patch is simpler. The problem I see from the old discussion is that resetting the bus in the 2nd kernel is acceptable but it does not fix things on the sparc platform. AFAIK the currently reported problems are on intel and amd iommu; at least the pci reset stuff does not make things worse. Thanks Dave
Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/10/15 at 04:42pm, Li, Zhen-Hua wrote: This patchset is an update of Bill Sumner's patchset, implements a fix for: If a kernel boots with intel_iommu=on on a system that supports intel vt-d, when a panic happens, the kdump kernel will boot with these faults: dmar: DRHD: handling fault status reg 102 dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8 DMAR:[fault reason 01] Present bit in root entry is clear dmar: DRHD: handling fault status reg 2 dmar: INTR-REMAP: Request device [61:00.0] fault index 42 INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear On some systems, the interrupt remapping fault will also happen even if intel_iommu is not set to on, because interrupt remapping will be enabled when x2apic is needed by the system. The cause of the DMA fault is described in Bill's original version, and the INTR-Remap fault is caused by a similar reason. In short, the initialization of the vt-d drivers causes the in-flight DMA and interrupt requests to get wrong responses. To fix this problem, we modify the behavior of the intel vt-d in the crashdump kernel: For DMA Remapping: 1. Accept the vt-d hardware in an active state, 2. Do not disable and re-enable the translation, keep it enabled. 3. Use the old root entry table, do not rewrite the RTA register. 4. Malloc and use a new context entry table, copy data from the old ones that were used by the old kernel. 5. Keep using the old page tables before the driver is loaded. 6. After the device driver is loaded, when it issues the first dma_map command, free the dmar_domain structure for this device, and generate a new one, so that the device can be assigned a new and empty page table. 7. When a new context entry table is generated, we also save its address to the old root entry table. For Interrupt Remapping: 1. Accept the vt-d hardware in an active state, 2. Do not disable and re-enable the interrupt remapping, keep it enabled. 3. 
Use the old interrupt remapping table, do not rewrite the IRTA register. 4. When an ioapic entry is set up, the interrupt remapping table is changed, and the updated data will be stored to the old interrupt remapping table. Advantages of this approach: 1. All manipulation of the IO-device is done by the Linux device-driver for that device. 2. This approach behaves in a manner very similar to operation without an active iommu. 3. Any activity between the IO-device and its RMRR areas is handled by the device-driver in the same manner as during a non-kdump boot. 4. If an IO-device has no driver in the kdump kernel, it is simply left alone. This supports the practice of creating a special kdump kernel without drivers for any devices that are not required for taking a crashdump. 5. Minimal code-changes among the existing mainline intel vt-d code. Summary of changes in this patch set: 1. Added some useful functions for the root entry table in intel-iommu.c 2. Added new members to struct root_entry and struct irte; 3. Functions to load the old root entry table to iommu->root_entry from the memory of the old kernel. 4. Functions to malloc a new context entry table and copy the data from the old ones to the malloced new ones. 5. Functions to enable support for DMA remapping in the kdump kernel. 6. Functions to load old irte data from the old kernel to the kdump kernel. 7. Some code changes that support other behaviors that have been listed. 8. In the new functions, use physical addresses as unsigned long type, not pointers. Original version by Bill Sumner: https://lkml.org/lkml/2014/1/10/518 https://lkml.org/lkml/2014/4/15/716 https://lkml.org/lkml/2014/4/24/836 Zhenhua's updates: https://lkml.org/lkml/2014/10/21/134 https://lkml.org/lkml/2014/12/15/121 https://lkml.org/lkml/2014/12/22/53 https://lkml.org/lkml/2015/1/6/1166 https://lkml.org/lkml/2015/1/12/35 https://lkml.org/lkml/2015/3/19/33 Changelog[v10]: 1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel(). 
Use one flag which stores the te and ir status in the last kernel: iommu->pre_enabled_trans iommu->pre_enabled_ir Changelog[v9]: 1. Add new function iommu_attach_domain_with_id. 2. Do not copy old page tables, keep using the old ones. 3. Remove functions: intel_iommu_did_to_domain_values_entry intel_iommu_get_dids_from_old_kernel device_to_domain_id copy_page_addr copy_page_table copy_context_entry copy_context_entry_table 4. Add new function device_to_existing_context_entry. Changelog[v8]: 1. Add a missing __iommu_flush_cache in function copy_page_table. Changelog[v7]: 1. Use __iommu_flush_cache to flush the data to hardware. Changelog[v6]: 1. Use unsigned long as type of physical address. 2. Use
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/07/15 at 11:46am, Dave Young wrote: On 04/05/15 at 09:54am, Baoquan He wrote: On 04/03/15 at 05:21pm, Dave Young wrote: On 04/03/15 at 05:01pm, Li, ZhenHua wrote: Hi Dave, There may be some possibilities that the old iommu data is corrupted by some other modules. Currently we do not have a better solution for the dmar faults. But I think when this happens, we need to fix the module that corrupted the old iommu data. I once met a similar problem in a normal kernel: the queue used by the qi_* functions was written again by another module. The fix was in that module, not in the iommu module. It is too late, there will be no chance to save the vmcore then. Also if it is possible to continue corrupting other areas of oldmem because of using the old iommu tables then it will cause more problems. So I think the tables at least need some verification before being used. Yes, it's good thinking about this, and verification is also an interesting idea. kexec/kdump do a sha256 calculation on the loaded kernel and then verify this again when panic happens in purgatory. This checks whether any code stomps into the region reserved for kexec/kernel and corrupts the loaded kernel. If this is decided to be done it should be an enhancement to the current patchset, not an approach change. Since this patchset is getting very close to the point the maintainers expected, maybe it can be merged first, and then we can think about enhancements. After all, without this patchset vt-d often raises error messages and hangs. It does not convince me; we should do it right at the beginning instead of introducing something wrong. I wonder why the old dma can not be remapped to a specific page in the kdump kernel so that it will not corrupt more memory. But I may have missed something; I will look for the old threads and catch up. I have read the old discussion; the above way was dropped because it could corrupt the filesystem. Apologies for commenting late. But the current solution sounds bad to me because of using old memory which is not reliable. 
Thanks Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/07/15 at 05:55pm, Li, ZhenHua wrote: On 04/07/2015 05:08 PM, Dave Young wrote: On 04/07/15 at 11:46am, Dave Young wrote: On 04/05/15 at 09:54am, Baoquan He wrote: On 04/03/15 at 05:21pm, Dave Young wrote: On 04/03/15 at 05:01pm, Li, ZhenHua wrote: Hi Dave, There may be some possibilities that the old iommu data is corrupted by some other modules. Currently we do not have a better solution for the dmar faults. But I think when this happens, we need to fix the module that corrupted the old iommu data. I once met a similar problem in a normal kernel: the queue used by the qi_* functions was written again by another module. The fix was in that module, not in the iommu module. It is too late, there will be no chance to save the vmcore then. Also if it is possible to continue corrupting other areas of oldmem because of using the old iommu tables then it will cause more problems. So I think the tables at least need some verification before being used. Yes, it's good thinking about this, and verification is also an interesting idea. kexec/kdump do a sha256 calculation on the loaded kernel and then verify this again when panic happens in purgatory. This checks whether any code stomps into the region reserved for kexec/kernel and corrupts the loaded kernel. If this is decided to be done it should be an enhancement to the current patchset, not an approach change. Since this patchset is getting very close to the point the maintainers expected, maybe it can be merged first, and then we can think about enhancements. After all, without this patchset vt-d often raises error messages and hangs. It does not convince me; we should do it right at the beginning instead of introducing something wrong. I wonder why the old dma can not be remapped to a specific page in the kdump kernel so that it will not corrupt more memory. But I may have missed something; I will look for the old threads and catch up. I have read the old discussion; the above way was dropped because it could corrupt the filesystem. Apologies for commenting late. 
But the current solution sounds bad to me because of using old memory which is not reliable. Thanks Dave Seems we do not have a better solution for the dmar faults. But I believe we can find out how to verify the iommu data which is located in old memory. That will be great, thanks. So there are two things: 1) make sure the old page tables are right, this is what we were talking about. 2) avoid writing old memory; I suppose only dma reads could corrupt the filesystem, right? So how about, for any dma writes, just creating a scratch page in 2nd kernel memory, and only using the old page table for dma reads. Thanks Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/05/15 at 09:54am, Baoquan He wrote: On 04/03/15 at 05:21pm, Dave Young wrote: On 04/03/15 at 05:01pm, Li, ZhenHua wrote: Hi Dave, There may be some possibilities that the old iommu data is corrupted by some other modules. Currently we do not have a better solution for the dmar faults. But I think when this happens, we need to fix the module that corrupted the old iommu data. I once met a similar problem in a normal kernel: the queue used by the qi_* functions was written again by another module. The fix was in that module, not in the iommu module. It is too late, there will be no chance to save the vmcore then. Also if it is possible to continue corrupting other areas of oldmem because of using the old iommu tables then it will cause more problems. So I think the tables at least need some verification before being used. Yes, it's good thinking about this, and verification is also an interesting idea. kexec/kdump do a sha256 calculation on the loaded kernel and then verify this again when panic happens in purgatory. This checks whether any code stomps into the region reserved for kexec/kernel and corrupts the loaded kernel. If this is decided to be done it should be an enhancement to the current patchset, not an approach change. Since this patchset is getting very close to the point the maintainers expected, maybe it can be merged first, and then we can think about enhancements. After all, without this patchset vt-d often raises error messages and hangs. It does not convince me; we should do it right at the beginning instead of introducing something wrong. I wonder why the old dma can not be remapped to a specific page in the kdump kernel so that it will not corrupt more memory. But I may have missed something; I will look for the old threads and catch up. Thanks Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/03/15 at 02:05pm, Li, Zhen-Hua wrote: The hardware will do some verification, but not completely. If people think the OS should also do this, then it should be another patchset, I think. If there is a chance to corrupt more memory I think it is not the right way. We should think about a better solution instead of fixing it later. Thanks Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 04/03/15 at 05:01pm, Li, ZhenHua wrote: Hi Dave, There may be some possibilities that the old iommu data is corrupted by some other modules. Currently we do not have a better solution for the dmar faults. But I think when this happens, we need to fix the module that corrupted the old iommu data. I once met a similar problem in a normal kernel: the queue used by the qi_* functions was written again by another module. The fix was in that module, not in the iommu module. It is too late, there will be no chance to save the vmcore then. Also if it is possible to continue corrupting other areas of oldmem because of using the old iommu tables then it will cause more problems. So I think the tables at least need some verification before being used. Thanks Zhenhua On 04/03/2015 04:40 PM, Dave Young wrote: To fix this problem, we modify the behavior of the intel vt-d in the crashdump kernel: For DMA Remapping: 1. Accept the vt-d hardware in an active state, 2. Do not disable and re-enable the translation, keep it enabled. 3. Use the old root entry table, do not rewrite the RTA register. 4. Malloc and use a new context entry table, copy data from the old ones that were used by the old kernel. Have not read all the patches, but I have a question, not sure this has been answered before. Old memory is not reliable, what if the old memory gets corrupted before panic? Is it safe to continue using it in the 2nd kernel? I worry that it will cause problems. Hope I'm wrong though. Thanks Dave
Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel
On 03/19/15 at 01:36pm, Li, Zhen-Hua wrote: This patchset is an update of Bill Sumner's patchset, implements a fix for: If a kernel boots with intel_iommu=on on a system that supports intel vt-d, when a panic happens, the kdump kernel will boot with these faults: dmar: DRHD: handling fault status reg 102 dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff8 DMAR:[fault reason 01] Present bit in root entry is clear dmar: DRHD: handling fault status reg 2 dmar: INTR-REMAP: Request device [61:00.0] fault index 42 INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear On some systems, the interrupt remapping fault will also happen even if intel_iommu is not set to on, because interrupt remapping will be enabled when x2apic is needed by the system. The cause of the DMA fault is described in Bill's original version, and the INTR-Remap fault is caused by a similar reason. In short, the initialization of the vt-d drivers causes the in-flight DMA and interrupt requests to get wrong responses. To fix this problem, we modify the behavior of the intel vt-d in the crashdump kernel: For DMA Remapping: 1. Accept the vt-d hardware in an active state, 2. Do not disable and re-enable the translation, keep it enabled. 3. Use the old root entry table, do not rewrite the RTA register. 4. Malloc and use a new context entry table, copy data from the old ones that were used by the old kernel. 5. Keep using the old page tables before the driver is loaded. 6. After the device driver is loaded, when it issues the first dma_map command, free the dmar_domain structure for this device, and generate a new one, so that the device can be assigned a new and empty page table. 7. When a new context entry table is generated, we also save its address to the old root entry table. For Interrupt Remapping: 1. Accept the vt-d hardware in an active state, 2. Do not disable and re-enable the interrupt remapping, keep it enabled. 3. 
Use the old interrupt remapping table, do not rewrite the IRTA register. 4. When an ioapic entry is set up, the interrupt remapping table is changed, and the updated data will be stored to the old interrupt remapping table. Advantages of this approach: 1. All manipulation of the IO-device is done by the Linux device-driver for that device. 2. This approach behaves in a manner very similar to operation without an active iommu. 3. Any activity between the IO-device and its RMRR areas is handled by the device-driver in the same manner as during a non-kdump boot. 4. If an IO-device has no driver in the kdump kernel, it is simply left alone. This supports the practice of creating a special kdump kernel without drivers for any devices that are not required for taking a crashdump. 5. Minimal code-changes among the existing mainline intel vt-d code. Summary of changes in this patch set: 1. Added some useful functions for the root entry table in intel-iommu.c 2. Added new members to struct root_entry and struct irte; 3. Functions to load the old root entry table to iommu->root_entry from the memory of the old kernel. 4. Functions to malloc a new context entry table and copy the data from the old ones to the malloced new ones. 5. Functions to enable support for DMA remapping in the kdump kernel. 6. Functions to load old irte data from the old kernel to the kdump kernel. 7. Some code changes that support other behaviors that have been listed. 8. In the new functions, use physical addresses as unsigned long type, not pointers. Original version by Bill Sumner: https://lkml.org/lkml/2014/1/10/518 https://lkml.org/lkml/2014/4/15/716 https://lkml.org/lkml/2014/4/24/836 Zhenhua's updates: https://lkml.org/lkml/2014/10/21/134 https://lkml.org/lkml/2014/12/15/121 https://lkml.org/lkml/2014/12/22/53 https://lkml.org/lkml/2015/1/6/1166 https://lkml.org/lkml/2015/1/12/35 Changelog[v9]: 1. Add new function iommu_attach_domain_with_id. 2. Do not copy old page tables, keep using the old ones. 3. 
Remove functions: intel_iommu_did_to_domain_values_entry intel_iommu_get_dids_from_old_kernel device_to_domain_id copy_page_addr copy_page_table copy_context_entry copy_context_entry_table 4. Add new function device_to_existing_context_entry. Changelog[v8]: 1. Add a missing __iommu_flush_cache in function copy_page_table. Changelog[v7]: 1. Use __iommu_flush_cache to flush the data to hardware. Changelog[v6]: 1. Use unsigned long as type of physical address. 2. Use new function unmap_device_dma to unmap the old dma. 3. Fix some small incorrect bit order for aw shift. Changelog[v5]: 1. Do not disable and re-enable translation and interrupt remapping. 2. Use old root entry table. 3. Use old interrupt
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Fri, Jun 14, 2013 at 10:11 AM, Takao Indoh indou.ta...@jp.fujitsu.com wrote: (2013/06/13 12:41), Bjorn Helgaas wrote: On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh indou.ta...@jp.fujitsu.com wrote: (2013/06/12 13:45), Bjorn Helgaas wrote: [+cc Vivek, Haren; sorry I didn't think to add you earlier] On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh indou.ta...@jp.fujitsu.com wrote: (2013/06/11 11:20), Bjorn Helgaas wrote: I'm not sure you need to reset legacy devices (or non-PCI devices) yet, but the current hook isn't anchored anywhere -- it's just an fs_initcall() that doesn't give the reader any clue about the connection between the reset and the problem it's solving. If we do something like this patch, I think it needs to be done at the point where we enable or disable the IOMMU. That way, it's connected to the important event, and there's a clue about how to make corresponding fixes for other IOMMUs. Ok. Is pci_iommu_init() an appropriate place to add this hook? I looked at various IOMMU init places today, and it's far more complicated and varied than I had hoped. This reset scheme depends on enumerating PCI devices before we initialize the IOMMU used by those devices. x86 works that way today, but not all architectures do (see the sparc pci_fire_pbm_init(), for Sorry, could you tell me which part depends on architecture? Your patch works if PCIe devices are reset before the kdump kernel enables the IOMMU. On x86, this is possible because PCI enumeration happens before the IOMMU initialization. On sparc, the IOMMU is initialized before PCI devices are enumerated, so there would still be a window where ongoing DMA could cause an IOMMU error. Ok, understood, thanks. Hmmm, it seems difficult to find a method which is common to all architectures. So, what I can do for now is introducing a reset scheme which is only for x86. 1) Change this patch so that it works only on the x86 platform. 
For example call this reset code from x86_init.iommu.iommu_init() instead of fs_initcall. Or another idea is: 2) Enumerate PCI devices in the IOMMU layer. That is: PCI layer: Just provide an interface to reset a given struct pci_dev. Maybe pci_reset_function() looks good for this purpose. IOMMU layer: Determine which devices should be reset. On kernel boot, check if the IOMMU is already active or not, and if active, check the IOMMU page table and reset devices whose entry exists there. Of course, it might be possible to reorganize the sparc code to do the IOMMU init *after* it enumerates PCI devices. But I think that change would be hard to justify. And I think even on x86, it would be better if we did the IOMMU init before PCI enumeration -- the PCI devices depend on the IOMMU, so logically the IOMMU should be initialized first so the PCI devices can be associated with it as they are enumerated. So a third idea is: 3) Do the reset before PCI enumeration (arch_initcall_sync or somewhere). We need to implement new code to enumerate PCI devices and reset them for this purpose. Idea 2 is not difficult to implement, but one problem is that this method may be dangerous. We need to scan the IOMMU page table used by the previous kernel, but it may be broken. Idea 3 seems to be difficult to implement... example). And I think conceptually, the IOMMU should be enumerated and initialized *before* the devices that use it. So I'm uncomfortable with that aspect of this scheme. It would be at least conceivable to reset the devices in the system kernel, before the kexec. I know we want to do as little as possible in the crashing kernel, but it's at least a possibility, and it might be cleaner. I bet this will not be accepted by the kdump maintainer. Everything in the panic kernel is unreliable. kdump is inherently unreliable. The kdump kernel doesn't start from an arbitrary machine state. We don't expect it to tolerate all CPUs running, for example. 
Maybe it shouldn't be expected to tolerate PCI devices running, either. What I wanted to say is that any resources of the first kernel are unreliable. In a panic situation, the struct pci_dev tree may be broken, or pci_lock may already be held by someone, etc. So, if we do this in the first kernel, maybe kdump needs its own code to enumerate PCI devices and reset them. Vivek? Ping Vivek Thanks, Takao Indoh -- Regards Dave