
Re: [Crash-utility][PATCH V4 1/9] Add RISCV64 framework code support

2022-11-02 Thread lijiang

On Fri, Oct 21, 2022 at 10:57 AM HAGIO KAZUHITO(萩尾　一仁)
 wrote:
>
> On 2022/10/21 11:42, Xianting Tian wrote:
> >
> > On 2022/10/21 10:17 AM, HAGIO KAZUHITO(萩尾 一仁) wrote:
> >> On 2022/10/20 10:50, Xianting Tian wrote:
> >>
> >>> diff --git a/README b/README
> >>> index 5abbce1..d589e72 100644
> >>> --- a/README
> >>> +++ b/README
> >>> @@ -37,7 +37,7 @@
> >>>  These are the current prerequisites:
> >>>  o  At this point, x86, ia64, x86_64, ppc64, ppc, arm, arm64, alpha, 
> >>> mips,
> >>> - mips64, s390 and s390x-based kernels are supported.  Other 
> >>> architectures
> >>> + mips64, riscv64, s390 and s390x-based kernels are supported.  Other 
> >>> architectures
> >>> may be addressed in the future.
> >> Sentences in the README are wrapped within 80 characters, I will change
> >
> > thanks,
> >
> > Do you need me to send V5 patch set to fix this?
>
> No, I will amend these when applying.
>

On the kernel side, some of the relevant kernel patches have been acked, so
it seems they won't change anymore.

And the V4 looks good to me, so: Ack.

Thanks.
Lianbo

> Thanks,
> Kazu
>
> >
> >> this to:
> >>
> >> + mips64, riscv64, s390 and s390x-based kernels are supported.  Other
> >> + architectures may be addressed in the future.
> >>
> >>>  o  One size fits all -- the utility can be run on any Linux kernel 
> >>> version
> >>> @@ -98,6 +98,8 @@
> >>> arm64 dumpfiles may be built by typing "make target=ARM64".
> >>>  o  On an x86_64 host, an x86_64 binary that can be used to analyze
> >>> ppc64le dumpfiles may be built by typing "make target=PPC64".
> >>> +  o  On an x86_64 host, an x86_64 binary that can be used to analyze
> >>> + riscv64 dumpfiles may be built by typing "make target=RISCV64".
> >>>  Traditionally when vmcores are compressed via the makedumpfile(8) 
> >>> facility
> >>>  the libz compression library is used, and by default the crash 
> >>> utility
> >>
> >>> diff --git a/help.c b/help.c
> >>> index 99214c1..253c71b 100644
> >>> --- a/help.c
> >>> +++ b/help.c
> >>> @@ -9512,7 +9512,7 @@ char *README[] = {
> >>>"  These are the current prerequisites: ",
> >>>"",
> >>>"  o  At this point, x86, ia64, x86_64, ppc64, ppc, arm, arm64, alpha, 
> >>> mips,",
> >>> -" mips64, s390 and s390x-based kernels are supported.  Other 
> >>> architectures",
> >>> +" mips64, riscv64, s390 and s390x-based kernels are supported.  
> >>> Other architectures",
> >>>" may be addressed in the future.",
> >>>"",
> >>>"  o  One size fits all -- the utility can be run on any Linux kernel 
> >>> version",
> >> Same as above.
> >>
> >> And help.c lacks this part, will add:
> >>
> >> @@ -9572,6 +9572,8 @@ README_ENTER_DIRECTORY,
> >>" arm64 dumpfiles may be built by typing \"make target=ARM64\".",
> >>"  o  On an x86_64 host, an x86_64 binary that can be used to analyze",
> >>" ppc64le dumpfiles may be built by typing \"make target=PPC64\".",
> >> +"  o  On an x86_64 host, an x86_64 binary that can be used to analyze",
> >> +" riscv64 dumpfiles may be built by typing \"make target=RISCV64\".",
> >>"",
> >>"  Traditionally when vmcores are compressed via the makedumpfile(8) 
> >> facility",
> >>"  the libz compression library is used, and by default the crash 
> >> utility",
> >>
> >>
> >> With these, the v4 crash patch set looks good to me.
> >>
> >> Acked-by: Kazuhito Hagio 
> >>
> >> Thanks,
> >> Kazu


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-12 Thread lijiang

Thank you for the comment, H. Peter, Andy and Baoquan.

On 2021-04-12 17:52, Baoquan He wrote:
> On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
>>
>>
>>> On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
>>>
>>> On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
>>>> Why don't we do this unconditionally? At the very best we gain half a 
>>>> megabyte of memory (except the trampoline, which has to live there, but it 
>>>> is only a few kilobytes.)
>>>
>>> This is a great suggestion, thanks. I think we can fix it in this way to
>>> make code simpler. Then the specific caring of real mode in
>>> efi_free_boot_services() can be removed too.
>>>
>>
>> This whole situation makes me think that the code is buggy before and buggy 
>> after.
>>
>> The issue here (I think) is that various pieces of code want to reserve 
>> specific pieces of otherwise-available low memory for their own nefarious 
>> uses. I don’t know *why* crash kernel needs this, but that doesn’t matter 
>> too much.
> 
> The kdump kernel also needs to go through the real mode code path during
> bootup. It is no different from a normal kernel except that it skips the
> firmware reset. So the kdump kernel needs the low 1M as system RAM just as
> a normal kernel does. Here we reserve the whole low 1M with
> memblock_reserve() to keep any later kernel or driver data from residing in
> this area. Otherwise, we would need to dump the content of this area to the
> vmcore. As we know, when a crash happens, the old memory of the 1st kernel
> should stay untouched until vmcore dumping has read out its content.
> Meanwhile, the kdump kernel needs to reuse the low 1M. In the past, we used
> a backup region to copy out the low 1M area, and mapped the backup region
> into the low 1M area in the vmcore ELF file. In 6f599d84231fd27
> ("x86/kdump: Always reserve the low 1M when the crashkernel option is
> specified"), we changed to locking the whole low 1M to avoid writing any
> kernel data into it; this way we can skip this area when dumping the vmcore.
> 
> Above is why we try to memblock reserve the whole low 1M. We don't want
> to use it, just don't want anyone to use it in 1st kernel.
> 
>>
>> I propose that the right solution is to give low-memory-reserving code paths 
>> two chances to do what they need: once at the very beginning and once after 
>> EFI boot services are freed.
>>
>> Alternatively, just reserve *all* otherwise unused sub 1M memory up front, 
>> then release it right after releasing boot services, and then invoke the 
>> special cases exactly once.
> 

I'm worried that waiting until EFI boot services are freed is a bit late. All
sub-1M memory regions need to be reserved as early as possible.

> I am not sure I understood both suggested ways clearly. They look a little
> complicated for our case. As I explained above, we want the whole low
> 1M locked up, not one piece or some pieces of it.
> 
>>
>> In either case, the result is that the crashkernel mess gets unified with 
>> the trampoline mess.  One way the result is called twice and needs to be 
>> more careful, and the other way it’s called only once.
>>

That still leaves a chance for memory to be allocated from sub-1M regions at
some point, because EFI boot services are only freed after EFI enters virtual
mode, which looks late.

>> Just skipping freeing boot services seems wrong.  It doesn’t unmap boot 
>> services, and skipping that is incorrect, I think. And it seems to result in 
>> a bogus memory map in which the system thinks that some crashkernel memory 
>> is EFI memory instead.
> 
> I like hpa's idea of locking the whole low 1M unconditionally, since only
> a few KB besides the trampoline area are there. Thinking about it again,
> doing it in can_free_region() may be risky because an EFI memory region
> could cross the 1M boundary, e.g. [640K, 100M] with a type of
> EFI_BOOT_SERVICES_CODE|EFI_BOOT_SERVICES_DATA, which could cause loss of
> memory.

Theoretically, yes. But so far I haven't seen a region that crosses the 1M
boundary.

Thanks.
Lianbo

> Just a wild guess, I am not sure whether crossing the 1M boundary can really
> happen. efi_reserve_boot_services() won't split regions.
> 
> If moving efi_reserve_boot_services() after reserve_real_mode() is not
> accepted, maybe we can call efi_mem_reserve(0, 1M) just as
> efi_esrt_init() has done.
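
The boundary-crossing concern above can be illustrated with a small sketch. It is not kernel code: `struct region_sketch` and `freeable_above_1m()` are hypothetical names, and the only assumption is that a region such as [640K, 100M) would be split at 1M so that the part above the boundary can still be freed while the part below stays reserved.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1M_SKETCH 0x100000ULL

/* Hypothetical half-open physical range [start, end). */
struct region_sketch {
    uint64_t start;  /* inclusive */
    uint64_t end;    /* exclusive */
};

/*
 * Compute the part of `in` that lies at or above 1 MiB.
 * Returns false when nothing above the boundary is left to free.
 */
static bool freeable_above_1m(struct region_sketch in,
                              struct region_sketch *out)
{
    if (in.start >= in.end || in.end <= SZ_1M_SKETCH)
        return false;
    out->start = in.start < SZ_1M_SKETCH ? SZ_1M_SKETCH : in.start;
    out->end = in.end;
    return true;
}
```

With this split, skipping the whole [640K, 100M) region would no longer be necessary; only [640K, 1M) would stay locked.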
> 



Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-09 Thread lijiang

Hi, Baoquan
Thank you for the comment.
On 2021-04-09 20:44, Baoquan He wrote:
> On 04/07/21 at 10:03pm, Lianbo Jiang wrote:
>> Some sub-1MB memory regions may be reserved by EFI boot services, and the
>> memory regions will be released later in the efi_free_boot_services().
>>
>> Currently, all sub-1MB memory regions are always reserved when the
>> crashkernel option is specified. Unfortunately, EFI boot services may have
>> already reserved some sub-1MB memory regions before crash_reserve_low_1M()
>> is called, so crash_reserve_low_1M() only owns the remaining sub-1MB memory
>> regions, not all of them, because EFI boot services will subsequently free
>> its own sub-1MB memory regions. Eventually, DMA will be able to allocate
>> memory from the sub-1MB area and cause the following error:
>>
> 
> So this patch fixes a problem found with the crash utility. We once met
> a similar issue, which was later fixed by always reserving the low 1M in
> commit 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the
> crashkernel option is specified"). It seems that commit did not fix it
> completely.
> 
Maybe I should add the "Fixes: 6f599d84231f" in front of 'Signed-off-by' as 
below:

Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel 
option is specified")

>> crash> kmem -s |grep invalid
>> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
>> 9403c0067300
>> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
>> 9403c0067300
>> crash> vtop 9403c0067300
>> VIRTUAL   PHYSICAL
>> 9403c0067300  67300   --->The physical address falls into this range 
>> [0x00063000-0x0008efff]
>>
>> kernel debugging log:
>> ...
>> [0.008927] memblock_reserve: [0x0001-0x00013fff] 
>> efi_reserve_boot_services+0x85/0xd0
>> [0.008930] memblock_reserve: [0x00063000-0x0008efff] 
>> efi_reserve_boot_services+0x85/0xd0
>> ...
>> [0.009425] memblock_reserve: [0x-0x000f] 
>> crash_reserve_low_1M+0x2c/0x49
>> ...
>> [0.010586] Zone ranges:
>> [0.010587]   DMA  [mem 0x1000-0x00ff]
>> [0.010589]   DMA32[mem 0x0100-0x]
>> [0.010591]   Normal   [mem 0x0001-0x000c7fff]
>> [0.010593]   Device   empty
>> ...
>> [8.814894] __memblock_free_late: [0x00063000-0x0008efff] 
>> efi_free_boot_services+0x14b/0x23b
>> [8.815793] __memblock_free_late: [0x0001-0x00013fff] 
>> efi_free_boot_services+0x14b/0x23b
> 
> 
> In commit 6f599d84231fd27, we call crash_reserve_low_1M() to lock the
> whole low 1M area if crashkernel is specified in kernel cmdline.
> But the earlier efi_reserve_boot_services() invocation defeats the
> intention of reserving the whole low 1M. In efi_reserve_boot_services(),
> if any memory under low 1M hasn't been reserved, it will call
> memblock_reserve() to reserve it and leave it to
> efi_free_boot_services() to free.
> 

Good understanding.
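
The ordering problem described above can be replayed with a toy model. This is only a sketch: the per-page flag array stands in for memblock's reserved list, all names are hypothetical, and the one range taken from the debugging log is [0x00063000-0x0008efff]. The point it shows is that a later free of the EFI range also drops the crash reservation over the same pages.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LOW_1M_PAGES  256u   /* 1 MiB divided into 4 KiB pages */
#define PAGE_SHIFT_4K 12u

/* Toy stand-in for the reserved list: one flag per page below 1 MiB. */
struct low_mem_model {
    bool reserved[LOW_1M_PAGES];
};

/* Mark the inclusive byte range [start, end] as reserved or free. */
static void model_set(struct low_mem_model *m, unsigned long start,
                      unsigned long end, bool reserved)
{
    for (unsigned long pfn = start >> PAGE_SHIFT_4K;
         pfn <= (end >> PAGE_SHIFT_4K) && pfn < LOW_1M_PAGES; pfn++)
        m->reserved[pfn] = reserved;
}

/* Count pages below 1 MiB that an allocator could still hand out. */
static size_t model_free_pages(const struct low_mem_model *m)
{
    size_t n = 0;

    for (size_t pfn = 0; pfn < LOW_1M_PAGES; pfn++)
        if (!m->reserved[pfn])
            n++;
    return n;
}

/* Replay the order seen in the log: EFI reserve, crash reserve, EFI free. */
static size_t replay_boot_order(void)
{
    struct low_mem_model m;

    memset(&m, 0, sizeof(m));
    model_set(&m, 0x63000, 0x8efff, true);   /* efi_reserve_boot_services() */
    model_set(&m, 0x00000, 0xfffff, true);   /* crash_reserve_low_1M() */
    model_set(&m, 0x63000, 0x8efff, false);  /* efi_free_boot_services() */
    return model_free_pages(&m);
}
```

In this model the pages covering 0x63000 to 0x8efff end up free again, which matches the physical address 67300 reported by vtop falling into that range.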

> Hi Lianbo,
> 
> Please correct me if I am wrong or anything is missed. IIUC, can we move
> efi_reserve_boot_services() after reserve_real_mode() to fix this bug?

What do you think about the following changes?

patch [1]:

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 5ecd69a48393..c343de3178ec 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1064,12 +1064,6 @@ void __init setup_arch(char **cmdline_p)
efi_esrt_init();
efi_mokvar_table_init();
 
-   /*
-* The EFI specification says that boot service code won't be
-* called after ExitBootServices(). This is, in fact, a lie.
-*/
-   efi_reserve_boot_services();
-
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();
 
@@ -1087,6 +1081,12 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();
 
+   /*
+* The EFI specification says that boot service code won't be
+* called after ExitBootServices(). This is, in fact, a lie.
+*/
+   efi_reserve_boot_services();
+
init_mem_mapping();
 
idt_setup_early_pf();

> Or move reserve_real_mode() before efi_reserve_boot_services() since
> those real mode regions are all under 1M? Assume efi boot code/data

Or patch [2]

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 5ecd69a48393..ceec5af0dfab 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1058,6 +1058,7 @@ void __init setup_arch(char **cmdline_p)
sev_setup_arch();
 
reserve_bios_regions();
+   reserve_real_mode();
 
efi_fake_memmap();
efi_find_mirror();
@@ -1082,8 +1083,6 @@ void __init setup_arch(char **cmdline_p)
(max_pfn_mapped< won't rely on low 1M area any more at this moment.
>

Re: [MAKDUMPFILE PATCH] Add option to estimate the size of vmcore dump files

2020-11-17 Thread lijiang

Hi, Kazu, Julien and Bhupesh

On 2020-10-28 16:32, HAGIO KAZUHITO(萩尾　一仁) wrote:
> I'm rethinking about what command options makedumpfile should have.
> If once we add an option to makedumpfile, we cannot change it easily,
> so I'd like to think carefully.
> 
> The calculated size might be useful if it's printed so that it can be
> easily post-processed by scripts, e.g. for automated tests.  If so,
> makedumpfile already prints its statistics with "--message-level 16",
> and it might be useful to also print them by an option like "--show-stats".
> 
>   # makedumpfile --show-stats -l -d 31 vmcore dump.ld31
>   total_pages xxx
>   excluded_pages yyy
>   ...
>   write_bytes zzz
> 
> Also, if we have a "--dry-run" option that does not actually write anything,
> it's explicit and meets Bhupesh's use case.  What do you think?
> 

It seems that adding a statistical option could be better than nothing.

Do you have any decisions on this issue? Or any thoughts?


Thanks
Lianbo

> Thanks,
> Kazu
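
To illustrate the "easily post-processed" point, here is a minimal sketch of a reader for the key/value layout floated above. The layout (one `key value` pair per line, with names such as `write_bytes`) is only the proposal from this thread with placeholder values, not a confirmed makedumpfile output format, and `find_stat()` is a hypothetical helper.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up "key value" in text that holds one pair per line.
 * Returns 1 and stores the value on success, 0 if the key is absent.
 */
static int find_stat(const char *text, const char *key,
                     unsigned long long *value)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Skip indentation in front of the key. */
        while (*line == ' ' || *line == '\t')
            line++;
        /* The key must be followed by a separator, so "total" != "total_pages". */
        if (strncmp(line, key, klen) == 0 &&
            (line[klen] == ' ' || line[klen] == '\t') &&
            sscanf(line + klen, "%llu", value) == 1)
            return 1;
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return 0;
}
```

A script or test harness could then compare the parsed byte count against the free space on the dump target.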



Re: [MAKDUMPFILE PATCH] Add option to estimate the size of vmcore dump files

2020-11-02 Thread lijiang

On 2020-10-30 14:29, HAGIO KAZUHITO(萩尾　一仁) wrote:
> -----Original Message-----
>> On 2020-10-28 16:32, HAGIO KAZUHITO(萩尾　一仁) wrote:
>>> Hi Julien,
>>>
>>> sorry for my delayed reply.
>>>
>>> -----Original Message-----
>>>>> A user might want to know how much space a vmcore file will take on
>>>>> the system and how much space on their disk should be available to
>>>>> save it during a crash.
>>>>>
>>>>> The option --vmcore-size does not create the vmcore file but provides
>>>>> an estimation of the size of the final vmcore file created with the
>>>>> same make dumpfile options.
>>>>>
>>>>> Interesting.  Do you have any actual use case?  e.g. used by kdumpctl?
>>>>> or use it in kdump initramfs?
>>>>>
>>>>
>>>> Yes, the idea would be to use this in mkdumprd to have a more accurate
>>>> estimate of the dump size (currently it cannot take compression into
>>>> account and warns about potential lack of space, considering the system
>>>> memory size as a whole).
>>>
>>> Hmm, I'm not sure how you are going to implement in mkdumprd, but I do not
>>> recommend that you use it to determine how much disk space should be
>>> allocated for crash dump.  Because, I think that
>>>
>>> - It cannot estimate the dump size when a real crash occurs, e.g. if slab
>>> explodes with non-zero data, almost all memory will be captured by 
>>> makedumpfile
>>
>> I agree with you, but could this be rare? If so, I'm not sure it is worth
>> spending more thought on such rare situations.
> 
> Cases that a dumpfile is inflated with -d 31 might be rare, but if users
> need user data, e.g. for gcore, underestimation will occur easily.
> 
Yes, that's true.

>>
>>> even with -d 31, and compression ratio varies with data in memory.
>>
>> Indeed.
>>
>>> Also, in most cases, mkdumprd runs at boot time or construction phase
>>> with less memory usage, not at usual application running time.  So it
>>> can underestimate the needed size easily.
>>>
>> If the administrator can monitor the estimated size periodically, maybe it
>> won't be a problem?
> 
> I think most of them cannot or do not do that, and even if they could do,
> when a panic occurs by an unknown problem, can you depend on that estimation?
> 
This requires users to evaluate the risk themselves. The tool only provides a
reference value at a certain point in time, and reminds users of such risks.

>>
>>> - The system might need a full vmcore and need to change makedumpfile's
>>> dump level for an issue in the future.  But many systems cannot change
>>> their disk space allocation easily.  So we should prevent users from
>>> having minimum disk space for crash dump.
>>>
>>> So, the following is from mkdumprd on Fedora 32, personally I think this
>>> is good for now.
>>>
>>> if [ $avail -lt $memtotal ]; then
>>> echo "Warning: There might not be enough space to save a vmcore."
>>> echo " The size of $2 should be greater than $memtotal kilo 
>>> bytes."
>>> fi
>>>
>> Currently, some users are complaining that mkdumprd overestimates the
>> needed size, and most vmcores are significantly smaller than the size of
>> system memory.
>>
>> Furthermore, in most cases, the system memory will not be completely
>> exhausted, but that still depends on how the memory is used in the system,
>> for example:
>> [1] running a memory stress test
>> [2] a workload that always occupies a large amount of memory and never
>>     releases it.
>>
>> These two cases may be rare.
> 
> I've seen and worked on thousands of support cases, memory is exhausted
> easily and unexpectedly..  Especially nowadays I often see panics by
> vm.panic_on_oom.
> 
>> Therefore, can we find a compromise between the size of the vmcore and the
>> system memory so that makedumpfile can estimate the size of the vmcore more
>> accurately?
>>
>> And finally, mkdumprd can use the estimated size of the vmcore instead of
>> the system memory (memtotal) to determine whether the target disk has
>> enough space to store the vmcore.
> 
> The current mkdumprd just warns the possibility of lack of space,
> it doesn't fail.  I think this is a good balance.
> 
> Users can choose the estimated size over the whole memory size with
> their discretion.  Providing the useful estimation tool for them
> might be good.
> 
> But, if we do so, we should let users know the tradeoff between the
> disk space and the risk of failure.  So I believe that we should
> continue to warn the possibility of failure of capturing vmcore
> with less space than the whole memory.
> 
We have the same understanding of this issue. Maybe we could add
documentation to explain the details.

Thanks.
Lianbo

> Thanks,
> Kazu
> 
> 
>>
>>
>> Thanks.
>> Lianbo
>>
>>> The patch's functionality itself might be useful and I don't reject, though.
>>>
>>>>> @@ -4643,6 +4706,8 @@ write_buffer(int fd, off_t offset, void *buf, 
>>>>> size_t buf_size, char *file_name)
>>>>>   }
>>>>>   if (!write_and_check_space(fd, , sizeof(fdh), 
>>>>>

Re: [MAKDUMPFILE PATCH] Add option to estimate the size of vmcore dump files

2020-10-29 Thread lijiang

On 2020-10-28 16:32, HAGIO KAZUHITO(萩尾　一仁) wrote:
> Hi Julien,
> 
> sorry for my delayed reply.
> 
> -----Original Message-----
>>> A user might want to know how much space a vmcore file will take on
>>> the system and how much space on their disk should be available to
>>> save it during a crash.
>>>
>>> The option --vmcore-size does not create the vmcore file but provides
>>> an estimation of the size of the final vmcore file created with the
>>> same make dumpfile options.
>>>
>>> Interesting.  Do you have any actual use case?  e.g. used by kdumpctl?
>>> or use it in kdump initramfs?
>>>
>>
>> Yes, the idea would be to use this in mkdumprd to have a more accurate
>> estimate of the dump size (currently it cannot take compression into
>> account and warns about potential lack of space, considering the system
>> memory size as a whole).
> 
> Hmm, I'm not sure how you are going to implement in mkdumprd, but I do not
> recommend that you use it to determine how much disk space should be
> allocated for crash dump.  Because, I think that
> 
> - It cannot estimate the dump size when a real crash occurs, e.g. if slab
> explodes with non-zero data, almost all memory will be captured by 
> makedumpfile

I agree with you, but could this be rare? If so, I'm not sure it is worth
spending more thought on such rare situations.

> even with -d 31, and compression ratio varies with data in memory.

Indeed.

> Also, in most cases, mkdumprd runs at boot time or construction phase
> with less memory usage, not at usual application running time.  So it
> can underestimate the needed size easily.
> 
If the administrator can monitor the estimated size periodically, maybe it
won't be a problem?
 
> - The system might need a full vmcore and need to change makedumpfile's
> dump level for an issue in the future.  But many systems cannot change
> their disk space allocation easily.  So we should prevent users from
> having minimum disk space for crash dump.
> 
> So, the following is from mkdumprd on Fedora 32, personally I think this
> is good for now.
> 
> if [ $avail -lt $memtotal ]; then
> echo "Warning: There might not be enough space to save a vmcore."
> echo " The size of $2 should be greater than $memtotal kilo 
> bytes."
> fi
> 
Currently, some users are complaining that mkdumprd overestimates the needed
size, and most vmcores are significantly smaller than the size of system
memory.

Furthermore, in most cases, the system memory will not be completely
exhausted, but that still depends on how the memory is used in the system,
for example:
[1] running a memory stress test
[2] a workload that always occupies a large amount of memory and never
    releases it.

These two cases may be rare. Therefore, can we find a compromise between the
size of the vmcore and the system memory so that makedumpfile can estimate
the size of the vmcore more accurately?

And finally, mkdumprd can use the estimated size of the vmcore instead of the
system memory (memtotal) to determine whether the target disk has enough
space to store the vmcore.
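
One possible reading of such a compromise, sketched under assumptions: keep the existing full-memory comparison quoted above (`$avail -lt $memtotal`) and add the estimate as a second threshold. The enum, the function name and the three-way result are hypothetical and are not part of mkdumprd or makedumpfile.

```c
#include <stdint.h>

/* Hypothetical outcome of comparing free space with both thresholds. */
enum space_verdict {
    SPACE_OK,             /* room even for a dump as large as system memory */
    SPACE_ESTIMATE_ONLY,  /* fits the estimate, but not a full-memory dump */
    SPACE_TOO_SMALL       /* does not even fit the estimate */
};

/* All sizes are in kilobytes, like memtotal in the mkdumprd snippet. */
static enum space_verdict check_dump_space(uint64_t avail_kb,
                                           uint64_t memtotal_kb,
                                           uint64_t estimate_kb)
{
    if (avail_kb >= memtotal_kb)
        return SPACE_OK;
    if (avail_kb >= estimate_kb)
        return SPACE_ESTIMATE_ONLY;
    return SPACE_TOO_SMALL;
}
```

A caller could keep printing the current warning for the middle case and use a stronger message only for the last one, so the risk of underestimation stays visible to the user.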


Thanks.
Lianbo

> The patch's functionality itself might be useful and I don't reject, though.
> 
>>> @@ -4643,6 +4706,8 @@ write_buffer(int fd, off_t offset, void *buf, 
>>> size_t buf_size, char *file_name)
>>>   }
>>>   if (!write_and_check_space(fd, &fdh, sizeof(fdh), 
>>> file_name))
>>>   return FALSE;
>>> +   } else if (info->flag_vmcore_size && fd == info->fd_dumpfile) {
>>> +   return write_buffer_update_size_info(offset, buf, 
>>> buf_size);
>>>
>>> Why do we need this function?  makedumpfile actually writes zero-filled
>>> pages to the dumpfile with -d 0, and doesn't write them with -d 1.
>>> So isn't "write_bytes += buf_size" enough?  For example, with -d 30,
>>>
>>
>> The reason I went with this method was to make an estimate of the number
>> of blocks actually allocated on the disk (since depending on how the
>> data written is scattered in the file, there might be a significant
>> difference between bytes written vs actual size allocated on disk). But
>> I realize that there is some misunderstanding on my end, since written
>> zeros do cause block allocation, as opposed to not writing at some offset
>> (skipping it with lseek()); I would need to fix that.
>>
>> To highlight the behaviour I'm talking about:
>> $ dd if=/dev/zero of=./testfile bs=4096 count=1 seek=1
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000302719 s, 13.5 MB/s
>> $ du -h testfile
>> 4.0K testfile
>>
>> $ dd if=/dev/zero of=./testfile bs=4096 count=2
>> 2+0 records in
>> 2+0 records out
>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000373002 s, 22.0 MB/s
>> $ du -h testfile
>> 8.0K testfile
>>
>>
>> So, do you think it's not worth bothering to estimate the number of
>> blocks allocated and that I should only consider the number of bytes written?
> 
> Yes, makedumpfile
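
The dd experiment quoted above can also be reproduced from C. This is a sketch, assuming a POSIX system with a writable /tmp; `zero_file_usage()` is a hypothetical helper. It writes zeros either after an lseek() hole or from offset 0 and reports the apparent size together with the allocated 512-byte blocks from fstat().

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a temporary file, skip `hole` bytes with lseek(), then write
 * `len` zero bytes. On success returns 0 and fills in the apparent size
 * and the number of 512-byte blocks allocated for the file.
 */
static int zero_file_usage(off_t hole, size_t len, off_t *size,
                           long long *blocks)
{
    char path[] = "/tmp/sparse-sketch-XXXXXX";
    struct stat st;
    char *buf;
    int fd, ret = -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* the file goes away once fd is closed */

    buf = calloc(1, len);
    if (buf &&
        lseek(fd, hole, SEEK_SET) == hole &&
        write(fd, buf, len) == (ssize_t)len &&
        fsync(fd) == 0 &&
        fstat(fd, &st) == 0) {
        *size = st.st_size;
        *blocks = (long long)st.st_blocks;
        ret = 0;
    }
    free(buf);
    close(fd);
    return ret;
}
```

Calling it with (4096, 4096) and with (0, 8192) mirrors the two dd invocations: both files have the same apparent size, while the allocated block count depends on the filesystem and is typically smaller for the file with the hole.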

Re: [MAKDUMPFILE PATCH] Add option to estimate the size of vmcore dump files

2020-10-13 Thread lijiang

On 2020-10-13 17:27, Bhupesh Sharma wrote:
> Hello Julien,
> 
> Thanks for the patch. Some nitpicks inline:
> 
> On Mon, Oct 12, 2020 at 12:39 PM Julien Thierry  wrote:
>>
>> A user might want to know how much space a vmcore file will take on
>> the system and how much space on their disk should be available to
>> save it during a crash.
>>
>> The option --vmcore-size does not create the vmcore file but provides
>> an estimation of the size of the final vmcore file created with the
>> same make dumpfile options.
>>
>> Signed-off-by: Julien Thierry 
>> Cc: Kazuhito Hagio 
>> ---
>>  makedumpfile.c | 98 --
>>  makedumpfile.h | 12 +++
>>  print_info.c   |  4 +++
>>  3 files changed, 111 insertions(+), 3 deletions(-)
> 
> Please update 'makedumpfile.8' as well in v2, so that the man page can
> document the newly added option and how to use it to determine the
> vmcore-size.
> 
>> diff --git a/makedumpfile.c b/makedumpfile.c
>> index 4c4251e..0a2bfba 100644
>> --- a/makedumpfile.c
>> +++ b/makedumpfile.c
>> @@ -26,6 +26,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
> 
> I know we don't follow alphabetical order for include files in
> makedumpfile code, but it would be good to place the new - ones
> accordingly. So  can go with  here.
> 
>>  struct symbol_tablesymbol_table;
>>  struct size_table  size_table;
>> @@ -1366,7 +1367,25 @@ open_dump_file(void)
>> if (!info->flag_force)
>> open_flags |= O_EXCL;
>>
>> -   if (info->flag_flatten) {
>> +   if (info->flag_vmcore_size) {
>> +   char *namecpy;
>> +   struct stat statbuf;
>> +   int res;
>> +
>> +   namecpy = strdup(info->name_dumpfile ?
>> +info->name_dumpfile : ".");
>> +
>> +   res = stat(dirname(namecpy), &statbuf);
>> +   free(namecpy);
>> +   if (res != 0)
>> +   return FALSE;
>> +
>> +   fd = -1;
>> +   info->dumpsize_info.blksize = statbuf.st_blksize;
>> +   info->dumpsize_info.block_buff_size = BASE_NUM_BLOCKS;
>> +   info->dumpsize_info.block_info = calloc(BASE_NUM_BLOCKS, 1);
>> +   info->dumpsize_info.non_hole_blocks = 0;
>> +   } else if (info->flag_flatten) {
>> fd = STDOUT_FILENO;
>> info->name_dumpfile = filename_stdout;
>> } else if ((fd = open(info->name_dumpfile, open_flags,
>> @@ -1384,6 +1403,9 @@ check_dump_file(const char *path)
>>  {
>> char *err_str;
>>
>> +   if (info->flag_vmcore_size)
>> +   return TRUE;
>> +
>> if (access(path, F_OK) != 0)
>> return TRUE; /* File does not exist */
>> if (info->flag_force) {
>> @@ -4622,6 +4644,47 @@ write_and_check_space(int fd, void *buf, size_t 
>> buf_size, char *file_name)
>> return TRUE;
>>  }
>>
>> +static int
>> +write_buffer_update_size_info(off_t offset, void *buf, size_t buf_size)
>> +{
>> +   struct dumpsize_info *dumpsize_info = &info->dumpsize_info;
>> +   int blk_end_idx = (offset + buf_size - 1) / dumpsize_info->blksize;
>> +   int i;
>> +
>> +   /* Need to grow the dumpsize block buffer? */
>> +   if (blk_end_idx >= dumpsize_info->block_buff_size) {
>> +   int alloc_size = MAX(blk_end_idx - 
>> dumpsize_info->block_buff_size, BASE_NUM_BLOCKS);
>> +
>> +   dumpsize_info->block_info = 
>> realloc(dumpsize_info->block_info,
>> +   
>> dumpsize_info->block_buff_size + alloc_size);
>> +   if (!dumpsize_info->block_info) {
>> +   ERRMSG("Not enough memory\n");
>> +   return FALSE;
>> +   }
>> +
>> +   memset(dumpsize_info->block_info + 
>> dumpsize_info->block_buff_size,
>> +  0, alloc_size);
>> +   dumpsize_info->block_buff_size += alloc_size;
>> +   }
>> +
>> +   for (i = 0; i < buf_size; ++i) {
>> +   int blk_idx = (offset + i) / dumpsize_info->blksize;
>> +
>> +   if (dumpsize_info->block_info[blk_idx]) {
>> +   i += dumpsize_info->blksize;
>> +   i = i - (i % dumpsize_info->blksize) - 1;
>> +   continue;
>> +   }
>> +
>> +   if (((char *) buf)[i] != 0) {
>> +   dumpsize_info->non_hole_blocks++;
>> +   dumpsize_info->block_info[blk_idx] = 1;
>> +   }
>> +   }
>> +
>> +   return TRUE;
>> +}
>> +
>>  int
>>  write_buffer(int fd, off_t offset, void *buf, size_t buf_size, char 
>> *file_name)
>>  {
>> @@ -4643,6 +4706,8 @@ write_buffer(int fd, off_t offset, void *buf, size_t 
>> buf_size, char *file_name)
>> }
>> if (!write_and_check_space(fd, &fdh, sizeof(fdh), file_name))
>>
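
For readers following the block-tracking idea in write_buffer_update_size_info(), here is a simplified standalone sketch of the same technique: one flag per filesystem block, set the first time non-zero data lands in it. It is not the patch's code. The struct and function names are hypothetical, the flag array is always sized to the last touched index plus one, and the scan is byte by byte without the skip-ahead step used in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tracker: which blocks of the output hold non-zero data. */
struct hole_tracker {
    size_t blksize;          /* filesystem block size in bytes */
    size_t nblocks;          /* number of flags in `used` */
    unsigned char *used;     /* used[i] != 0: block i is not a hole */
    size_t non_hole_blocks;  /* count of flags that are set */
};

/* Grow the flag array so that index `last` is valid; new flags are zeroed. */
static int tracker_reserve(struct hole_tracker *t, size_t last)
{
    unsigned char *p;
    size_t want = last + 1;

    if (last < t->nblocks)
        return 0;
    p = realloc(t->used, want);
    if (!p)
        return -1;
    memset(p + t->nblocks, 0, want - t->nblocks);
    t->used = p;
    t->nblocks = want;
    return 0;
}

/* Record a write of `len` bytes at `offset`; all-zero blocks stay holes. */
static int tracker_write(struct hole_tracker *t, uint64_t offset,
                         const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len == 0)
        return 0;
    if (tracker_reserve(t, (size_t)((offset + len - 1) / t->blksize)) < 0)
        return -1;

    for (size_t i = 0; i < len; i++) {
        size_t blk = (size_t)((offset + i) / t->blksize);

        if (!t->used[blk] && p[i] != 0) {
            t->used[blk] = 1;
            t->non_hole_blocks++;
        }
    }
    return 0;
}
```

Multiplying `non_hole_blocks` by `blksize` gives an allocation estimate under the assumption that zero-filled blocks are skipped rather than written, which is the point questioned later in the thread.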

Re: [PATCH v3 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2020-10-01 Thread lijiang

Hi, Alexander

On 2020-09-30 18:23, Alexander Egorenkov wrote:
> The offset of the field 'init_uts_ns.name' has changed
> since commit 9a56493f6942 ("uts: Use generic ns_common::count").
> 
> Link: 
> https://lore.kernel.org/r/159644978167.604812.1773586504374412107.stgit@localhost.localdomain
> 
> Make the offset of the field 'uts_namespace.name' available
> in VMCOREINFO because tools like 'crash-utility' and
> 'makedumpfile' must be able to read it from crash dumps.
> 
> Signed-off-by: Alexander Egorenkov 
> ---
> 
> v2 -> v3:
>  * Added documentation to vmcoreinfo.rst
>  * Use the short form of the commit reference
> 
> v1 -> v2:
>  * Improved commit message
>  * Added link to the discussion of the uts namespace changes
> 
>  Documentation/admin-guide/kdump/vmcoreinfo.rst | 6 ++
>  kernel/crash_core.c| 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst 
> b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index e44a6c01f336..3861a25faae1 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -39,6 +39,12 @@ call.
>  User-space tools can get the kernel name, host name, kernel release
>  number, kernel version, architecture name and OS type from it.
>  
> +(uts_namespace, name)
> +-
> +
> +Offset of the name's member. Crash Utility and Makedumpfile get
> +the start address of the init_uts_ns.name from this.
> +

Thank you for the update. The v3 looks good to me.

>  node_online_map
>  ---
>  
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..173fdc261882 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -447,6 +447,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   VMCOREINFO_PAGESIZE(PAGE_SIZE);
>  
>   VMCOREINFO_SYMBOL(init_uts_ns);
> + VMCOREINFO_OFFSET(uts_namespace, name);
>   VMCOREINFO_SYMBOL(node_online_map);
>  #ifdef CONFIG_MMU
>   VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
> 
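
Why a dump reader needs this offset can be shown with a mock layout. The structs below are illustrative only and do not match the real struct uts_namespace; the names `mock_utsname`, `mock_uts_namespace`, `field_address()` and `release_from_image()` are hypothetical. The sketch shows the arithmetic a tool performs: address of the init_uts_ns symbol plus the exported offset of `name` gives the address of the name data.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock layout for illustration only; the real kernel structs differ. */
struct mock_utsname {
    char sysname[65];
    char release[65];
};

struct mock_uts_namespace {
    int leading_member;        /* stands in for whatever precedes `name` */
    struct mock_utsname name;
};

/* Symbol address plus exported member offset gives the member address. */
static uint64_t field_address(uint64_t symbol_addr, uint64_t field_offset)
{
    return symbol_addr + field_offset;
}

/* Read the release string from a raw memory image using only offsets. */
static const char *release_from_image(const unsigned char *image,
                                      size_t ns_pos, size_t name_offset)
{
    return (const char *)(image + ns_pos + name_offset +
                          offsetof(struct mock_utsname, release));
}
```

If members are added or reordered in front of `name`, only the exported offset changes, so a reader built this way keeps working without hard-coding the layout.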



Re: [PATCH v2 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2020-09-30 Thread lijiang

Hi, Alexander

On 2020-09-24 20:46, Alexander Egorenkov wrote:
> The offset of the field 'init_uts_ns.name' has changed
> since
> 
> commit 9a56493f6942c0e2df1579986128721da96e00d8
> Author: Kirill Tkhai 
> Date:   Mon Aug 3 13:16:21 2020 +0300
> 
> uts: Use generic ns_common::count
> 
> Link: 
> https://lore.kernel.org/r/159644978167.604812.1773586504374412107.stgit@localhost.localdomain
> 
> Make the offset of the field 'uts_namespace.name' available
> in VMCOREINFO because tools like 'crash-utility' and
> 'makedumpfile' must be able to read it from crash dumps.
> 
> Signed-off-by: Alexander Egorenkov 
> ---
>  kernel/crash_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..173fdc261882 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -447,6 +447,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   VMCOREINFO_PAGESIZE(PAGE_SIZE);
>  
>   VMCOREINFO_SYMBOL(init_uts_ns);
> + VMCOREINFO_OFFSET(uts_namespace, name);

Since the new offset is exported, would you mind adding it to
Documentation/admin-guide/kdump/vmcoreinfo.rst?

Thanks.
Lianbo
>   VMCOREINFO_SYMBOL(node_online_map);
>  #ifdef CONFIG_MMU
>   VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
> 



Re: [PATCH v2] docs: admin-guide: update kdump documentation due to change of crash URL

2020-09-23 Thread lijiang

Since crash utility has been moved to github, the original URL is no
longer available. Let's update it accordingly.

Suggested-by: Dave Young 
Signed-off-by: Lianbo Jiang 
---
 Documentation/admin-guide/kdump/kdump.rst | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 2da65fef2a1c..75a9dd98e76e 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -509,9 +509,12 @@ ELF32-format headers using the --elf32-core-headers kernel option on the
 dump kernel.
 
 You can also use the Crash utility to analyze dump files in Kdump
-format. Crash is available on Dave Anderson's site at the following URL:
+format. Crash is available at the following URL:
 
-   http://people.redhat.com/~anderson/
+   https://github.com/crash-utility/crash
+
+Crash document can be found at:
+   https://crash-utility.github.io/
 
 Trigger Kdump on WARN()
 =======================
-- 
2.17.1



Re: [PATCH] docs: admin-guide: update kdump documentation due to change of crash URL

2020-09-22 Thread lijiang

On 2020-09-18 16:09, Lianbo Jiang wrote:
> Since crash utility has moved to github, the original URL is no longer
   ^
  has been moved to github

Because of the above mistake, I will correct it and send a v2.

Thanks.

> available. Let's update it accordingly.
> 
> Suggested-by: Dave Young 
> Signed-off-by: Lianbo Jiang 
> ---
>  Documentation/admin-guide/kdump/kdump.rst | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
> index 2da65fef2a1c..75a9dd98e76e 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -509,9 +509,12 @@ ELF32-format headers using the --elf32-core-headers kernel option on the
>  dump kernel.
>  
>  You can also use the Crash utility to analyze dump files in Kdump
> -format. Crash is available on Dave Anderson's site at the following URL:
> +format. Crash is available at the following URL:
>  
> -   http://people.redhat.com/~anderson/
> +   https://github.com/crash-utility/crash
> +
> +Crash document can be found at:
> +   https://crash-utility.github.io/
>  
>  Trigger Kdump on WARN()
>  =======================
> 



Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-07 Thread lijiang

On 2020-07-03 19:54, John Ogness wrote:
> On 2020-07-02, lijiang  wrote:
>> Regarding the VMCOREINFO part, I ran some tests based on the v3 kernel
>> patch. makedumpfile and crash-utility work as expected with your
>> userspace patches. Unfortunately, vmcore-dmesg (kexec-tools) cannot
>> read the printk ring buffer and fails with the following error:
>>
>> "Missing the log_buf symbol"
>>
>> kexec-tools (vmcore-dmesg) will also need a similar patch, just like
>> makedumpfile and crash-utility.
> 
> A patched kexec-tools is available here [0].
> 
> I did not test using 32-bit dumps on 64-bit machines and vice versa. But
> it should work.
> 
> John Ogness
> 
> [0] https://github.com/Linutronix/kexec-tools.git (printk branch)
> 

With this patch applied, vmcore-dmesg works.

Thank you, John Ogness.
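
The "Missing the log_buf symbol" error presumably arises because the tool looks up a symbol that the new ring buffer does not export under that name. Below is a minimal sketch of the kind of fallback lookup a dump tool could perform. It assumes that VMCOREINFO carries SYMBOL(<name>)=<hex> lines and that the lockless ring buffer is exported under the symbol name prb. The prb name and every function and enum name here are assumptions for illustration, not code from kexec-tools.

```c
#include <stdlib.h>
#include <string.h>

/* Which log buffer layout a dump appears to use (hypothetical enum). */
enum log_layout { LOG_LAYOUT_NONE, LOG_LAYOUT_LEGACY, LOG_LAYOUT_LOCKLESS };

/*
 * Find "SYMBOL(<name>)=<hex>" in a VMCOREINFO text blob.
 * Returns 1 and stores the address on success, 0 otherwise.
 */
int vmcoreinfo_symbol(const char *info, const char *name,
                      unsigned long long *addr)
{
    size_t nlen = strlen(name);
    const char *line = info;

    while (line && *line) {
        /* Match "SYMBOL(" + name + ")=" at the start of the line. */
        if (strncmp(line, "SYMBOL(", 7) == 0 &&
            strncmp(line + 7, name, nlen) == 0 &&
            strncmp(line + 7 + nlen, ")=", 2) == 0) {
            const char *val = line + 7 + nlen + 2;
            char *end;

            *addr = strtoull(val, &end, 16);
            return end != val;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return 0;
}

/* Prefer the (assumed) lockless export, fall back to the legacy one. */
enum log_layout detect_log_layout(const char *info)
{
    unsigned long long addr;

    if (vmcoreinfo_symbol(info, "prb", &addr))
        return LOG_LAYOUT_LOCKLESS;
    if (vmcoreinfo_symbol(info, "log_buf", &addr))
        return LOG_LAYOUT_LEGACY;
    return LOG_LAYOUT_NONE;
}
```

A tool structured this way can report a missing log buffer only when neither symbol is present, instead of failing as soon as log_buf is absent.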



Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-03 Thread lijiang

On 2020-07-02 21:31, Petr Mladek wrote:
> On Thu 2020-07-02 17:43:22, lijiang wrote:
>> On 2020-07-02 17:02, John Ogness wrote:
>>> On 2020-07-02, lijiang  wrote:
>>>> Regarding the VMCOREINFO part, I ran some tests based on the v3 kernel
>>>> patch. makedumpfile and crash-utility work as expected with your
>>>> userspace patches. Unfortunately, vmcore-dmesg (kexec-tools) cannot
>>>> read the printk ring buffer and fails with the following error:
>>>>
>>>> "Missing the log_buf symbol"
>>>>
>>>> kexec-tools (vmcore-dmesg) will also need a similar patch, just like
>>>> makedumpfile and crash-utility.
>>>
>>> Yes, a patch for this is needed (as well as for any other related
>>> software floating around the internet).
>>>
>>> I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
>>> would be quite straight forward to port the makedumpfile patch. I will
>>
>> Yes, it should be a similar patch.
>>
>>> try to make some time for this.
>>>
>> That would be nice. Thank you, John Ogness.
>>
>>> I do not want to patch any other software for this. I think with 3
>>> examples (crash, makedumpfile, vmcore-dmesg), others should be able to
>>
>> It's good enough to have patches for makedumpfile, crash and vmcore-dmesg,
>> which ensures that the kdump userspace tools work well.
> 
> I agree that these three are the most important ones and should be
> enough.
> 
> Thanks a lot for working on it and testing it.
> 
My pleasure. I will test the vmcore-dmesg later.

Thanks.
Lianbo

> Best Regards,
> Petr
> 



Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-02 Thread lijiang

On 2020-07-02 17:02, John Ogness wrote:
> On 2020-07-02, lijiang  wrote:
>> Regarding the VMCOREINFO part, I ran some tests based on the v3 kernel
>> patch. makedumpfile and crash-utility work as expected with your
>> userspace patches. Unfortunately, vmcore-dmesg (kexec-tools) cannot
>> read the printk ring buffer and fails with the following error:
>>
>> "Missing the log_buf symbol"
>>
>> kexec-tools (vmcore-dmesg) will also need a similar patch, just like
>> makedumpfile and crash-utility.
> 
> Yes, a patch for this is needed (as well as for any other related
> software floating around the internet).
> 
> I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
> would be quite straight forward to port the makedumpfile patch. I will

Yes, it should be a similar patch.

> try to make some time for this.
> 
That would be nice. Thank you, John Ogness.

> I do not want to patch any other software for this. I think with 3
> examples (crash, makedumpfile, vmcore-dmesg), others should be able to

It's good enough to have patches for makedumpfile, crash and vmcore-dmesg,
which ensures that the kdump userspace tools work well.

Thanks.
Lianbo

> implement the changes to their software without needing my help.
> 
> John Ogness
> 



Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

2020-07-02 Thread lijiang

Hi, John Ogness

Regarding the VMCOREINFO part, I ran some tests based on the v3 kernel
patch. makedumpfile and crash-utility work as expected with your
userspace patches. Unfortunately, vmcore-dmesg (kexec-tools) cannot
read the printk ring buffer and fails with the following error:

"Missing the log_buf symbol"

kexec-tools (vmcore-dmesg) will also need a similar patch, just like
makedumpfile and crash-utility.

BTW: If you already have a patch for kexec-tools, would you mind sharing
it? I will test vmcore-dmesg with it.

Thanks.
Lianbo

On 2020-06-18 22:49, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.
> 
> Changes that exist due to the ringbuffer replacement:
> 
> - The VMCOREINFO has been updated for the new structures.
> 
> - Dictionary data is now stored in a separate data buffer from the
>   human-readable messages. The dictionary data buffer is set to the
>   same size as the message buffer. Therefore, the total required
>   memory for both dictionary and message data is
>   2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and
>   2 * log_buf_len (the kernel parameter) for the dynamic buffers.
> 
> - Record meta-data is now stored in a separate array of descriptors.
>   This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes
>   for the static array and 72 * (log_buf_len >> 5) bytes for the
>   dynamic array.
> 
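
As a worked example of the sizes listed above, the sketch below evaluates both formulas for a given CONFIG_LOG_BUF_SHIFT. The 72-byte descriptor size and the shift by 5 are taken from the commit message. The helper names are hypothetical, and the shift value 17 used in the note that follows is only an assumed example.

```c
#include <stddef.h>

/* Bytes per record descriptor, as stated in the commit message. */
#define PRB_DESC_BYTES 72u

/* Message text plus dictionary data: two buffers of 2^shift bytes each. */
size_t prb_data_bytes(unsigned int shift)
{
    return 2 * ((size_t)1 << shift);
}

/*
 * Descriptor array: 72 * 2^(shift - 5) bytes, i.e. one descriptor per
 * 32 bytes of message buffer. The caller must pass shift >= 5.
 */
size_t prb_desc_bytes(unsigned int shift)
{
    return PRB_DESC_BYTES * ((size_t)1 << (shift - 5));
}
```

With shift = 17, prb_data_bytes(17) is 262144 and prb_desc_bytes(17) is 294912, which adds up to 557056 bytes (544 KiB) for the static buffers.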
> Signed-off-by: John Ogness 
> ---
>  include/linux/kmsg_dump.h |   2 -
>  kernel/printk/printk.c| 944 --
>  2 files changed, 497 insertions(+), 449 deletions(-)
> 
> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> index 3378bcbe585e..c9b0abe5ca91 100644
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
> @@ -45,8 +45,6 @@ struct kmsg_dumper {
>   bool registered;
>  
>   /* private state of the kmsg iterator */
> - u32 cur_idx;
> - u32 next_idx;
>   u64 cur_seq;
>   u64 next_seq;
>  };
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 8c14835be46c..7642ef634956 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -55,6 +55,7 @@
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> +#include "printk_ringbuffer.h"
>  #include "console_cmdline.h"
>  #include "braille.h"
>  #include "internal.h"
> @@ -294,30 +295,24 @@ enum con_msg_format_flags {
>  static int console_msg_format = MSG_FORMAT_DEFAULT;
>  
>  /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
>   *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> + * Every record meta-data carries the timestamp in microseconds, as well as
> + * the standard userspace syslog level and syslog facility. The usual kernel
> + * messages use LOG_KERN; userspace-injected messages always carry a matching
> + * syslog facility, by default LOG_USER. The origin of every message can be
> + * reliably determined that way.
>   *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> + * The human readable log message of a record is available in @text, the
> + * length of the message text in @text_len. The stored message is not
> + * terminated.
>   *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> - * kernel messages use LOG_KERN; userspace-injected messages always carry
> - * a matching syslog facility, by default LOG_USER. The origin of every
> - * message can be reliably determined that way.
> - *
> - * The human readable log message directly follows the message header. The
> - * length of the message text is stored in the header, the stored message
> - * is not terminated.
> - *
> - * Optionally, a message can carry a dictionary of properties (key/value pairs),
> - * to provide userspace with a machine-readable message context.
> + * Optionally, a record can carry a dictionary of properties (key/value
> + * pairs), to provide userspace with a machine-readable message context. The
> + * length of the dictionary is available in @dict_len. The dictionary is not
> + * terminated.
>   *
>   * Examples for

Re: [PATCH v2] kexec: Do not verify the signature without the lockdown or mandatory signature

2020-06-18 Thread lijiang

On 2020-06-18 03:37, Andrew Morton wrote:
> On Tue,  2 Jun 2020 12:59:52 +0800 Lianbo Jiang  wrote:
> 
>> Signature verification is an important security feature, to protect
>> system from being attacked with a kernel of unknown origin. Kexec
>> rebooting is a way to replace the running kernel, hence need be
>> secured carefully.
> 
> I'm finding this changelog quite hard to understand,
> 
Thanks for your comment.

I will improve the patch log and try to make it easier to understand.

>> In the current code of handling signature verification of kexec kernel,
>> the logic is very twisted. It mixes signature verification, IMA signature
>> appraising and kexec lockdown.
>>
>> If there is no KEXEC_SIG_FORCE, kexec kernel image doesn't have one of
>> signature, the supported crypto, and key, we don't think this is wrong,
> 
> I think this is saying that in the absence of KEXEC_SIG_FORCE and if
> the signature/crypto/key are all incorrect, the kexec still succeeds,
> but it should not.
> 
When KEXEC_SIG_FORCE is not enabled, kexec should still be allowed to load
the kernel image even if it lacks a signature, a key, etc., unless kexec
lockdown is in effect.

>> Unless kexec lockdown is executed. IMA is considered as another kind of
>> signature appraising method.
>>
>> If kexec kernel image has signature/crypto/key, it has to go through the
>> signature verification and pass. Otherwise it's seen as verification
>> failure, and won't be loaded.
> 
> I don't know if this is describing the current situation or the
> post-patch situation.
> 
This is the current situation. We'd like to change it so that kexec allows
the kernel and initrd images to be loaded when neither lockdown nor
mandatory signature verification is enabled.

>> Seems kexec kernel image with an unqualified signature is even worse than
>> those w/o signature at all, this sounds very unreasonable. E.g. If people
>> get a unsigned kernel to load, or a kernel signed with expired key, which
>> one is more dangerous?
>>
>> So, here, let's simplify the logic to improve code readability. If the
>> KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature verification
>> is mandated. Otherwise, we lift the bar for any kernel image.
> 
> I think the whole thing needs a rewrite.  Start out by fully describing
> the current situation.  Then describe what is wrong with it, and why. 
> Then describe the proposed change.  Or something along these lines.
> 
> The changelog should also make clear the end-user impact of the patch. 
> In sufficient detail for others to decide which kernel version(s)
> should be patched.  Your recommendations will also be valuable - which
> kernel version(s) do you think should be patched, and why?
> 

Currently, the kernel always verifies the signature, even when neither
lockdown nor mandatory signature verification is enabled. This may prevent
the kernel and initrd images from being loaded via the kexec_file_load()
syscall. In that case, we'd like the images to still be loaded rather than
rejected because of a signature verification failure.

For example, during development and testing, a signing key is often used to
check that the signing procedure works as expected. Sometimes the key has
expired, but the kernel signed with the old key is still used to reproduce
problems in automated tests, and loading the images then always fails.

Let's clean up the kernel logic and allow the kernel and initrd images to be
loaded when neither lockdown nor mandatory signature verification is enabled.


Hope this helps.

Thanks.
Lianbo
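
The intended policy can be modelled outside the kernel as a small pure function. This is only a sketch of the logic proposed in the v2 patch: the three boolean parameters stand in for IS_ENABLED(CONFIG_KEXEC_SIG_FORCE), ima_appraise_signature(READING_KEXEC_IMAGE) and a non-zero return from security_locked_down(LOCKDOWN_KEXEC), and the function name kexec_sig_policy() is hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model of the proposed kimage_validate_signature() policy.
 *   verify_ret   : 0 if the signature verified, a negative errno otherwise
 *   sig_force    : mandatory signature verification is configured
 *   ima_appraise : IMA is guaranteed to appraise the kexec image
 *   locked_down  : kexec lockdown is in effect
 * Returns 0 if the image may be loaded, a negative errno if it is rejected.
 */
int kexec_sig_policy(int verify_ret, bool sig_force, bool ima_appraise,
                     bool locked_down)
{
    if (verify_ret == 0)
        return 0;               /* valid signature: always accepted */

    if (sig_force)
        return verify_ret;      /* enforced: any failure is fatal */

    if (!ima_appraise && locked_down)
        return -EPERM;          /* lockdown without IMA appraisal */

    return 0;                   /* otherwise the failure is ignored */
}
```

In this model an unsigned image and an image signed with an expired key are treated the same way: both load unless enforcement or lockdown applies.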



Re: [PATCH v2] kexec: Do not verify the signature without the lockdown or mandatory signature

2020-06-10 Thread lijiang


I just noticed that I forgot to add Eric Biederman to the cc list, sorry
about that.

Thanks.
Lianbo

On 2020-06-02 12:59, Lianbo Jiang wrote:
> Signature verification is an important security feature, to protect
> system from being attacked with a kernel of unknown origin. Kexec
> rebooting is a way to replace the running kernel, hence need be
> secured carefully.
> 
> In the current code of handling signature verification of kexec kernel,
> the logic is very twisted. It mixes signature verification, IMA signature
> appraising and kexec lockdown.
> 
> If there is no KEXEC_SIG_FORCE, kexec kernel image doesn't have one of
> signature, the supported crypto, and key, we don't think this is wrong,
> Unless kexec lockdown is executed. IMA is considered as another kind of
> signature appraising method.
> 
> If kexec kernel image has signature/crypto/key, it has to go through the
> signature verification and pass. Otherwise it's seen as verification
> failure, and won't be loaded.
> 
> Seems kexec kernel image with an unqualified signature is even worse than
> those w/o signature at all, this sounds very unreasonable. E.g. If people
> get a unsigned kernel to load, or a kernel signed with expired key, which
> one is more dangerous?
> 
> So, here, let's simplify the logic to improve code readability. If the
> KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature verification
> is mandated. Otherwise, we lift the bar for any kernel image.
> 
> Signed-off-by: Lianbo Jiang 
> ---
> Changes since v1:
> [1] Modify the log level(suggested by Jiri Bohac)
> 
>  kernel/kexec_file.c | 34 ++
>  1 file changed, 6 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index faa74d5f6941..fae496958a68 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -181,34 +181,19 @@ void kimage_file_post_load_cleanup(struct kimage *image)
>  static int
>  kimage_validate_signature(struct kimage *image)
>  {
> - const char *reason;
>   int ret;
>  
>   ret = arch_kexec_kernel_verify_sig(image, image->kernel_buf,
>  image->kernel_buf_len);
> - switch (ret) {
> - case 0:
> - break;
> + if (ret) {
>  
> - /* Certain verification errors are non-fatal if we're not
> -  * checking errors, provided we aren't mandating that there
> -  * must be a valid signature.
> -  */
> - case -ENODATA:
> - reason = "kexec of unsigned image";
> - goto decide;
> - case -ENOPKG:
> - reason = "kexec of image with unsupported crypto";
> - goto decide;
> - case -ENOKEY:
> - reason = "kexec of image with unavailable key";
> - decide:
>   if (IS_ENABLED(CONFIG_KEXEC_SIG_FORCE)) {
> - pr_notice("%s rejected\n", reason);
> + pr_notice("Enforced kernel signature verification failed (%d).\n", ret);
>   return ret;
>   }
>  
> - /* If IMA is guaranteed to appraise a signature on the kexec
> + /*
> +  * If IMA is guaranteed to appraise a signature on the kexec
>* image, permit it even if the kernel is otherwise locked
>* down.
>*/
> @@ -216,17 +201,10 @@ kimage_validate_signature(struct kimage *image)
>   security_locked_down(LOCKDOWN_KEXEC))
>   return -EPERM;
>  
> - return 0;
> -
> - /* All other errors are fatal, including nomem, unparseable
> -  * signatures and signature check failures - even if signatures
> -  * aren't required.
> -  */
> - default:
> - pr_notice("kernel signature verification failed (%d).\n", ret);
> + pr_debug("kernel signature verification failed (%d).\n", ret);
>   }
>  
> - return ret;
> + return 0;
>  }
>  #endif
>  
> 



Re: [PATCH] kexec: Do not verify the signature without the lockdown or mandatory signature

2020-05-26 Thread lijiang

On 2020-05-27 11:15, lijiang wrote:
> On 2020-05-26 21:59, Jiri Bohac wrote:
>> On Mon, May 25, 2020 at 01:23:51PM +0800, Lianbo Jiang wrote:
>>> So, here, let's simplify the logic to improve code readability. If the
>>> KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature verification
>>> is mandated. Otherwise, we lift the bar for any kernel image.
>>
>> I agree completely; in fact that was my intention when
>> introducing the code, but I got overruled about the return codes:
>> https://lore.kernel.org/lkml/20180119125425.l72meyyc2qtrr...@dwarf.suse.cz/
>>
>> I like this simplification very much, except this part:
>>
>>> +   if (ret) {
>>> +   pr_debug("kernel signature verification failed (%d).\n", ret);
>>
>> ...
>>
>>> -   pr_notice("kernel signature verification failed (%d).\n", ret);
>>
>> I think the log level should stay at most PR_NOTICE when the
>> verification failure results in rejecting the kernel. Perhaps
>> even lower.
>>
> 
> Thank you for the comment, Jiri Bohac.
> 
> I like the idea of keeping it at PR_NOTICE at most, but pr_notice() would
> print messages that we may want to ignore, such as in the case you mention
> below.
> 
>> In case verification is not enforced and the failure is
>> ignored, KERN_DEBUG seems reasonable.
>>
> 
> Yes, agreed. So pr_debug() still seems to be a good option here?
> Any other thoughts?
> 

Or does the following change look better? What's your opinion?

static int
kimage_validate_signature(struct kimage *image)
{
	int ret;

	ret = arch_kexec_kernel_verify_sig(image, image->kernel_buf,
					   image->kernel_buf_len);
	if (ret) {
		if (IS_ENABLED(CONFIG_KEXEC_SIG_FORCE)) {
			pr_notice("Enforced kernel signature verification failed (%d).\n", ret);
			return ret;
		}

		/*
		 * If IMA is guaranteed to appraise a signature on the kexec
		 * image, permit it even if the kernel is otherwise locked
		 * down.
		 */
		if (!ima_appraise_signature(READING_KEXEC_IMAGE) &&
		    security_locked_down(LOCKDOWN_KEXEC))
			return -EPERM;

		pr_debug("kernel signature verification failed (%d).\n", ret);
	}

	return 0;
}


Thanks.
Lianbo

> Thanks.
> Lianbo
> 
> 
>> Regards,
>>



Re: [PATCH] kexec: Do not verify the signature without the lockdown or mandatory signature

2020-05-26 Thread lijiang

On 2020-05-26 21:59, Jiri Bohac wrote:
> On Mon, May 25, 2020 at 01:23:51PM +0800, Lianbo Jiang wrote:
>> So, here, let's simplify the logic to improve code readability. If the
>> KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature verification
>> is mandated. Otherwise, we lift the bar for any kernel image.
> 
> I agree completely; in fact that was my intention when
> introducing the code, but I got overruled about the return codes:
> https://lore.kernel.org/lkml/20180119125425.l72meyyc2qtrr...@dwarf.suse.cz/
> 
> I like this simplification very much, except this part:
> 
>> +if (ret) {
>> +pr_debug("kernel signature verification failed (%d).\n", ret);
> 
> ...
> 
>> -pr_notice("kernel signature verification failed (%d).\n", ret);
> 
> I think the log level should stay at most PR_NOTICE when the
> verification failure results in rejecting the kernel. Perhaps
> even lower.
> 

Thank you for the comment, Jiri Bohac.

I like the idea of keeping it at PR_NOTICE at most, but pr_notice() would
print messages that we may want to ignore, such as in the case you mention
below.

> In case verification is not enforced and the failure is
> ignored, KERN_DEBUG seems reasonable.
> 

Yes, agreed. So pr_debug() still seems to be a good option here?
Any other thoughts?

Thanks.
Lianbo


> Regards,
> 



Re: [PATCH v2] kexec: support parsing the string "Reserved" to get the correct e820 reserved region

2020-02-24 Thread lijiang

On 2020-02-24 14:48, Baoquan He wrote:
> On 02/24/20 at 02:36pm, Lianbo Jiang wrote:
>> When loading kernel and initramfs for kexec, kexec-tools could get the
>> e820 reserved region from "/proc/iomem" in order to rebuild the e820
>> ranges for kexec kernel, but there may be the string "Reserved" in the
>> "/proc/iomem", which caused the failure of parsing. For example:
>>
>>  #cat /proc/iomem|grep -i reserved
>> -0fff : Reserved
>> 7f338000-7f34dfff : Reserved
>> 7f3cd000-8fff : Reserved
>> f17f-f17f1fff : Reserved
>> fe00- : Reserved
> 
> This looks good to me. However, is it investigated why there are two
> different names for reserved e820 regions? Can we unify them with one
> name in kernel, 'Reserved' or 'reserved'?
> 
Thanks for your comment.

As we discussed on IRC, kexec-tools has to stay compatible with both the
old "reserved" and the new "Reserved".
Please refer to this commit: 640e1b38b005 ("x86/boot/e820: Basic cleanup
of e820.c")

In addition, I will check the kernel code carefully to see whether it needs
to be fixed upstream.

Thanks.
Lianbo
> 
>>
>> Currently, kexec-tools can not handle the above case because the memcmp()
>> is case sensitive when comparing the string.
>>
>> So, let's fix this corner and make sure that the string "reserved" and
>> "Reserved" in the "/proc/iomem" are both parsed appropriately.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> Note:
>> Please follow up this commit below about kdump fix.
>> 1ac3e4a57000 ("kdump: fix an error that can not parse the e820 reserved region")
>>
>> Changes since v1:
>> [1] use strncasecmp() instead of introducing another 'else-if'(
>> suggested by Bhupesh)
>>
>>  kexec/arch/i386/kexec-x86-common.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kexec/arch/i386/kexec-x86-common.c b/kexec/arch/i386/kexec-x86-common.c
>> index 61ea19380ab2..9303704a0714 100644
>> --- a/kexec/arch/i386/kexec-x86-common.c
>> +++ b/kexec/arch/i386/kexec-x86-common.c
>> @@ -90,7 +90,7 @@ static int get_memory_ranges_proc_iomem(struct memory_range **range, int *ranges
>>  if (memcmp(str, "System RAM\n", 11) == 0) {
>>  type = RANGE_RAM;
>>  }
>> -else if (memcmp(str, "reserved\n", 9) == 0) {
>> +else if (strncasecmp(str, "reserved\n", 9) == 0) {
>>  type = RANGE_RESERVED;
>>  }
>>  else if (memcmp(str, "ACPI Tables\n", 12) == 0) {
>> -- 
>> 2.17.1
>>
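
To illustrate why the one-line change is sufficient, the sketch below contrasts the old case-sensitive comparison with the case-insensitive one from the patch. The two wrapper function names are hypothetical and exist only for this comparison; memcmp() and strncasecmp() are used as in the diff above.

```c
#include <string.h>
#include <strings.h>

/* Pre-patch check: matches only the lowercase spelling. */
int is_reserved_old(const char *str)
{
    return memcmp(str, "reserved\n", 9) == 0;
}

/* Post-patch check: matches "reserved" and "Reserved" alike. */
int is_reserved_new(const char *str)
{
    return strncasecmp(str, "reserved\n", 9) == 0;
}
```

Note that memcmp() always reads 9 bytes from str, while strncasecmp() stops at a terminating NUL, so the new check is also safe for names shorter than 9 bytes.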



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-18 Thread lijiang

Hi, John Ogness

Thank you for improving the patch series and for your great efforts.

I'm not sure whether I missed anything. Are there any other related patches
that need to be applied?

After applying this patch series, the NMI watchdog detected a hard lockup,
which prevented the kernel from booting. Please refer to the following call
trace. I put the complete kernel log in the attachment.

Test machine: 
Intel Platform: Grantley-R Wildcat Pass CPU: Broadwell-EP, B0
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
65536 MB memory, 800 GB disk space

kernel: v5.5-rc7
commit: def9d2780727 ("Linux 5.5-rc7")

..
[  OK  ] Started udev Coldplug all Devices.
[   42.110978] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
[   42.110978] Modules linked in: ip_tables xfs libcrc32c sr_mod cdrom sd_mod sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_vram_helper drm_ttm_helper ttm ahci libahci ixgbe drm crc32c_intel libata mdio dca i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod
[   42.110986] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
[   42.110986] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
[   42.110987] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1c0
[   42.110988] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 
09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 
f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 00 75
[   42.110988] RSP: 0018:bbe207a7bc48 EFLAGS: 0002
[   42.110989] RAX: 00f80101 RBX: a1576e80 RCX: 
[   42.110990] RDX:  RSI:  RDI: a1e95660
[   42.110990] RBP:  R08:  R09: 000b
[   42.110991] R10: a075df5dcf80 R11: a0ebfda0 R12: a1e95660
[   42.110991] R13: a1e97680 R14: a17197a0 R15: 0047
[   42.110991] FS:  7f7c5642a980() GS:a075df5c() 
knlGS:
[   42.110992] CS:  0010 DS:  ES:  CR0: 80050033
[   42.110992] CR2: 7ffe95f4c4c0 CR3: 00084fbfc004 CR4: 003606e0
[   42.110993] DR0:  DR1:  DR2: 
[   42.110993] DR3:  DR6: fffe0ff0 DR7: 0400
[   42.110993] Call Trace:
[   42.110993]  _raw_spin_lock+0x1a/0x20
[   42.110994]  console_unlock+0x9e/0x450
[   42.110994]  bust_spinlocks+0x16/0x30
[   42.110994]  oops_end+0x33/0xc0
[   42.110995]  general_protection+0x32/0x40
[   42.110995] RIP: 0010:copy_data+0xf2/0x1e0
[   42.110995] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 
cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 
00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
[   42.110996] RSP: 0018:bbe207a7bd80 EFLAGS: 00010002
[   42.110996] RAX: a075d44ca000 RBX: 00a8 RCX: fff000b0
[   42.110997] RDX: 00a8 RSI: 0f01 RDI: a1456e00
[   42.110997] RBP: 0801364600307073 R08: 2000 R09: 0801364600307073
[   42.110997] R10: fff0 R11: 00a8 R12: a1e98330
[   42.110998] R13: d7efbe00 R14: 00a8 R15: c000
[   42.110998]  _prb_read_valid+0xd8/0x190
[   42.110998]  prb_read_valid+0x15/0x20
[   42.110999]  devkmsg_read+0x9d/0x2a0
[   42.110999]  vfs_read+0x91/0x140
[   42.110999]  ksys_read+0x59/0xd0
[   42.111000]  do_syscall_64+0x55/0x1b0
[   42.111000]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   42.111000] RIP: 0033:0x7f7c55740b62
[   42.111001] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 
80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 
f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
[   42.111001] RSP: 002b:7ffe95f4c4a8 EFLAGS: 0246 ORIG_RAX: 

[   42.111002] RAX: ffda RBX: 7ffe95f4e500 RCX: 7f7c55740b62
[   42.111002] RDX: 2000 RSI: 7ffe95f4c4b0 RDI: 0008
[   42.111002] RBP:  R08: 0100 R09: 0003
[   42.111003] R10: 0100 R11: 0246 R12: 7ffe95f4c4b0
[   42.111003] R13: 7ffe95f4e910 R14:  R15: 
[   42.111003] Kernel panic - not syncing: Hard LOCKUP
[   42.111004] Shutting down cpus with NMI
[   42.111004] Kernel Offset: 0x1f00 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[   42.111005] general protection fault:  [#1] SMP PTI
[   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
[   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
SE5C610.86B.01.01.6024.071720181717 07/17/2018
[   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
[   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 
cd 41 89 d6 44 89 44

Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-18 Thread lijiang

On 2020-01-29 00:19, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.
> 
> Changes that exist due to the ringbuffer replacement:
> 
> - The VMCOREINFO has been updated for the new structures.
> 
> - Dictionary data is now stored in a separate data buffer from the
>   human-readable messages. The dictionary data buffer is set to the
>   same size as the message buffer. Therefore, the total reserved
>   memory for messages is 2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the
>   initial static buffer and 2x the specified size in the log_buf_len
>   kernel parameter.
> 
> - Record meta-data is now stored in a separate array of descriptors.
>   This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
>   for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
>   the dynamic array.
> 
> Signed-off-by: John Ogness 
> ---
>  include/linux/kmsg_dump.h |   2 -
>  kernel/printk/Makefile|   1 +
>  kernel/printk/printk.c| 836 +++---
>  3 files changed, 416 insertions(+), 423 deletions(-)
> 
> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> index 2e7a1e032c71..ae6265033e31 100644
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
> @@ -46,8 +46,6 @@ struct kmsg_dumper {
>   bool registered;
>  
>   /* private state of the kmsg iterator */
> - u32 cur_idx;
> - u32 next_idx;
>   u64 cur_seq;
>   u64 next_seq;
>  };
> diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
> index 4d052fc6bcde..eee3dc9b60a9 100644
> --- a/kernel/printk/Makefile
> +++ b/kernel/printk/Makefile
> @@ -2,3 +2,4 @@
>  obj-y= printk.o
>  obj-$(CONFIG_PRINTK) += printk_safe.o
>  obj-$(CONFIG_A11Y_BRAILLE_CONSOLE)   += braille.o
> +obj-$(CONFIG_PRINTK) += printk_ringbuffer.o
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 1ef6f75d92f1..d0d24ee1d1f4 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -56,6 +56,7 @@
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> +#include "printk_ringbuffer.h"
>  #include "console_cmdline.h"
>  #include "braille.h"
>  #include "internal.h"
> @@ -294,30 +295,22 @@ enum con_msg_format_flags {
>  static int console_msg_format = MSG_FORMAT_DEFAULT;
>  
>  /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
>   *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> - *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> - *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> + * Every record meta-data carries the monotonic timestamp in microseconds, as
> + * well as the standard userspace syslog level and syslog facility. The usual
>   * kernel messages use LOG_KERN; userspace-injected messages always carry
>   * a matching syslog facility, by default LOG_USER. The origin of every
>   * message can be reliably determined that way.
>   *
> - * The human readable log message directly follows the message header. The
> - * length of the message text is stored in the header, the stored message
> - * is not terminated.
> + * The human readable log message of a record is available in @text, the length
> + * of the message text in @text_len. The stored message is not terminated.
>   *
> - * Optionally, a message can carry a dictionary of properties (key/value pairs),
> - * to provide userspace with a machine-readable message context.
> + * Optionally, a record can carry a dictionary of properties (key/value pairs),
> + * to provide userspace with a machine-readable message context. The length of
> + * the dictionary is available in @dict_len. The dictionary is not terminated.
>   *
>   * Examples for well-defined, commonly used property names are:
>   *   DEVICE=b12:8   device identifier
> @@ -331,21 +324,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
>   * follows directly after a '=' character. Every property is terminated by
>   * a '\0' character. The last property is not terminated.
>   *
> - * Example of a message structure:
> - *     ff 8f 00 00 00 00 00 00  monotonic time in nsec
>

Re: [PATCH 2/2] printk: use the lockless ringbuffer

2020-02-14 Thread lijiang

On 2020-02-14 21:50, John Ogness wrote:
> Hi Lianbo,
> 
> On 2020-02-14, lijiang  wrote:
>>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>>> index 1ef6f75d92f1..d0d24ee1d1f4 100644
>>> --- a/kernel/printk/printk.c
>>> +++ b/kernel/printk/printk.c
>>> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
>>>  {
>>> VMCOREINFO_SYMBOL(log_buf);
>>> VMCOREINFO_SYMBOL(log_buf_len);
>>
>> I notice that the "prb"(printk tb static) symbol is not exported into
>> vmcoreinfo as follows:
>>
>> +VMCOREINFO_SYMBOL(prb);
>>
>> Should the "prb"(printk tb static) symbol be exported into vmcoreinfo?
>> Otherwise, do you happen to know how to walk through the log_buf and
>> get all kernel logs from vmcore?
> 
> You are correct. This will need to be exported as well so that the
> descriptors can be accessed. (log_buf is only the pure human-readable

Agreed. I guess that more structures and their offsets may also need to be
exported, for example: struct prb_desc_ring, struct prb_data_ring, and
struct prb_desc.

This would make sure that tools (such as makedumpfile and crash) can access
them appropriately.

> text.) I am currently hacking the crash tool to see exactly what needs
> to be made available in order to access all the data of the ringbuffer.
> 
It makes sense and avoids exporting unnecessary symbols and offsets.
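To illustrate why the export matters: VMCOREINFO_SYMBOL() emits a text line of the form "SYMBOL(name)=address" into the vmcoreinfo note, and dump tools look symbols up by scanning for such lines. Below is a minimal sketch of that lookup. The helper name is hypothetical, and it is not the code used by makedumpfile or crash.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: search a vmcoreinfo-style text blob for
 * "SYMBOL(<name>)=<hex>" at the start of a line.
 * Returns 1 and stores the address on success, 0 if the symbol is absent.
 */
static int demo_find_symbol(const char *vmcoreinfo, const char *name,
			    unsigned long long *addr)
{
	char key[128];
	const char *line = vmcoreinfo;
	size_t klen;
	int n;

	n = snprintf(key, sizeof(key), "SYMBOL(%s)=", name);
	if (n < 0 || (size_t)n >= sizeof(key))
		return 0;
	klen = (size_t)n;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0) {
			char *endp;
			unsigned long long v = strtoull(line + klen, &endp, 16);

			if (endp == line + klen)
				return 0;	/* no hex digits after '=' */
			*addr = v;
			return 1;
		}
		/* advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}
```

With a blob that only contains log_buf and log_buf_len, a lookup of "prb" fails, which is the situation described above: without the export, a tool has no way to find the descriptor ring.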

Thanks.
Lianbo


> John Ogness
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: support parsing the string "Reserved" to get the correct e820 reserved region

2020-02-13 Thread lijiang

On 2020-02-13 03:46, Bhupesh Sharma wrote:
> Hi Lianbo,
> 
> Thanks for the patch.
> 
> On Wed, Feb 12, 2020 at 6:27 PM Lianbo Jiang  wrote:
>>
>> When loading kernel and initramfs for kexec, kexec-tools could get the
>> e820 reserved region from "/proc/iomem" in order to rebuild the e820
>> ranges for kexec kernel, but there may be the string "Reserved" in the
>> "/proc/iomem", which caused the failure of parsing. For example:
>>
>>  #cat /proc/iomem|grep -i reserved
>> -0fff : Reserved
>> 7f338000-7f34dfff : Reserved
>> 7f3cd000-8fff : Reserved
>> f17f-f17f1fff : Reserved
>> fe00- : Reserved
>>
>> Currently, kexec-tools can not handle the above case because the memcmp()
>> is case sensitive when comparing the string.
>>
>> So, let's fix this corner and make sure that the string "reserved" and
>> "Reserved" in the "/proc/iomem" are both parsed appropriately.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> Note:
>> Please follow up this commit below about kdump fix.
>> 1ac3e4a57000 ("kdump: fix an error that can not parse the e820 reserved region")
>>
>>  kexec/arch/i386/kexec-x86-common.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/kexec/arch/i386/kexec-x86-common.c b/kexec/arch/i386/kexec-x86-common.c
>> index 61ea19380ab2..86bcc8c0677e 100644
>> --- a/kexec/arch/i386/kexec-x86-common.c
>> +++ b/kexec/arch/i386/kexec-x86-common.c
>> @@ -93,6 +93,9 @@ static int get_memory_ranges_proc_iomem(struct memory_range **range, int *ranges
>> else if (memcmp(str, "reserved\n", 9) == 0) {
>> type = RANGE_RESERVED;
>> }
>> +   else if (memcmp(str, "Reserved\n", 9) == 0) {
>> +   type = RANGE_RESERVED;
>> +   }
> 
> Instead of introducing another 'else-if' case here, can we use
> strncasecmp() instead.
> 
> It  compares the two input strings (say s1 and s2), ignoring the case
> of the characters. Also it only compares the first n bytes of s1 (so
> the format is the same as memcmp).
> 
> In this way, we can be sure to future-proof the kexec-tools code check
> from future notation of the "Reserved" field in terms of the case used
> to denote the "Reserved" string.
> 
> What's your view on the same?
> 
Thanks for your comment, Bhupesh.

I have no strong preference; both approaches are fine with me.

If no one disagrees with this suggestion, I will switch to strncasecmp()
and post v2 later. Do any other reviewers have comments?
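For illustration, a minimal sketch of the strncasecmp() variant follows. The enum and function names are hypothetical stand-ins (the real code assigns RANGE_RESERVED inside get_memory_ranges_proc_iomem()); only the comparison itself mirrors the suggestion.

```c
#include <strings.h>	/* strncasecmp() (POSIX) */

/* Hypothetical stand-ins for the kexec-tools range types. */
enum demo_range_type { DEMO_RANGE_UNKNOWN, DEMO_RANGE_RESERVED };

/*
 * Classify the resource name taken from a /proc/iomem line.
 * One case-insensitive comparison covers "reserved", "Reserved"
 * and any other capitalisation, using the same length argument
 * as the memcmp() call it would replace.
 */
static enum demo_range_type demo_classify(const char *str)
{
	if (strncasecmp(str, "reserved\n", 9) == 0)
		return DEMO_RANGE_RESERVED;
	return DEMO_RANGE_UNKNOWN;
}
```

Unlike memcmp(), strncasecmp() stops at a terminating NUL byte, so a shorter string is rejected without reading past its end.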

Lianbo

> Regards,
> Bhupesh
> 
>> else if (memcmp(str, "ACPI Tables\n", 12) == 0) {
>> type = RANGE_ACPI;
>> }
>> --
>> 2.17.1
>>
> 



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-06 Thread lijiang

On 2020-01-29 00:19, John Ogness wrote:
> Hello,
> 
> After several RFC series [0][1][2][3][4], here is the first set of
> patches to rework the printk subsystem. This first set of patches
> only replace the existing ringbuffer implementation. No locking is
> removed. No semantics/behavior of printk are changed.
> 
> The VMCOREINFO is updated, which will require changes to the
> external crash [5] tool. I will be preparing a patch to add support
> for the new VMCOREINFO.
> 
In addition to the crash utility, I would think that the kexec-tools
(such as vmcore-dmesg and makedumpfile) also need to be modified
accordingly.

Thanks
Lianbo

> This series is in line with the agreements [6] made at the meeting
> during LPC2019 in Lisbon, with 1 exception: support for dictionaries
> will _not_ be discontinued [7]. Dictionaries are stored in a separate
> buffer so that they cannot interfere with the human-readable buffer.
> 
> John Ogness
> 
> [0] https://lkml.kernel.org/r/20190212143003.48446-1-john.ogn...@linutronix.de
> [1] https://lkml.kernel.org/r/20190607162349.18199-1-john.ogn...@linutronix.de
> [2] https://lkml.kernel.org/r/2019072701.11260-1-john.ogn...@linutronix.de
> [3] https://lkml.kernel.org/r/20190807222634.1723-1-john.ogn...@linutronix.de
> [4] https://lkml.kernel.org/r/20191128015235.12940-1-john.ogn...@linutronix.de
> [5] https://github.com/crash-utility/crash
> [6] https://lkml.kernel.org/r/87k1acz5rx@linutronix.de
> [7] https://lkml.kernel.org/r/20191007120134.ciywr3wale4gx...@pathway.suse.cz
> 
> John Ogness (2):
>   printk: add lockless buffer
>   printk: use the lockless ringbuffer
> 
>  include/linux/kmsg_dump.h |2 -
>  kernel/printk/Makefile|1 +
>  kernel/printk/printk.c|  836 +-
>  kernel/printk/printk_ringbuffer.c | 1370 +
>  kernel/printk/printk_ringbuffer.h |  328 +++
>  5 files changed, 2114 insertions(+), 423 deletions(-)
>  create mode 100644 kernel/printk/printk_ringbuffer.c
>  create mode 100644 kernel/printk/printk_ringbuffer.h
> 



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-06 Thread lijiang

> On 2020-02-05, lijiang  wrote:
>> Do you have any suggestions about the size of CONFIG_LOG_* and
>> CONFIG_PRINTK_* options by default?
> 
> The new printk implementation consumes more than double the memory that
> the current printk implementation requires. This is because dictionaries
> and meta-data are now stored separately.
> 
> If the old defaults (LOG_BUF_SHIFT=17 LOG_CPU_MAX_BUF_SHIFT=12) were
> chosen because they are maximally acceptable defaults, then the defaults
> should be reduced by 1 so that the final size is "similar" to the
> current implementation.
> 
> If instead the defaults are left as-is, a machine with less than 64 CPUs
> will reserve 336KiB for printk information (128KiB text, 128KiB
> dictionary, 80KiB meta-data).
> 
> It might also be desirable to reduce the dictionary size (maybe 1/4 the
> size of text?). However, since the new printk implementation allows for
> non-intrusive dictionaries, we might see their usage increase and start
> to be as large as the messages themselves.
> 
> John Ogness
> 

Thanks for the detailed explanation.

Lianbo



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-05 Thread lijiang

On 2020-02-05 23:48, John Ogness wrote:
> On 2020-02-05, Sergey Senozhatsky  wrote:
>> 3BUG: KASAN: wild-memory-access in copy_data+0x129/0x220>
>> 3Write of size 4 at addr 5a5a5a5a5a5a5a5a by task cat/474>
> 
> The problem was due to an uninitialized pointer.
> 
> Very recently the ringbuffer API was expanded so that it could
> optionally count lines in a record. This made it possible for me to
> implement record_print_text_inline(), which can do all the kmsg_dump
> multi-line madness without requiring a temporary buffer. Rather than
> passing an extra argument around for the optional line count, I added
> the text_line_count pointer to the printk_record struct. And since line
> counting is rarely needed, it is only performed if text_line_count is
> non-NULL.
> 
> I overlooked that devkmsg_open() sets up a printk_record, so I did not
> think to add the extra NULL initialization of text_line_count. There
> should be an initializer function/macro to avoid this danger.
> 
Good find. Thanks for the quick fixup; it works well.
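A minimal sketch of the kind of initializer mentioned above, assuming a simplified stand-in struct. The struct and function names are hypothetical; only the field names follow the quick fixup quoted below.

```c
#include <stddef.h>

/* Hypothetical stand-in for the reader-side record. */
struct demo_record {
	char *text_buf;
	size_t text_buf_size;
	char *dict_buf;
	size_t dict_buf_size;
	unsigned int *text_line_count;	/* optional; NULL disables counting */
};

/*
 * Hypothetical initializer: every field not named here, including
 * text_line_count, is zeroed, so a caller cannot leave an optional
 * pointer uninitialized by accident.
 */
static void demo_record_init(struct demo_record *r,
			     char *text_buf, size_t text_buf_size,
			     char *dict_buf, size_t dict_buf_size)
{
	*r = (struct demo_record){
		.text_buf = text_buf,
		.text_buf_size = text_buf_size,
		.dict_buf = dict_buf,
		.dict_buf_size = dict_buf_size,
	};
}
```

Routing every setup site through one helper means that adding a new optional field later does not require touching each caller.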

Lianbo

> John Ogness
> 
> The quick fixup:
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index d0d24ee1d1f4..5ad67ff60cd9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>   user->record.text_buf_size = sizeof(user->text_buf);
>   user->record.dict_buf = &user->dict_buf[0];
>   user->record.dict_buf_size = sizeof(user->dict_buf);
> + user->record.text_line_count = NULL;
>  
>   logbuf_lock_irq();
>   user->seq = prb_first_seq(prb);
> 



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-05 Thread lijiang

> On 2020-02-05, Sergey Senozhatsky  wrote:
>>>>> So there is a General protection fault. That's the type of a
>>>>> problem that kills the boot for me as well (different backtrace,
>>>>> tho).
>>>>
>>>> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR)
>>>> enabled?
>>>
>>> Yes. These two options are enabled.
>>>
>>> CONFIG_RELOCATABLE=y
>>> CONFIG_RANDOMIZE_BASE=y
>>
>> So KASLR kills the boot for me. So does KASAN.
> 
> Sergey, thanks for looking into this already!
> 
>> John, do you see any of these problems on your test machine?
> 
> For x86 I have only been using qemu. (For hardware tests I use arm64-smp
> in order to verify memory barriers.) With qemu-x86_64 I am unable to
> reproduce the problem.
> 
> Lianbo, thanks for the report. Can you share your boot args? Anything
> special in there (like log_buf_len=, earlyprintk, etc)?
> 
Thanks for your response. Here is my kernel command line:

Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.0-rc7+ 
root=/dev/mapper/intel--wildcatpass--07-root ro crashkernel=512M 
resume=/dev/mapper/intel--wildcatpass--07-swap 
rd.lvm.lv=intel-wildcatpass-07/root rd.lvm.lv=intel-wildcatpass-07/swap 
console=ttyS0,115200n81

BTW: I attached the complete kernel log to my last reply; you can check
that attachment if needed.

> Also, could you share your CONFIG_LOG_* and CONFIG_PRINTK_* options?
> 
Sure, please see below.

[root@intel-wildcatpass-07 linux]# grep -nr "CONFIG_LOG_" .config 
134:CONFIG_LOG_BUF_SHIFT=20
135:CONFIG_LOG_CPU_MAX_BUF_SHIFT=12

[root@intel-wildcatpass-07 linux]# grep -nr "CONFIG_PRINTK_" .config 
136:CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
207:CONFIG_PRINTK_NMI=y
7758:CONFIG_PRINTK_TIME=y
7759:# CONFIG_PRINTK_CALLER is not set

Do you have any suggestions for the default sizes of the CONFIG_LOG_* and
CONFIG_PRINTK_* options?

Thanks.
Lianbo

> I will move to bare metal x86_64 and hopefully see it as well.
> 
> John
> 



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-05 Thread lijiang

> On (20/02/05 13:38), lijiang wrote:
>>> On (20/02/05 13:48), Sergey Senozhatsky wrote:
>>>> On (20/02/05 12:25), lijiang wrote:
> 
> [..]
> 
>>>>
>>>> So there is a General protection fault. That's the type of a problem that
>>>> kills the boot for me as well (different backtrace, tho).
>>>
>>> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?
>>>
>>
>> Yes. These two options are enabled.
>>
>> CONFIG_RELOCATABLE=y
>> CONFIG_RANDOMIZE_BASE=y
> 
> So KASLR kills the boot for me. So does KASAN.
> 
On my side, after adding the option 'nokaslr' to the kernel command line, I
still hit the previously mentioned problem, and the kernel eventually failed
to boot.

Thanks.

> John, do you see any of these problems on your test machine?
> 
>   -ss
> 



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-04 Thread lijiang



> On (20/02/05 13:48), Sergey Senozhatsky wrote:
>> On (20/02/05 12:25), lijiang wrote:
>> [..]
>>> [   42.111004] Kernel Offset: 0x1f00 from 0x8100 
>>> (relocation range: 0x8000-0xbfff)
>>> [   42.111005] general protection fault:  [#1] SMP PTI
>>> [   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 
>>> 5.5.0-rc7+ #4
>>> [   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
>>> SE5C610.86B.01.01.6024.071720181717 07/17/2018
>>> [   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
>>> [   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 
>>> 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 
>>> <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
>>> [   42.111007] RSP: 0018:bbe207a7bd80 EFLAGS: 00010002
>>> [   42.111007] RAX: a075d44ca000 RBX: 00a8 RCX: 
>>> fff000b0
>>> [   42.111008] RDX: 00a8 RSI: 0f01 RDI: 
>>> a1456e00
>>> [   42.111008] RBP: 0801364600307073 R08: 2000 R09: 
>>> 0801364600307073
>>> [   42.111008] R10: fff0 R11: 00a8 R12: 
>>> a1e98330
>>> [   42.111009] R13: d7efbe00 R14: 00a8 R15: 
>>> c000
>>> [   42.111009] FS:  7f7c5642a980() GS:a075df5c() 
>>> knlGS:
>>> [   42.111010] CS:  0010 DS:  ES:  CR0: 80050033
>>> [   42.111010] CR2: 7ffe95f4c4c0 CR3: 00084fbfc004 CR4: 
>>> 003606e0
>>> [   42.111011] DR0:  DR1:  DR2: 
>>> 
>>> [   42.111011] DR3:  DR6: fffe0ff0 DR7: 
>>> 0400
>>> [   42.111012] Call Trace:
>>> [   42.111012]  _prb_read_valid+0xd8/0x190
>>> [   42.111012]  prb_read_valid+0x15/0x20
>>> [   42.111013]  devkmsg_read+0x9d/0x2a0
>>> [   42.111013]  vfs_read+0x91/0x140
>>> [   42.111013]  ksys_read+0x59/0xd0
>>> [   42.111014]  do_syscall_64+0x55/0x1b0
>>> [   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [   42.111014] RIP: 0033:0x7f7c55740b62
>>> [   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 
>>> 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 
>>> <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
>>> [   42.111015] RSP: 002b:7ffe95f4c4a8 EFLAGS: 0246 ORIG_RAX: 
>>> 
>>> [   42.111016] RAX: ffda RBX: 7ffe95f4e500 RCX: 
>>> 7f7c55740b62
>>> [   42.111016] RDX: 2000 RSI: 7ffe95f4c4b0 RDI: 
>>> 0008
>>> [   42.111017] RBP:  R08: 0100 R09: 
>>> 0003
>>> [   42.111017] R10: 0100 R11: 0246 R12: 
>>> 7ffe95f4c4b0
>>
>> So there is a General protection fault. That's the type of a problem that
>> kills the boot for me as well (different backtrace, tho).
> 
> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?
> 

Yes. These two options are enabled.

CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y

Thanks.

>   -ss
> 



Re: [PATCH] makedumpfile/s390: Add get_kaslr_offset() for s390x

2019-12-30 Thread lijiang

> Hi Lianbo,
> 
>> -Original Message-
>> On 2019-12-26 11:38, lijiang wrote:
>>> Hi, Kazu and Mikhail,
>>>
>>>> Hi Mikhail,
>>>>
>>>>> -Original Message-
>>>>> Hi,
>>>>>
>>>>> On 12.12.2019 17:12, Kazuhito Hagio wrote:
>>>>>> Hi Mikhail,
>>>>>>
>>>>>>> -Original Message-
>>>>>>> Hello Kazu,
>>>>>>>
>>>>>>> I think we can try to generalize the kaslr offset extraction.
>>>>>>> I won't speak for other architectures, but for s390 that 
>>>>>>> get_kaslr_offset_arm64()
>>>>>>> should work fine. The only concern of mine is this TODO statement:
>>>>>>>
>>>>>>> if (_text <= vaddr && vaddr <= _end) {
>>>>>>> DEBUG_MSG("info->kaslr_offset: %lx\n", info->kaslr_offset);
>>>>>>> return info->kaslr_offset;
>>>>>>> } else {
>>>>>>> /*
>>>>>>> * TODO: we need to check if it is vmalloc/vmmemmap/module
>>>>>>> * address, we will have different offset
>>>>>>> */
>>>>>>> return 0;
>>>>>>> }
>>>>>>>
>>>>>>> Could you explain this one?
>>>>>>
>>>>>> Probably it was considered that the check would be needed to support
>>>>>> the whole KASLR behavior when get_kaslr_offset_x86_64() was written
>>>>>> originally.
>>>>>>
>>>>>> But in the current makedumpfile for x86_64 and arm64 supporting KASLR,
>>>>>> the offset we need is the one for symbol addresses in vmlinux only.
>>>>>> As I said below, module symbol addresses are retrieved from vmcore.
>>>>>> Other addresses should not be passed to the function for now, as far
>>>>>> as I know.
>>>>>>
>>>>>> So I think the TODO comment is confusing, and it would be better to
>>>>>> remove it or change it to something like:
>>>>>> /*
>>>>>>  * Returns 0 if vaddr does not need the offset to be added,
>>>>>>  * e.g. for module address.
>>>>>>  */
>>>>>>
>>>>>> But if s390 uses get_kaslr_offset() in its arch-specific code to
>>>>>> adjust addresses other than kernel text address, we might need to
>>>>>> modify it for s390, not generalize it.
>>>>>
>>>>> Currently, s390 doesn't use get_kaslr_offset() in its arch-specific
>>>>> code.
>>>>
>>>> OK, I pushed a patch that generalizes it to my test repository.
>>>> Could you enable s390 to use it and test?
>>>> https://github.com/k-hagio/makedumpfile/tree/add-get_kaslr_offset_general
>>>>
>>>
>>> I enabled it on s390 as below and tested, it worked.
> 
> Thank you for testing.
> 
It's my pleasure. 

>>>
>>> @@ -1075,7 +1075,7 @@ int is_iomem_phys_addr_s390x(unsigned long addr);
>>>  #define get_phys_base()stub_true()
>>>  #define get_machdep_info() get_machdep_info_s390x()
>>>  #define get_versiondep_info()  stub_true()
>>> -#define get_kaslr_offset(X)stub_false()
>>> +#define get_kaslr_offset(X)get_kaslr_offset_general(X)
>>>  #define vaddr_to_paddr(X)  vaddr_to_paddr_s390x(X)
>>>
>>> But, there is still a problem that needs to be improved. In the 
>>> find_kaslr_offsets(),
>>> the value of SYMBOL(_stext) is always 0(zero) and it is passed to the 
>>> get_kaslr_offset().
>>> For the following code in the get_kaslr_offset_general(), it does not work 
>>> as expected.
>>> ...
>>> if (_text <= vaddr && vaddr <= _end)
>>> return info->kaslr_offset;
>>> else
>>> return 0;
> 
> I don't know why the SYMBOL(_stext) is passed to the get_kaslr_offset() 
> there, but
> since the return value of get_kaslr_offset() is not used in the 
> find_kaslr_offsets(),
> it's meaningless and not harmful. So it is not worth doing 
> READ_SYMBOL(_stext) there
> for now.
> 
Sounds good.

>>
>> In addition, the above code confused me, it will always return 0 on 
>> s390(please refer to my log

Re: [PATCH] makedumpfile/s390: Add get_kaslr_offset() for s390x

2019-12-25 Thread lijiang

On 2019-12-26 11:38, lijiang wrote:
> Hi, Kazu and Mikhail,
> 
>> Hi Mikhail,
>>
>>> -Original Message-
>>> Hi,
>>>
>>> On 12.12.2019 17:12, Kazuhito Hagio wrote:
>>>> Hi Mikhail,
>>>>
>>>>> -Original Message-
>>>>> Hello Kazu,
>>>>>
>>>>> I think we can try to generalize the kaslr offset extraction.
>>>>> I won't speak for other architectures, but for s390 that 
>>>>> get_kaslr_offset_arm64()
>>>>> should work fine. The only concern of mine is this TODO statement:
>>>>>
>>>>> if (_text <= vaddr && vaddr <= _end) {
>>>>>   DEBUG_MSG("info->kaslr_offset: %lx\n", info->kaslr_offset);
>>>>>   return info->kaslr_offset;
>>>>>   } else {
>>>>>   /*
>>>>>   * TODO: we need to check if it is vmalloc/vmmemmap/module
>>>>>   * address, we will have different offset
>>>>>   */
>>>>>   return 0;
>>>>> }
>>>>>
>>>>> Could you explain this one?
>>>>
>>>> Probably it was considered that the check would be needed to support
>>>> the whole KASLR behavior when get_kaslr_offset_x86_64() was written
>>>> originally.
>>>>
>>>> But in the current makedumpfile for x86_64 and arm64 supporting KASLR,
>>>> the offset we need is the one for symbol addresses in vmlinux only.
>>>> As I said below, module symbol addresses are retrieved from vmcore.
>>>> Other addresses should not be passed to the function for now, as far
>>>> as I know.
>>>>
>>>> So I think the TODO comment is confusing, and it would be better to
>>>> remove it or change it to something like:
>>>> /*
>>>>  * Returns 0 if vaddr does not need the offset to be added,
>>>>  * e.g. for module address.
>>>>  */
>>>>
>>>> But if s390 uses get_kaslr_offset() in its arch-specific code to
>>>> adjust addresses other than kernel text address, we might need to
>>>> modify it for s390, not generalize it.
>>>
>>> Currently, s390 doesn't use get_kaslr_offset() in its arch-specific
>>> code.
>>
>> OK, I pushed a patch that generalizes it to my test repository.
>> Could you enable s390 to use it and test?
>> https://github.com/k-hagio/makedumpfile/tree/add-get_kaslr_offset_general
>>
> 
> I enabled it on s390 as below and tested, it worked.
> 
> @@ -1075,7 +1075,7 @@ int is_iomem_phys_addr_s390x(unsigned long addr);
>  #define get_phys_base()stub_true()
>  #define get_machdep_info() get_machdep_info_s390x()
>  #define get_versiondep_info()  stub_true()
> -#define get_kaslr_offset(X)stub_false()
> +#define get_kaslr_offset(X)get_kaslr_offset_general(X)
>  #define vaddr_to_paddr(X)  vaddr_to_paddr_s390x(X)
> 
> But, there is still a problem that needs to be improved. In the 
> find_kaslr_offsets(),
> the value of SYMBOL(_stext) is always 0(zero) and it is passed to the 
> get_kaslr_offset().
> For the following code in the get_kaslr_offset_general(), it does not work as 
> expected.
> ...
>   if (_text <= vaddr && vaddr <= _end)
>   return info->kaslr_offset;
>   else
>   return 0;

In addition, the above code confuses me: it will always return 0 on s390
(please refer to my logs).
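The behaviour in the logs can be reproduced with a stand-alone copy of the range check. This is only a sketch: the function name is hypothetical, and _text is assumed to be 0x100000 because the logged value looks truncated. The kaslr_offset, _end and vaddr values are taken from the logs.

```c
#include <stdint.h>

/*
 * Hypothetical stand-alone copy of the check in
 * get_kaslr_offset_general(): the offset is returned only when vaddr
 * lies inside [_text, _end]; otherwise 0 is returned.
 */
static uint64_t demo_kaslr_offset(uint64_t kaslr_offset,
				  uint64_t text, uint64_t end,
				  uint64_t vaddr)
{
	if (text <= vaddr && vaddr <= end)
		return kaslr_offset;
	return 0;
}
```

With kaslr_offset 0x67ebc000 and _end 0x10ba000, both logged vaddr values (0 and 0x67fbc000) fall outside the range, so the check returns 0 in either case; only an address inside the range, such as 0x100000, yields the offset.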

Thanks.

> ...
> Here is my log:
> get_kaslr_offset_general: info->kaslr_offset: 67ebc000, _text:10, 
> _end:10ba000, vaddr:0
> 
> After applied the following patch, got the expected result.
>  int
>  find_kaslr_offsets()
>  {
> @@ -3973,6 +4042,11 @@ find_kaslr_offsets()
>  * called this function between open_vmcoreinfo() and
>  * close_vmcoreinfo()
>  */
> +   READ_SYMBOL("_stext", _stext);
> +   if (SYMBOL(_stext) == NOT_FOUND_SYMBOL) {
> +ERRMSG("Can't get the symbol of _stext.\n");
> +goto out;
> +   }
> get_kaslr_offset(SYMBOL(_stext));
> 
> Here is my log:
> get_kaslr_offset_general: info->kaslr_offset: 67ebc000, _text:10, 
> _end:10ba000, vaddr:67fbc000
> 
> Basically, before using the value of SYMBOL(_stext), need to ensure that the 
> SYMBOL(_stext) is parsed
> correctly.
> 
> Thanks.
> 
>> Thanks,
>> Kazu
>>
>>>
>>>>
>>>> Thanks,
>>>> Kazu
>>>>
>>>>&

Re: [PATCH] makedumpfile/s390: Add get_kaslr_offset() for s390x

2019-12-25 Thread lijiang

Hi, Kazu and Mikhail,

> Hi Mikhail,
> 
>> -Original Message-
>> Hi,
>>
>> On 12.12.2019 17:12, Kazuhito Hagio wrote:
>>> Hi Mikhail,
>>>
>>>> -Original Message-
>>>> Hello Kazu,
>>>>
>>>> I think we can try to generalize the kaslr offset extraction.
>>>> I won't speak for other architectures, but for s390 that get_kaslr_offset_arm64()
>>>> should work fine. The only concern of mine is this TODO statement:
>>>>
>>>> if (_text <= vaddr && vaddr <= _end) {
>>>>     DEBUG_MSG("info->kaslr_offset: %lx\n", info->kaslr_offset);
>>>>     return info->kaslr_offset;
>>>> } else {
>>>>     /*
>>>>      * TODO: we need to check if it is vmalloc/vmmemmap/module
>>>>      * address, we will have different offset
>>>>      */
>>>>     return 0;
>>>> }
>>>>
>>>> Could you explain this one?
>>>
>>> Probably it was considered that the check would be needed to support
>>> the whole KASLR behavior when get_kaslr_offset_x86_64() was written
>>> originally.
>>>
>>> But in the current makedumpfile for x86_64 and arm64 supporting KASLR,
>>> the offset we need is the one for symbol addresses in vmlinux only.
>>> As I said below, module symbol addresses are retrieved from vmcore.
>>> Other addresses should not be passed to the function for now, as far
>>> as I know.
>>>
>>> So I think the TODO comment is confusing, and it would be better to
>>> remove it or change it to something like:
>>> /*
>>>  * Returns 0 if vaddr does not need the offset to be added,
>>>  * e.g. for module address.
>>>  */
>>>
>>> But if s390 uses get_kaslr_offset() in its arch-specific code to
>>> adjust addresses other than kernel text address, we might need to
>>> modify it for s390, not generalize it.
>>
>> Currently, s390 doesn't use get_kaslr_offset() in its arch-specific
>> code.
> 
> OK, I pushed a patch that generalizes it to my test repository.
> Could you enable s390 to use it and test?
> https://github.com/k-hagio/makedumpfile/tree/add-get_kaslr_offset_general
> 

I enabled it on s390 as shown below and tested it; it worked.

@@ -1075,7 +1075,7 @@ int is_iomem_phys_addr_s390x(unsigned long addr);
 #define get_phys_base()stub_true()
 #define get_machdep_info() get_machdep_info_s390x()
 #define get_versiondep_info()  stub_true()
-#define get_kaslr_offset(X)stub_false()
+#define get_kaslr_offset(X)get_kaslr_offset_general(X)
 #define vaddr_to_paddr(X)  vaddr_to_paddr_s390x(X)

However, there is still a problem that needs to be fixed. In
find_kaslr_offsets(), the value of SYMBOL(_stext) is always 0 (zero) when it
is passed to get_kaslr_offset(), so the following code in
get_kaslr_offset_general() does not work as expected.
...
if (_text <= vaddr && vaddr <= _end)
return info->kaslr_offset;
else
return 0;
...
Here is my log:
get_kaslr_offset_general: info->kaslr_offset: 67ebc000, _text:10, 
_end:10ba000, vaddr:0

After applying the following patch, I got the expected result.
 int
 find_kaslr_offsets()
 {
@@ -3973,6 +4042,11 @@ find_kaslr_offsets()
 * called this function between open_vmcoreinfo() and
 * close_vmcoreinfo()
 */
+   READ_SYMBOL("_stext", _stext);
+   if (SYMBOL(_stext) == NOT_FOUND_SYMBOL) {
+ERRMSG("Can't get the symbol of _stext.\n");
+goto out;
+   }
get_kaslr_offset(SYMBOL(_stext));

Here is my log:
get_kaslr_offset_general: info->kaslr_offset: 67ebc000, _text:10, 
_end:10ba000, vaddr:67fbc000

Basically, before using the value of SYMBOL(_stext), we need to ensure that
SYMBOL(_stext) has been parsed correctly.

Thanks.

> Thanks,
> Kazu
> 
>>
>>>
>>> Thanks,
>>> Kazu
>>>

>>>> Thanks,
>>>> Mikhail
>>>>
>>>> On 09.12.2019 23:02, Kazuhito Hagio wrote:
>>>>> Hi Mikhail,
>>>>>
>>>>> Sorry for late reply.
>>>>>
>>>>>> -Original Message-
>>>>>> Since kernel v5.2 KASLR is supported on s390. In makedumpfile however no
>>>>>> support has been added yet. This patch adds the arch specific function
>>>>>> get_kaslr_offset() for s390x.
>>>>>> Since the values in vmcoreinfo are already relocated, the patch is
>>>>>> mainly relevant for vmlinux processing (-x option).
>>>>>
>>>>> In the current implementation of makedumpfile, the get_kaslr_offset(vaddr)
>>>>> is supposed to return the KASLR offset only when the offset is needed to
>>>>> add to the vaddr.  So generally symbols from kernel (vmlinux) need it, but
>>>>> symbols from modules are resolved dynamically and don't need the offset.
>>>>>
>>>>> This patch always returns the offset if any, as a result, I guess this patch
>>>>> will not work as expected with module symbols in filter config file.
>>>>>
>>>>> So... How about making get_kaslr_offset_arm64() general for other archs
>>>>> (get_kaslr_offset_general() or something), then using it also for s390?
>>>>> If OK, I can do that generalization.
>>>>>
>>>>> Thanks,
>>>>> Kazu
>>>>>
>>>>>>
>>>>>> Signed-off-by: Philipp Rudo 
>>>>>> Signed-off-by:

Re: [PATCH 3/3 v9] kexec: Fix i386 build warnings that missed declaration of struct kimage

2019-11-14 Thread lijiang

On 2019-11-14 22:43, Borislav Petkov wrote:
> On Thu, Nov 14, 2019 at 10:20:42PM +0800, lijiang wrote:
>> I really saw my building result, but kbuild reported the following messages:
>>
>> vim +5 arch/x86/include/asm/crash.h
>>
>> dd5f726076cc76 Vivek Goyal 2014-08-08   4  
>> dd5f726076cc76 Vivek Goyal 2014-08-08  @5  int crash_load_segments(struct 
>> kimage *image);
>> dd5f726076cc76 Vivek Goyal 2014-08-08   6  int 
>> crash_copy_backup_region(struct kimage *image);
>> dd5f726076cc76 Vivek Goyal 2014-08-08   7  int 
>> crash_setup_memmap_entries(struct kimage *image,
>> dd5f726076cc76 Vivek Goyal 2014-08-08   8struct boot_params 
>> *params);
>> 89f579ce99f7e0 Yi Wang 2018-11-22   9  void crash_smp_send_stop(void);
>> dd5f726076cc76 Vivek Goyal 2014-08-08  10  
>>
>> :: The code at line 5 was first introduced by commit 
>>^
>> :: dd5f726076cc7639d9713b334c8c133f77c6757a kexec: support for kexec on 
>> panic using new system call
>>
>> 
> 
> You should not take the report of a bot blindly but should always double
> check it. Like every other computer system programmed by humans, it can
> make mistakes.
> 

Indeed, I totally agree.

>> Would you mind giving me any suggestions about this?
> 
> I'll take care of it all and push the results out soon.
> 

OK, thank you so much.

Lianbo


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 3/3 v9] kexec: Fix i386 build warnings that missed declaration of struct kimage

2019-11-14 Thread lijiang

在 2019年11月14日 20:39, Borislav Petkov 写道:
> On Fri, Nov 08, 2019 at 05:00:27PM +0800, Lianbo Jiang wrote:
>> Kbuild test robot reported some build warnings as follows:
>>
>> arch/x86/include/asm/crash.h:5:32: warning: 'struct kimage' declared
>> inside parameter list will not be visible outside of this definition
>> or declaration
>> int crash_load_segments(struct kimage *image);
>>^~
>> int crash_copy_backup_region(struct kimage *image);
>> ^~
>> int crash_setup_memmap_entries(struct kimage *image,
>>   ^~
>> The 'struct kimage' is defined in the header file include/linux/kexec.h,
>> before using it, need to include its header file or make a declaration.
>> Otherwise the above warnings may be triggered.
>>
>> Add a declaration of struct kimage to the file arch/x86/include/asm/
>> crash.h, that will solve these compile warnings.
>>
>> Fixes: dd5f726076cc ("kexec: support for kexec on panic using new system call")
> 
> This is, of course, wrong. Your *first* patch is introducing those
> warnings and I'm wondering how did you not see them during building?
> 

I did check my build result, but kbuild reported the following messages:

vim +5 arch/x86/include/asm/crash.h

dd5f726076cc76 Vivek Goyal 2014-08-08   4
dd5f726076cc76 Vivek Goyal 2014-08-08  @5  int crash_load_segments(struct kimage *image);
dd5f726076cc76 Vivek Goyal 2014-08-08   6  int crash_copy_backup_region(struct kimage *image);
dd5f726076cc76 Vivek Goyal 2014-08-08   7  int crash_setup_memmap_entries(struct kimage *image,
dd5f726076cc76 Vivek Goyal 2014-08-08   8                                 struct boot_params *params);
89f579ce99f7e0 Yi Wang     2018-11-22   9  void crash_smp_send_stop(void);
dd5f726076cc76 Vivek Goyal 2014-08-08  10

:: The code at line 5 was first introduced by commit
:: dd5f726076cc7639d9713b334c8c133f77c6757a kexec: support for kexec on panic using new system call
   


Would you mind giving me any suggestions about this?

> In file included from arch/x86/realmode/init.c:11:
> ./arch/x86/include/asm/crash.h:5:32: warning: ‘struct kimage’ declared inside parameter list will not be visible outside of this definition or declaration
> 5 | int crash_load_segments(struct kimage *image);
>   |^~
> ./arch/x86/include/asm/crash.h:6:37: warning: ‘struct kimage’ declared inside parameter list will not be visible outside of this definition or declaration
> 6 | int crash_copy_backup_region(struct kimage *image);
>   | ^~
> ./arch/x86/include/asm/crash.h:7:39: warning: ‘struct kimage’ declared inside parameter list will not be visible outside of this definition or declaration
> 7 | int crash_setup_memmap_entries(struct kimage *image,
>   |
> 
> 
> And that happens because you've included asm/crash.h in
> arch/x86/realmode/init.c and it of course complains because it hasn't
> seen that struct yet.
> 

Exactly. Last time, I fixed the warnings in my first patch; please refer to
patch v8 (resend).

Link: https://lkml.kernel.org/r/20191031033517.11282-2-liji...@redhat.com
  - [PATCH 1/2 RESEND v8] x86/kdump: always reserve the low 1M when the crashkernel option is specified


And kbuild asked for a Reported-by tag to be added; please refer to the
following link.

Link: https://lkml.kernel.org/r/201910310233.ejrttmwp%25...@intel.com

> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 

Any idea about this? Any suggestions will be appreciated.

Thanks.
Lianbo



Re: [PATCH 1/2 RESEND v8] x86/kdump: always reserve the low 1M when the crashkernel option is specified

2019-11-01 Thread lijiang

On 2019-10-31 18:47, Borislav Petkov wrote:
> On Thu, Oct 31, 2019 at 05:40:35PM +0800, lijiang wrote:
>> Maybe it should be a separate patch to fix the old compile warnings, as
>> follows, and I should put the patch into this series.
> 
> Yes, maybe.
> 
>> commit d2091d1f4f67f1c38293b0e93fdbfefa766940cf (HEAD -> master)
>> Author: Lianbo Jiang 
>> Date:   Thu Oct 31 15:48:02 2019 +0800
>>
>> kexec: Fix i386 build warnings that missed declaration of struct kimage
>> 
>> Kbuild test robot reported some build warnings, please refer to the
>> Link below for details.
> 
> Explain here what the warnings are, why they trigger and how you're
> fixing it. How a commit message should look like is also explained in
> that document I pointed you at.
> 

OK, the 'what-why-how' structure looks better. I will improve the above log.

> Referring to some link is not what we do in commit messages.
> 
>> Add a declaration of struct kimage to fix these compile warnings.
>> 
>> Fixes: dd5f726076cc ("kexec: support for kexec on panic using new system call")
>> Reported-by: kbuild test robot 
>> Signed-off-by: Lianbo Jiang 
>> Link: https://lkml.org/lkml/2019/10/30/833
> 
> *NEVER* use lkml.org or any other external URL for referring to mail
> threads but *always* use our own
> 
> lkml.kernel.org/r/
> 
> redirector. See other tip commits for an example.
> 

It's useful to me. Thanks.

>>> You can read
>>>
>>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>>>
>>> in the meantime, especially section
>>>
>>> "9) Don't get discouraged - or impatient"
>>>
>>> while waiting.
>>
>> OK. Thanks.
> 
> And make sure to read that whole document and also have a look at the
> process document
> 

I will read the above document carefully. But some of the rules in the document
are still easy to forget; I may need to practice them repeatedly.

> https://www.kernel.org/doc/html/latest/process/index.html
> 
> so that you can avoid such mistakes in the future.
> 

Good suggestions. Thank you so much.

Lianbo

> Thx.
> 



Re: [PATCH 1/2 RESEND v8] x86/kdump: always reserve the low 1M when the crashkernel option is specified

2019-10-31 Thread lijiang

On 2019-10-31 15:13, Borislav Petkov wrote:
> Please do not merge a 0day bot fix with another patch of yours which
> does not cause it in the first place. When you look at this patch alone,
> what do you think the Reported-by tag means, if anything at all?
> 
Thanks for your suggestions.

Maybe it should be a separate patch to fix the old compile warnings, as
follows, and I should put the patch into this series.


commit d2091d1f4f67f1c38293b0e93fdbfefa766940cf (HEAD -> master)
Author: Lianbo Jiang 
Date:   Thu Oct 31 15:48:02 2019 +0800

kexec: Fix i386 build warnings that missed declaration of struct kimage

Kbuild test robot reported some build warnings, please refer to the
Link below for details.

Add a declaration of struct kimage to fix these compile warnings.

Fixes: dd5f726076cc ("kexec: support for kexec on panic using new system call")
Reported-by: kbuild test robot 
Signed-off-by: Lianbo Jiang 
Link: https://lkml.org/lkml/2019/10/30/833

diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
index 0acf5ee45a21..ef5638f641f2 100644
--- a/arch/x86/include/asm/crash.h
+++ b/arch/x86/include/asm/crash.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_CRASH_H
 #define _ASM_X86_CRASH_H
 
+struct kimage;
+
 int crash_load_segments(struct kimage *image);
 int crash_copy_backup_region(struct kimage *image);
 int crash_setup_memmap_entries(struct kimage *image,
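
For reference, a minimal standalone sketch (not kernel code) of why the
forward declaration is enough. Without a file-scope declaration, a
'struct kimage' that first appears inside a prototype's parameter list
is scoped to that prototype only, which is what the warning complains
about. The demo_* name and the struct body below are hypothetical
stand-ins; the real definition lives in include/linux/kexec.h.

```c
#include <stddef.h>

/* Forward declaration: makes the incomplete type visible at file scope. */
struct kimage;

/* Prototypes may now take pointers to the incomplete type without a warning. */
int demo_load_segments(struct kimage *image);

/* Hypothetical definition, standing in for the real one in kexec.h. */
struct kimage {
	unsigned long nr_segments;
};

int demo_load_segments(struct kimage *image)
{
	/* Return -1 for a missing image, otherwise the segment count. */
	return image ? (int)image->nr_segments : -1;
}
```

Only pointers to the type are used in the header, so the full
definition is not required there.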

> Also, it is not a "RESEND" if you change them. You can call them v8.1 or
> whatever to denote that the change is small.
> 
Thanks for your explanation in detail.

> Also, do not send v9 or v8.1 or whatever, immediately but wait for other
> reviews.

OK. Let's wait a week or more.

> You have sent these patches 4(!) times in this week alone. How
> would you feel if I hammer your inbox with patches on a daily basis?
>Probably because the change is small.

Anyway, I am sorry; that was inconsiderate of me.

> You can read
> 
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
> 
> in the meantime, especially section
> 
> "9) Don't get discouraged - or impatient"
> 
> while waiting.

OK. Thanks.

Lianbo



Re: [PATCH 1/2 v7] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-30 Thread lijiang

On 2019-10-31 02:25, kbuild test robot wrote:
> Hi Lianbo,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on linus/master]
> [also build test WARNING on v5.4-rc5 next-20191030]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see 
> https://stackoverflow.com/a/37406982]
> 
> url:
> https://github.com/0day-ci/linux/commits/Lianbo-Jiang/x86-kdump-Fix-kmem-s-reported-an-invalid-freepointer-when-SME-was-active/20191031-001903
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> 32e72ec0613e164ce9608d865396fb2da278
> config: i386-defconfig (attached as .config)
> compiler: gcc-7 (Debian 7.4.0-14) 7.4.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
> 
> All warnings (new ones prefixed by >>):
> 
>In file included from arch/x86/realmode/init.c:11:0:
>>> arch/x86/include/asm/crash.h:5:32: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
> int crash_load_segments(struct kimage *image);
>^~
>arch/x86/include/asm/crash.h:6:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
> int crash_copy_backup_region(struct kimage *image);
> ^~
>arch/x86/include/asm/crash.h:7:39: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
> int crash_setup_memmap_entries(struct kimage *image,
>   ^~
> 
Hi,

The above warnings will still occur without my patches.

But I will fix the warnings in my patch series, and resend v8 later.

Thanks.

Lianbo

> vim +5 arch/x86/include/asm/crash.h
> 
> dd5f726076cc76 Vivek Goyal 2014-08-08   4
> dd5f726076cc76 Vivek Goyal 2014-08-08  @5  int crash_load_segments(struct kimage *image);
> dd5f726076cc76 Vivek Goyal 2014-08-08   6  int crash_copy_backup_region(struct kimage *image);
> dd5f726076cc76 Vivek Goyal 2014-08-08   7  int crash_setup_memmap_entries(struct kimage *image,
> dd5f726076cc76 Vivek Goyal 2014-08-08   8                                 struct boot_params *params);
> 89f579ce99f7e0 Yi Wang     2018-11-22   9  void crash_smp_send_stop(void);
> dd5f726076cc76 Vivek Goyal 2014-08-08  10
>
> :: The code at line 5 was first introduced by commit
> :: dd5f726076cc7639d9713b334c8c133f77c6757a kexec: support for kexec on panic using new system call
> 

Exactly.

> :: TO: Vivek Goyal 
> :: CC: Linus Torvalds 
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 



Re: [PATCH 0/2 v8] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-10-30 Thread lijiang

Hi, 

Please ignore this patch series because compile warnings were reported by
kbuild.

I will resend v8 later after the warnings are fixed.

Sorry for this.

Thanks.
Lianbo

On 2019-10-30 11:54, Lianbo Jiang wrote:
> In purgatory(), the main things are as below:
> 
> [1] verify sha256 hashes for various segments.
> Lets keep these codes, and do not touch the logic.
> 
> [2] copy the first 640k content to a backup region.
> Lets safely remove it and clean all code related to backup region.
> 
> This patch series will remove the backup region, because the current
> handling of copying the first 640k runs into problems when SME is
> active(https://bugzilla.kernel.org/show_bug.cgi?id=204793).
> 
> The low 1M region will always be reserved when the crashkernel kernel
> command line option is specified. And this way makes it unnecessary to
> do anything with the low 1M region, because the memory allocated later
> won't fall into the low 1M area.
> 
> This series includes two patches:
> [1] x86/kdump: always reserve the low 1M when the crashkernel option
> is specified
> The low 1M region will always be reserved when the crashkernel
> kernel command line option is specified, which ensures that the
> memory allocated later won't fall into the low 1M area.
> 
> [2] x86/kdump: clean up all the code related to the backup region
> Remove the backup region and clean up.
> 
> Changes since v1:
> [1] Add extra checking condition: when the crashkernel option is
> specified, reserve the low 640k area.
> 
> Changes since v2:
> [1] Reserve the low 1M region when the crashkernel option is only
> specified.(Suggested by Eric)
> 
> [2] Remove the unused crash_copy_backup_region()
> 
> [3] Remove the backup region and clean up
> 
> [4] Split them into three patches
> 
> Changes since v3:
> [1] Improve the first patch's log
> 
> [2] Improve the third patch based on Eric's suggestions
> 
> Changes since v4:
> [1] Correct some typos, and also improve the first patch's log
> 
> [2] Add a new function kexec_reserve_low_1MiB() in kernel/kexec_core.c
> and which is called by reserve_real_mode(). (Suggested by Boris)
> 
> Changes since v5:
> [1] Call the cmdline_find_option() instead of strstr() to check the
> crashkernel option. (Suggested by Hatayama)
> 
> [2] Add a weak function kexec_reserve_low_1MiB() in kernel/kexec_core.c,
> and implement the kexec_reserve_low_1MiB() in arch/x86/kernel/
> machine_kexec_64.c so that it does not cause the compile error
> on non-x86 kernel, and also ensures that it can work well on x86
> kernel.
> 
> Changes since v6:
> [1] Move the kexec_reserve_low_1MiB() to arch/x86/kernel/crash.c and
> also move its declaration function to arch/x86/include/asm/crash.h
> (Suggested by Dave Young)
> 
> [2] Adjust the corresponding header files.
> 
> Changes since v7:
> [1] Change the function name from kexec_reserve_low_1MiB() to
> crash_reserve_low_1M().
> 
> Lianbo Jiang (2):
>   x86/kdump: always reserve the low 1M when the crashkernel option is
> specified
>   x86/kdump: clean up all the code related to the backup region
> 
>  arch/x86/include/asm/crash.h   |   6 ++
>  arch/x86/include/asm/kexec.h   |  10 ---
>  arch/x86/include/asm/purgatory.h   |  10 ---
>  arch/x86/kernel/crash.c| 102 -
>  arch/x86/kernel/machine_kexec_64.c |  47 -
>  arch/x86/purgatory/purgatory.c |  19 --
>  arch/x86/realmode/init.c   |   2 +
>  7 files changed, 34 insertions(+), 162 deletions(-)
> 



Re: [PATCH 1/2 v7] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-29 Thread lijiang

On 2019-10-29 13:28, Baoquan He wrote:
> On 10/29/19 at 10:10am, Lianbo Jiang wrote:
>> Kdump kernel will reuse the first 640k region because the real mode
>> trampoline has to work in this area. When the vmcore is dumped, the
>> old memory in this area may be accessed, therefore, kernel has to
>> copy the contents of the first 640k area to a backup region so that
>> kdump kernel can read the old memory from the backup area of the
>> first 640k area, which is done in the purgatory().
>>
>> But, the current handling of copying the first 640k area runs into
>> problems when SME is enabled, kernel does not properly copy these
>> old memory to the backup area in the purgatory(), thereby, kdump
>> kernel reads out the encrypted contents, because the kdump kernel
>> must access the first kernel's memory with the encryption bit set
>> when SME is enabled in the first kernel. Please refer to this link:
>>
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>
>> Finally, it causes the following errors, and the crash tool gets
>> invalid pointers when parsing the vmcore.
>>
>> crash> kmem -s|grep -i invalid
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> crash>
>>
>> To avoid the above errors, when the crashkernel option is specified,
>> lets reserve the remaining low 1MiB memory(after reserving real mode
>> memory) so that the allocated memory does not fall into the low 1MiB
>> area, which makes us not to copy the first 640k content to a backup
>> region in purgatory(). This indicates that it does not need to be
>> included in crash dumps or used for anything except the processor
>> trampolines that must live in the low 1MiB.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  arch/x86/include/asm/crash.h |  6 ++
>>  arch/x86/kernel/crash.c  | 15 +++
>>  arch/x86/realmode/init.c |  2 ++
>>  3 files changed, 23 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
>> index 0acf5ee45a21..3e966a3dc823 100644
>> --- a/arch/x86/include/asm/crash.h
>> +++ b/arch/x86/include/asm/crash.h
>> @@ -8,4 +8,10 @@ int crash_setup_memmap_entries(struct kimage *image,
>>  struct boot_params *params);
>>  void crash_smp_send_stop(void);
>>  
>> +#ifdef CONFIG_KEXEC_CORE
>> +void __init kexec_reserve_low_1MiB(void);
>> +#else
>> +static inline void __init kexec_reserve_low_1MiB(void) { }
>> +#endif
>> +
>>  #endif /* _ASM_X86_CRASH_H */
>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> index eb651fbde92a..144f519aef29 100644
>> --- a/arch/x86/kernel/crash.c
>> +++ b/arch/x86/kernel/crash.c
>> @@ -24,6 +24,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -39,6 +40,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  /* Used while preparing memory map entries for second kernel */
>>  struct crash_memmap_data {
>> @@ -68,6 +70,19 @@ static inline void cpu_crash_vmclear_loaded_vmcss(void)
>>  rcu_read_unlock();
>>  }
>>  
>> +/*
>> + * When the crashkernel option is specified, only use the low
>> + * 1MiB for the real mode trampoline.
>> + */
>> +void __init kexec_reserve_low_1MiB(void)
> 
> Thanks for the effort, Lianbo. I believe everyone is confident with this
> solution and fix.
> 
> I have a tiny concern, why the function name is
> kexec_reserve_low_1MiB(), but not kexec_reserve_low_1M()?

Thanks for your comment, Baoquan.

It just means that the kernel reserves 1 MiB (1M bytes) of memory; the suffix
has no special meaning.

Would you mind if I change it to crash_reserve_low_1M()?

void __init crash_reserve_low_1M(void)
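
To make the sizes concrete, here is a small standalone sketch. The
DEMO_* macros and the helper are hypothetical names used only for
illustration; the patch itself simply passes 1<<20 to memblock_reserve().
The first 640k lies entirely inside the low 1 MiB, so reserving the
whole 1 MiB covers it.

```c
#define DEMO_SZ_1M	(1UL << 20)	/* 1 MiB = 1048576 bytes */
#define DEMO_SZ_640K	(640UL * 1024)	/* 655360 bytes */

/* Return 1 if the range [start, start + size) touches the low 1 MiB, else 0. */
int demo_overlaps_low_1M(unsigned long start, unsigned long size)
{
	return size != 0 && start < DEMO_SZ_1M;
}
```

An allocation that starts at or above 1 MiB never overlaps the reserved
area, which is the property the patch relies on.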

Thanks.
Lianbo

> I searched in kernel code with below filter, didn't see MiB appearing in
> a function name. I am not sure about it either, just ask.
> 
> git grep "_[1-9]*M " arch/ kernel/ mm include/ drivers/ net/ init fs crypto/ certs/ ipc lib
> 
> Thanks
> Baoquan
> 
>> +{
>> +if (cmdline_find_option(boot_command_line, "crashkernel",
>> +NULL, 0) > 0) {
>> +memblock_reserve(0, 1<<20);
>> +pr_info("Reserving the low 1MiB of memory for crashkernel\n");
>> +}
>> +}
>> +
>>  #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
>>  
>>  static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
>> index 7dce39c8c034..b8bbd0017ca8 100644
>> --- a/arch/x86/realmode/init.c
>> +++ b/arch/x86/realmode/init.c
>> @@ -8,6 +8,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  struct real_mode_header *real_mode_header;
>>  u32 *trampoline_cr4_features;
>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
>>  
>>  memblock_reserve(mem, size);
>>  set_real_mode_mem(mem);
>> +kexec_reserve_low_1MiB();
>>  }
>>  
>>  static void __init setup_real_mode(void)

Re: [PATCH 1/2 v6] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-27 Thread lijiang

On 2019-10-28 11:19, Dave Young wrote:
> On 10/28/19 at 10:45am, Lianbo Jiang wrote:
>> Kdump kernel will reuse the first 640k region because the real mode
>> trampoline has to work in this area. When the vmcore is dumped, the
>> old memory in this area may be accessed, therefore, kernel has to
>> copy the contents of the first 640k area to a backup region so that
>> kdump kernel can read the old memory from the backup area of the
>> first 640k area, which is done in the purgatory().
>>
>> But, the current handling of copying the first 640k area runs into
>> problems when SME is enabled, kernel does not properly copy these
>> old memory to the backup area in the purgatory(), thereby, kdump
>> kernel reads out the encrypted contents, because the kdump kernel
>> must access the first kernel's memory with the encryption bit set
>> when SME is enabled in the first kernel. Please refer to this link:
>>
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>
>> Finally, it causes the following errors, and the crash tool gets
>> invalid pointers when parsing the vmcore.
>>
>> crash> kmem -s|grep -i invalid
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> crash>
>>
>> To avoid the above errors, when the crashkernel option is specified,
>> lets reserve the remaining low 1MiB memory(after reserving real mode
>> memory) so that the allocated memory does not fall into the low 1MiB
>> area, which makes us not to copy the first 640k content to a backup
>> region in purgatory(). This indicates that it does not need to be
>> included in crash dumps or used for anything except the processor
>> trampolines that must live in the low 1MiB.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> BTW:I also tried to fix the above problem in purgatory(), but there
>> are too many restricts in purgatory() context, for example: i can't
>> allocate new memory to create the identity mapping page table for
>> SME situation.
>>
>> Currently, there are two places where the first 640k area is needed,
>> the first one is in the find_trampoline_placement(), another one is
>> in the reserve_real_mode(), and their content doesn't matter.
>>
>> In addition, also need to clean all the code related to the backup
>> region later.
>>
>>  arch/x86/kernel/machine_kexec_64.c | 15 +++
>>  arch/x86/realmode/init.c   |  2 ++
>>  include/linux/kexec.h  |  2 ++
>>  kernel/kexec_core.c|  3 +++
>>  4 files changed, 22 insertions(+)
>>
>> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> index 5dcd438ad8f2..42d7c15c45f1 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -17,6 +17,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -27,6 +28,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #ifdef CONFIG_ACPI
>>  /*
>> @@ -687,3 +689,16 @@ void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
>>   */
>>  set_memory_encrypted((unsigned long)vaddr, pages);
>>  }
>> +
>> +/*
>> + * When the crashkernel option is specified, only use the low
>> + * 1MiB for the real mode trampoline.
>> + */
>> +void __init kexec_reserve_low_1MiB(void)
>> +{
>> +if (cmdline_find_option(boot_command_line, "crashkernel",
>> +NULL, 0) > 0) {
>> +memblock_reserve(0, 1<<20);
>> +pr_info("Reserving the low 1MiB of memory for crashkernel\n");
>> +}
>> +}
>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
>> index 7dce39c8c034..064cc79a015d 100644
>> --- a/arch/x86/realmode/init.c
>> +++ b/arch/x86/realmode/init.c
>> @@ -3,6 +3,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
>>  
>>  memblock_reserve(mem, size);
>>  set_real_mode_mem(mem);
>> +kexec_reserve_low_1MiB();
>>  }
>>  
>>  static void __init setup_real_mode(void)
>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>> index 1776eb2e43a4..988bf2de51a7 100644
>> --- a/include/linux/kexec.h
>> +++ b/include/linux/kexec.h
>> @@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *);
>>  extern void crash_kexec(struct pt_regs *);
>>  int kexec_should_crash(struct task_struct *);
>>  int kexec_crash_loaded(void);
>> +void __init kexec_reserve_low_1MiB(void);
>>  void crash_save_cpu(struct pt_regs *regs, int cpu);
>>  extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
>>  
>> @@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { }
>>  static inline void crash_kexec(struct pt_regs *regs) { }
>>  static inline int kexec_should_crash(struct task_struct *p) { return 0; }
>>  static inline int kexec_crash_loaded(void) { return 0; }
>> +static

Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-25 Thread lijiang

On 2019-10-25 11:39, Eric W. Biederman wrote:
> lijiang  writes:
> 
>>  * Returns the length of the argument (regardless of if it was
>>  * truncated to fit in the buffer), or -1 on not found.
>>  */
>> static int
>> __cmdline_find_option(const char *cmdline, int max_cmdline_size,
>>   const char *option, char *buffer, int bufsize)
>>
>>
>> According to the above code comment, it should be better like this:
>>
>> +   if (cmdline_find_option(boot_command_line, "crashkernel",
>> +   NULL, 0) > 0) {
>>
>> After i test, i will post again.
>>
> 
> This seems reasonable as we are dealing with x86 only code.
> 
When we compile a non-x86 kernel, that could cause a compile error
because cmdline_find_option() is not defined on non-x86 architectures.
So I will define a weak function in kernel/kexec_core.c like this:
+
+void __init __weak kexec_reserve_low_1MiB(void)
+{}

and implement kexec_reserve_low_1MiB() in arch/x86/kernel/machine_kexec_64.c.

+/*
+ * When the crashkernel option is specified, only use the low
+ * 1MiB for the real mode trampoline.
+ */
+void __init kexec_reserve_low_1MiB(void)
+{
+   if (cmdline_find_option(boot_command_line, "crashkernel",
+   NULL, 0) > 0) {
+   memblock_reserve(0, 1<<20);
+   pr_info("Reserving the low 1MiB of memory for crashkernel\n");
+   }
+} 

That will solve the compile error on non-x86 kernels, and it also works well
on the x86 kernel.
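
As a rough userspace illustration of the weak-function approach above
(a sketch only, with hypothetical demo_* names; the kernel's __weak is
a wrapper around the same compiler attribute): the generic file provides
an empty weak default, and an architecture may supply a strong
definition with the same name in another object file, which the linker
then prefers.

```c
/*
 * Generic default. If another object file defines a non-weak
 * demo_reserve_low_1M(), the linker uses that one instead.
 */
__attribute__((weak)) unsigned long demo_reserve_low_1M(void)
{
	return 0;	/* nothing reserved by default */
}

/* Caller does not need to know which definition was linked in. */
unsigned long demo_reserve_real_mode(void)
{
	return demo_reserve_low_1M();
}
```

When built on its own, as here, the call resolves to the weak default
and nothing is reserved.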

BTW: I pasted the code at the end; please refer to it.

> It would be nice if someone could generalize cmdline_find_option to be
> arch independent so that crash_core.c:parse_crashkernel could use it.

Good point, that could be done in the future.

> I don't think for this patchset, but it looks like an overdue cleanup.
> 
> We run the risk with parse_crashkernel using strstr and this using
> another algorithm of having different kernel command line parsers
> giving different results and disagreeing if "crashkernel=" is present
> or not on the kernel command line.
> 
Indeed, but the crashkernel option sometimes has a complicated syntax; maybe
that is the reason.
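
To illustrate the risk Eric describes, here is a simplified standalone
sketch. Both demo_* helpers are hypothetical and are not the kernel's
parsers; the real cmdline_find_option() also handles quoting and a
bounded command line length. The sketch only shows how a substring
search and a word-based search can disagree on the same command line.

```c
#include <string.h>

/* Substring search: matches "crashkernel=" anywhere in the line. */
int demo_has_option_strstr(const char *cmdline, const char *opt_eq)
{
	return strstr(cmdline, opt_eq) != NULL;
}

/*
 * Word-based search: the option must start a space-separated word.
 * Returns the length of the value after '=', or -1 if not found.
 */
int demo_find_option(const char *cmdline, const char *option)
{
	size_t optlen = strlen(option);
	const char *p = cmdline;

	while (*p) {
		/* Skip the spaces between words. */
		while (*p == ' ')
			p++;
		if (!*p)
			break;

		/* Length of the current word. */
		size_t wordlen = strcspn(p, " ");

		if (wordlen > optlen && !strncmp(p, option, optlen) &&
		    p[optlen] == '=')
			return (int)(wordlen - optlen - 1);

		p += wordlen;
	}
	return -1;
}
```

For a line such as "ro nocrashkernel=1", the substring search reports a
match while the word-based search does not, which is the kind of
disagreement a shared, arch-independent parser would avoid.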

Thanks.
Lianbo

> Eric
> 
> 

---
 arch/x86/kernel/machine_kexec_64.c | 15 +++
 arch/x86/realmode/init.c   |  2 ++
 include/linux/kexec.h  |  2 ++
 kernel/kexec_core.c|  3 +++
 4 files changed, 22 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 5dcd438ad8f2..42d7c15c45f1 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -27,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_ACPI
 /*
@@ -687,3 +689,16 @@ void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
 */
set_memory_encrypted((unsigned long)vaddr, pages);
 }
+
+/*
+ * When the crashkernel option is specified, only use the low
+ * 1MiB for the real mode trampoline.
+ */
+void __init kexec_reserve_low_1MiB(void)
+{
+   if (cmdline_find_option(boot_command_line, "crashkernel",
+   NULL, 0) > 0) {
+   memblock_reserve(0, 1<<20);
+   pr_info("Reserving the low 1MiB of memory for crashkernel\n");
+   }
+}
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 7dce39c8c034..064cc79a015d 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
 
memblock_reserve(mem, size);
set_real_mode_mem(mem);
+   kexec_reserve_low_1MiB();
 }
 
 static void __init setup_real_mode(void)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 1776eb2e43a4..988bf2de51a7 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 int kexec_crash_loaded(void);
+void __init kexec_reserve_low_1MiB(void);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
 extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
 
@@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { }
 static inline void crash_kexec(struct pt_regs *regs) { }
 static inline int kexec_should_crash(struct task_struct *p) { return 0; }
 static inline int kexec_crash_loaded(void) { return 0; }
+static inline void __init kexec_reserve_low_1MiB(void) { }
 #define kexec_in_progress false
 #endif /* CONFIG_KEXEC_CORE */
 
d

Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-24 Thread lijiang

On 2019-10-25 09:38, d.hatay...@fujitsu.com wrote:
> 
>> -----Original Message-----
>> From: lijiang [mailto:liji...@redhat.com]
>> Sent: Friday, October 25, 2019 10:31 AM
>> To: Simon Horman ; Hatayama, Daisuke/畑山 大輔
>> 
>> Cc: linux-ker...@vger.kernel.org; jgr...@suse.com; thomas.lenda...@amd.com;
>> b...@redhat.com; x...@kernel.org; kexec@lists.infradead.org;
>> dhowe...@redhat.com; mi...@redhat.com; b...@alien8.de; ebied...@xmission.com;
>> h...@zytor.com; t...@linutronix.de; dyo...@redhat.com; vgo...@redhat.com
>> Subject: Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the
>> crashkernel option is specified
>>
>> On 2019-10-24 19:33, lijiang wrote:
>>> On 2019-10-24 18:07, Simon Horman wrote:
>>>> Hi Linbo,
>>>>
>>>> thanks for your patch.
>>>>
>>>> On Wed, Oct 23, 2019 at 10:19:11PM +0800, Lianbo Jiang wrote:
>>>>> Kdump kernel will reuse the first 640k region because the real mode
>>>>> trampoline has to work in this area. When the vmcore is dumped, the
>>>>> old memory in this area may be accessed, therefore, kernel has to
>>>>> copy the contents of the first 640k area to a backup region so that
>>>>> kdump kernel can read the old memory from the backup area of the
>>>>> first 640k area, which is done in the purgatory().
>>>>>
>>>>> But, the current handling of copying the first 640k area runs into
>>>>> problems when SME is enabled, kernel does not properly copy these
>>>>> old memory to the backup area in the purgatory(), thereby, kdump
>>>>> kernel reads out the encrypted contents, because the kdump kernel
>>>>> must access the first kernel's memory with the encryption bit set
>>>>> when SME is enabled in the first kernel. Please refer to this link:
>>>>>
>>>>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>>>>
>>>>> Finally, it causes the following errors, and the crash tool gets
>>>>> invalid pointers when parsing the vmcore.
>>>>>
>>>>> crash> kmem -s|grep -i invalid
>>>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>>>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>>>>> crash>
>>>>>
>>>>> To avoid the above errors, when the crashkernel option is specified,
>>>>> lets reserve the remaining low 1MiB memory(after reserving real mode
>>>>> memory) so that the allocated memory does not fall into the low 1MiB
>>>>> area, which makes us not to copy the first 640k content to a backup
>>>>> region in purgatory(). This indicates that it does not need to be
>>>>> included in crash dumps or used for anything except the processor
>>>>> trampolines that must live in the low 1MiB.
>>>>>
>>>>> Signed-off-by: Lianbo Jiang 
>>>>> ---
>>>>> BTW:I also tried to fix the above problem in purgatory(), but there
>>>>> are too many restricts in purgatory() context, for example: i can't
>>>>> allocate new memory to create the identity mapping page table for
>>>>> SME situation.
>>>>>
>>>>> Currently, there are two places where the first 640k area is needed,
>>>>> the first one is in the find_trampoline_placement(), another one is
>>>>> in the reserve_real_mode(), and their content doesn't matter.
>>>>>
>>>>> In addition, also need to clean all the code related to the backup
>>>>> region later.
>>>>>
>>>>>  arch/x86/realmode/init.c |  2 ++
>>>>>  include/linux/kexec.h|  2 ++
>>>>>  kernel/kexec_core.c  | 13 +
>>>>>  3 files changed, 17 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
>>>>> index 7dce39c8c034..064cc79a015d 100644
>>>>> --- a/arch/x86/realmode/init.c
>>>>> +++ b/arch/x86/realmode/init.c
>>>>> @@ -3,6 +3,7 @@
>>>>>  #include 
>>>>>  #include 
>>>>>  #include 
>>>>> +#include 
>>>>>
>>>>>  #include 
>>>>>  #include 
>>>>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
>>>>>
>>>>>   membloc

Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-24 Thread lijiang

On 2019-10-24 19:33, lijiang wrote:
> On 2019-10-24 18:07, Simon Horman wrote:
>> Hi Linbo,
>>
>> thanks for your patch.
>>
>> On Wed, Oct 23, 2019 at 10:19:11PM +0800, Lianbo Jiang wrote:
>>> Kdump kernel will reuse the first 640k region because the real mode
>>> trampoline has to work in this area. When the vmcore is dumped, the
>>> old memory in this area may be accessed, therefore, kernel has to
>>> copy the contents of the first 640k area to a backup region so that
>>> kdump kernel can read the old memory from the backup area of the
>>> first 640k area, which is done in the purgatory().
>>>
>>> But, the current handling of copying the first 640k area runs into
>>> problems when SME is enabled, kernel does not properly copy these
>>> old memory to the backup area in the purgatory(), thereby, kdump
>>> kernel reads out the encrypted contents, because the kdump kernel
>>> must access the first kernel's memory with the encryption bit set
>>> when SME is enabled in the first kernel. Please refer to this link:
>>>
>>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>>
>>> Finally, it causes the following errors, and the crash tool gets
>>> invalid pointers when parsing the vmcore.
>>>
>>> crash> kmem -s|grep -i invalid
>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>>> crash>
>>>
>>> To avoid the above errors, when the crashkernel option is specified,
>>> lets reserve the remaining low 1MiB memory(after reserving real mode
>>> memory) so that the allocated memory does not fall into the low 1MiB
>>> area, which makes us not to copy the first 640k content to a backup
>>> region in purgatory(). This indicates that it does not need to be
>>> included in crash dumps or used for anything except the processor
>>> trampolines that must live in the low 1MiB.
>>>
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>> BTW:I also tried to fix the above problem in purgatory(), but there
>>> are too many restricts in purgatory() context, for example: i can't
>>> allocate new memory to create the identity mapping page table for
>>> SME situation.
>>>
>>> Currently, there are two places where the first 640k area is needed,
>>> the first one is in the find_trampoline_placement(), another one is
>>> in the reserve_real_mode(), and their content doesn't matter.
>>>
>>> In addition, also need to clean all the code related to the backup
>>> region later.
>>>
>>>  arch/x86/realmode/init.c |  2 ++
>>>  include/linux/kexec.h|  2 ++
>>>  kernel/kexec_core.c  | 13 +
>>>  3 files changed, 17 insertions(+)
>>>
>>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
>>> index 7dce39c8c034..064cc79a015d 100644
>>> --- a/arch/x86/realmode/init.c
>>> +++ b/arch/x86/realmode/init.c
>>> @@ -3,6 +3,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include 
>>>  #include 
>>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
>>>  
>>> memblock_reserve(mem, size);
>>> set_real_mode_mem(mem);
>>> +   kexec_reserve_low_1MiB();
>>>  }
>>>  
>>>  static void __init setup_real_mode(void)
>>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>>> index 1776eb2e43a4..30acf1d738bc 100644
>>> --- a/include/linux/kexec.h
>>> +++ b/include/linux/kexec.h
>>> @@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *);
>>>  extern void crash_kexec(struct pt_regs *);
>>>  int kexec_should_crash(struct task_struct *);
>>>  int kexec_crash_loaded(void);
>>> +void __init kexec_reserve_low_1MiB(void);
>>>  void crash_save_cpu(struct pt_regs *regs, int cpu);
>>>  extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
>>>  
>>> @@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) 
>>> { }
>>>  static inline void crash_kexec(struct pt_regs *regs) { }
>>>  static inline int kexec_should_crash(struct task_struct *p) { return 0; }
>>>  static inline int kexec_crash_loaded(void) { return 0; }
>>> +static inline void __init kexec_reserve_low_1MiB(void) { }
>>>  #def

Re: [PATCH] x86/kdump: always reserve the low 1MiB when the crashkernel

2019-10-24 Thread lijiang

On 2019-10-25 06:12, kbuild test robot wrote:
> Hi lijiang,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on linus/master]
> [cannot apply to v5.4-rc4 next-20191024]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see 
> https://stackoverflow.com/a/37406982]
> 
> url:
> https://github.com/0day-ci/linux/commits/lijiang/x86-kdump-always-reserve-the-low-1MiB-when-the-crashkernel/20191025-030439
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> f116b96685a046a89c25d4a6ba2da489145c
> config: i386-defconfig (attached as .config)
> compiler: gcc-7 (Debian 7.4.0-14) 7.4.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
> 
> All warnings (new ones prefixed by >>):
> 
> >> WARNING: vmlinux.o(.text+0xe39b7): Section mismatch in reference from the function kexec_reserve_low_1MiB() to the variable .init.data:boot_command_line
>    The function kexec_reserve_low_1MiB() references
>    the variable __initdata boot_command_line.
>    This is often because kexec_reserve_low_1MiB lacks a __initdata
>    annotation or the annotation of boot_command_line is wrong.
> --
> >> WARNING: vmlinux.o(.text+0xe39d0): Section mismatch in reference from the function kexec_reserve_low_1MiB() to the function .meminit.text:memblock_reserve()
>    The function kexec_reserve_low_1MiB() references
>    the function __meminit memblock_reserve().
>    This is often because kexec_reserve_low_1MiB lacks a __meminit
>    annotation or the annotation of memblock_reserve is wrong.
> 
These warnings have been fixed in patch v5. Please refer to the latest patch v5.

Thanks.
Lianbo

> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-24 Thread lijiang

On 2019-10-24 18:07, Simon Horman wrote:
> Hi Linbo,
> 
> thanks for your patch.
> 
> On Wed, Oct 23, 2019 at 10:19:11PM +0800, Lianbo Jiang wrote:
>> Kdump kernel will reuse the first 640k region because the real mode
>> trampoline has to work in this area. When the vmcore is dumped, the
>> old memory in this area may be accessed, therefore, kernel has to
>> copy the contents of the first 640k area to a backup region so that
>> kdump kernel can read the old memory from the backup area of the
>> first 640k area, which is done in the purgatory().
>>
>> But, the current handling of copying the first 640k area runs into
>> problems when SME is enabled, kernel does not properly copy these
>> old memory to the backup area in the purgatory(), thereby, kdump
>> kernel reads out the encrypted contents, because the kdump kernel
>> must access the first kernel's memory with the encryption bit set
>> when SME is enabled in the first kernel. Please refer to this link:
>>
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>
>> Finally, it causes the following errors, and the crash tool gets
>> invalid pointers when parsing the vmcore.
>>
>> crash> kmem -s|grep -i invalid
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> crash>
>>
>> To avoid the above errors, when the crashkernel option is specified,
>> lets reserve the remaining low 1MiB memory(after reserving real mode
>> memory) so that the allocated memory does not fall into the low 1MiB
>> area, which makes us not to copy the first 640k content to a backup
>> region in purgatory(). This indicates that it does not need to be
>> included in crash dumps or used for anything except the processor
>> trampolines that must live in the low 1MiB.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> BTW:I also tried to fix the above problem in purgatory(), but there
>> are too many restricts in purgatory() context, for example: i can't
>> allocate new memory to create the identity mapping page table for
>> SME situation.
>>
>> Currently, there are two places where the first 640k area is needed,
>> the first one is in the find_trampoline_placement(), another one is
>> in the reserve_real_mode(), and their content doesn't matter.
>>
>> In addition, also need to clean all the code related to the backup
>> region later.
>>
>>  arch/x86/realmode/init.c |  2 ++
>>  include/linux/kexec.h|  2 ++
>>  kernel/kexec_core.c  | 13 +
>>  3 files changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
>> index 7dce39c8c034..064cc79a015d 100644
>> --- a/arch/x86/realmode/init.c
>> +++ b/arch/x86/realmode/init.c
>> @@ -3,6 +3,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
>>  
>>  memblock_reserve(mem, size);
>>  set_real_mode_mem(mem);
>> +kexec_reserve_low_1MiB();
>>  }
>>  
>>  static void __init setup_real_mode(void)
>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>> index 1776eb2e43a4..30acf1d738bc 100644
>> --- a/include/linux/kexec.h
>> +++ b/include/linux/kexec.h
>> @@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *);
>>  extern void crash_kexec(struct pt_regs *);
>>  int kexec_should_crash(struct task_struct *);
>>  int kexec_crash_loaded(void);
>> +void __init kexec_reserve_low_1MiB(void);
>>  void crash_save_cpu(struct pt_regs *regs, int cpu);
>>  extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
>>  
>> @@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { 
>> }
>>  static inline void crash_kexec(struct pt_regs *regs) { }
>>  static inline int kexec_should_crash(struct task_struct *p) { return 0; }
>>  static inline int kexec_crash_loaded(void) { return 0; }
>> +static inline void __init kexec_reserve_low_1MiB(void) { }
>>  #define kexec_in_progress false
>>  #endif /* CONFIG_KEXEC_CORE */
>>  
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index 15d70a90b50d..5bd89f1fee42 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -37,6 +37,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -70,6 +71,18 @@ struct resource crashk_low_res = {
>>  .desc  = IORES_DESC_CRASH_KERNEL
>>  };
>>  
>> +/*
>> + * When the crashkernel option is specified, only use the low
>> + * 1MiB for the real mode trampoline.
>> + */
>> +void __init kexec_reserve_low_1MiB(void)
>> +{
>> +if (strstr(boot_command_line, "crashkernel=")) {
> 
> Could you comment on the issue of using strstr which
> was raised by Hatayama-san in response to an earlier revision
> of this patch?
> 

Thank you, Simon and Hatayama-san. Let's discuss it here.

> strstr() matches for example, 
> ANYEXTRACHARACTERScrashkernel=ANYEXTRACHARACTERS.
> 
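To illustrate the concern about strstr(), here is a minimal userspace
sketch. It is not kernel code and not part of the patch, and
cmdline_has_param() is a hypothetical helper introduced only for this
illustration. It accepts a match only when the match starts the command
line or follows whitespace, so a string such as "foocrashkernel=1" is
rejected, whereas a plain strstr() check accepts it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration, not a kernel API):
 * return true only if "param" begins a whitespace-delimited token in
 * "cmdline". A bare strstr() would also match in the middle of a token.
 */
static bool cmdline_has_param(const char *cmdline, const char *param)
{
	size_t len = strlen(param);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while ((p = strstr(p, param)) != NULL) {
		/* Accept only a match at the start or right after whitespace. */
		if (p == cmdline || p[-1] == ' ' || p[-1] == '\t')
			return true;
		/* Skip this false positive and keep searching. */
		p += len;
	}
	return false;
}
```

With this helper, cmdline_has_param("ro crashkernel=256M quiet",
"crashkernel=") is true, while cmdline_has_param("foocrashkernel=1",
"crashkernel=") is false even though strstr() finds the substring.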
> Is it enough to use

Re: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-24 Thread lijiang

On 2019-10-24 16:13, d.hatay...@fujitsu.com wrote:
> I don't find the corresponding patch in the v5 patchset, so I comment here.
> 
Thanks for your comment.

>> -Original Message-
>> From: linux-kernel-ow...@vger.kernel.org
>> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of lijiang
>> Sent: Wednesday, October 23, 2019 2:35 PM
>> To: Borislav Petkov 
>> Cc: linux-ker...@vger.kernel.org; t...@linutronix.de; mi...@redhat.com;
>> h...@zytor.com; x...@kernel.org; b...@redhat.com; dyo...@redhat.com;
>> jgr...@suse.com; dhowe...@redhat.com; thomas.lenda...@amd.com;
>> ebied...@xmission.com; vgo...@redhat.com; kexec@lists.infradead.org
>> Subject: Re: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the
>> crashkernel option is specified
>>
>> On 2019-10-22 16:30, Borislav Petkov wrote:
>>> This ifdeffery needs to be a function in kernel/kexec_core.c which is
>>> called by reserve_real_mode(), instead.
>>
>> Would you mind if i improve this patch as follow? Thanks.
>>
>> From 5804abec62279585f374d78ace1250505c44c6b7 Mon Sep 17 00:00:00 2001
>> From: Lianbo Jiang 
>> Date: Wed, 23 Oct 2019 11:27:04 +0800
>> Subject: [PATCH] x86/kdump: always reserve the low 1MiB when the crashkernel
>>  option is specified
>>
>> Kdump kernel will reuse the first 640k region because the real mode
>> trampoline has to work in this area. When the vmcore is dumped, the
>> old memory in this area may be accessed, therefore, kernel has to
>> copy the contents of the first 640k area to a backup region so that
>> kdump kernel can read the old memory from the backup area of the
>> first 640k area, which is done in the purgatory().
>>
>> But, the current handling of copying the first 640k area runs into
>> problems when SME is enabled, kernel does not properly copy these
>> old memory to the backup area in the purgatory(), thereby, kdump
>> kernel reads out the encrypted contents, because the kdump kernel
>> must access the first kernel's memory with the encryption bit set
>> when SME is enabled in the first kernel. Please refer to this link:
>>
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>
>> Finally, it causes the following errors, and the crash tool gets
>> invalid pointers when parsing the vmcore.
>>
>> crash> kmem -s|grep -i invalid
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> crash>
>>
>> To avoid the above errors, when the crashkernel option is specified,
>> lets reserve the remaining low 1MiB memory(after reserving real mode
>> memory) so that the allocated memory does not fall into the low 1MiB
>> area, which makes us not to copy the first 640k content to a backup
>> region in purgatory(). This indicates that it does not need to be
>> included in crash dumps or used for anything except the processor
>> trampolines that must live in the low 1MiB.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> BTW:I also tried to fix the above problem in purgatory(), but there
>> are too many restricts in purgatory() context, for example: i can't
>> allocate new memory to create the identity mapping page table for
>> SME situation.
>>
>> Currently, there are two places where the first 640k area is needed,
>> the first one is in the find_trampoline_placement(), another one is
>> in the reserve_real_mode(), and their content doesn't matter.
>>
>> In addition, also need to clean all the code related to the backup
>> region later.
>>
>>  arch/x86/realmode/init.c |  2 ++
>>  include/linux/kexec.h|  2 ++
>>  kernel/kexec_core.c  | 13 +
>>  3 files changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
>> index 7dce39c8c034..064cc79a015d 100644
>> --- a/arch/x86/realmode/init.c
>> +++ b/arch/x86/realmode/init.c
>> @@ -3,6 +3,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
>>
>>  memblock_reserve(mem, size);
>>  set_real_mode_mem(mem);
>> +kexec_reserve_low_1MiB();
>>  }
>>
>>  static void __init setup_real_mode(void)
>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>> index 1776eb2e43a4..30acf1d738bc 100644
>> --- a/include/linux/kexec.h
>> +++ b/include/linux/kexec.h
>> @@ -306,6 +30

Re: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-23 Thread lijiang

On 2019-10-23 15:46, Borislav Petkov wrote:
> On Wed, Oct 23, 2019 at 01:35:09PM +0800, lijiang wrote:
>> Would you mind if i improve this patch as follow? Thanks.
> 
> Yap, looks good to me.
> 
Thanks for your comment.

OK. I will post this one and the third patch in this series later.

Thanks.
Lianbo


> Thx.
>

Re: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-22 Thread lijiang

On 2019-10-22 16:30, Borislav Petkov wrote:
> This ifdeffery needs to be a function in kernel/kexec_core.c which is
> called by reserve_real_mode(), instead.

Would you mind if I improve this patch as follows? Thanks.

From 5804abec62279585f374d78ace1250505c44c6b7 Mon Sep 17 00:00:00 2001
From: Lianbo Jiang 
Date: Wed, 23 Oct 2019 11:27:04 +0800
Subject: [PATCH] x86/kdump: always reserve the low 1MiB when the crashkernel
 option is specified

The kdump kernel reuses the first 640k region because the real mode
trampoline has to work in this area. When the vmcore is dumped, the
old memory in this area may be accessed; therefore, the kernel has to
copy the contents of the first 640k area to a backup region so that
the kdump kernel can read the old memory from the backup area of the
first 640k area, which is done in purgatory().

However, the current handling of copying the first 640k area runs into
problems when SME is enabled: the kernel does not properly copy this
old memory to the backup area in purgatory(), so the kdump kernel
reads out the encrypted contents, because the kdump kernel must
access the first kernel's memory with the encryption bit set when
SME is enabled in the first kernel. Please refer to this link:

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793

As a result, the following errors occur, and the crash tool gets
invalid pointers when parsing the vmcore.

crash> kmem -s|grep -i invalid
kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
crash>

To avoid the above errors, when the crashkernel option is specified,
let's reserve the remaining low 1MiB of memory (after reserving the real
mode memory) so that allocated memory does not fall into the low 1MiB
area, which means we no longer need to copy the first 640k content to a
backup region in purgatory(). This indicates that it does not need to be
included in crash dumps or used for anything except the processor
trampolines that must live in the low 1MiB.

Signed-off-by: Lianbo Jiang 
---
BTW: I also tried to fix the above problem in purgatory(), but there
are too many restrictions in the purgatory() context; for example, I
can't allocate new memory to create the identity mapping page table for
the SME situation.

Currently, there are two places where the first 640k area is needed:
the first one is in find_trampoline_placement(), the other one is
in reserve_real_mode(), and their content doesn't matter.

In addition, all the code related to the backup region also needs to
be cleaned up later.

 arch/x86/realmode/init.c |  2 ++
 include/linux/kexec.h|  2 ++
 kernel/kexec_core.c  | 13 +
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 7dce39c8c034..064cc79a015d 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -34,6 +35,7 @@ void __init reserve_real_mode(void)

memblock_reserve(mem, size);
set_real_mode_mem(mem);
+   kexec_reserve_low_1MiB();
 }

 static void __init setup_real_mode(void)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 1776eb2e43a4..30acf1d738bc 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 int kexec_crash_loaded(void);
+void kexec_reserve_low_1MiB(void);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
 extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);

@@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { }
 static inline void crash_kexec(struct pt_regs *regs) { }
 static inline int kexec_should_crash(struct task_struct *p) { return 0; }
 static inline int kexec_crash_loaded(void) { return 0; }
+static inline void kexec_reserve_low_1MiB(void) { }
 #define kexec_in_progress false
 #endif /* CONFIG_KEXEC_CORE */

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 15d70a90b50d..5bd89f1fee42 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -70,6 +71,18 @@ struct resource crashk_low_res = {
.desc  = IORES_DESC_CRASH_KERNEL
 };

+/*
+ * When the crashkernel option is specified, only use the low
+ * 1MiB for the real mode trampoline.
+ */
+void kexec_reserve_low_1MiB(void)
+{
+	if (strstr(boot_command_line, "crashkernel=")) {
+		memblock_reserve(0, 1<<20);
+		pr_info("Reserving the low 1MiB of memory for crashkernel\n");
+	}
+}
+
 int kexec_should_crash(struct task_struct *p)
 {
/*
-- 
2.17.1

Re: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-22 Thread lijiang

On 2019-10-22 16:30, Borislav Petkov wrote:
> On Thu, Oct 17, 2019 at 05:43:45PM +0800, Lianbo Jiang wrote:
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
> 
Thanks for your comment.

> Put that as a Link: below.
> 
Looks better. OK.

>> Kdump kernel will reuse the first 640k region because of some reasons,
> 
> s/ of some reasons//
> 
>> for example: the trampline and conventional PC system BIOS region may
> 
> spellcheck: s/trampline/trampoline/
> 
> I see two more typos in here and if you had a spellchecker enabled in
> your editor where you write the commit message, you'll see them too.
> Please use one.
> 
Good point. I just enabled the spell checker in vim, and it works well
now. Thanks. :-)

>> require to allocate memory in this area. Obviously, kdump kernel will
>> also overwrite the first 640k region,
> 
> Well, it is not obvious to me. Please be more specific: why would the
> kdump kernel do that?
> 
The kdump kernel reuses the first 640k region because the real mode
trampoline has to work in this area. When the vmcore is dumped, the
old memory in this area may be accessed; therefore, the kernel has to
copy the contents of the first 640k area to a backup region so that
the kdump kernel can read the old memory from the backup area of the
first 640k area, which is done in purgatory().

>> therefore, kernel has to copy
>> the contents of the first 640k area to a backup area, which is done in
>> purgatory(), because vmcore may need the old memory. When vmcore is
>> dumped, kdump kernel will read the old memory from the backup area of
>> the first 640k area.
>>
>> Basically, the main reason should be clear, kernel does not correctly
>> handle the first 640k region when SME is active,
> 
> If you mention the actual reason here, that sentence would be clearer:
> 
> "When SME is enabled in the first kernel, the kdump kernel must access
> the first kernel's memory with the encryption bit set."
> 
> Something like that. 
> 
Looks good.

>> which causes that
>> kernel does not properly copy these old memory to the backup area in
>> purgatory(). Therefore, kdump kernel reads out the incorrect contents
> 
> s/incorrect/encrypted/
> 
Exactly.

>> from the backup area when dumping vmcore. Finally, the phenomenon is
> 
> phenomenon?
> 
Finally, it caused the following errors.

>> as follow:
>>
>> [root linux]$ crash vmlinux /var/crash/127.0.0.1-2019-09-19-08\:31\:27/vmcore
>> WARNING: kernel relocated [240MB]: patching 97110 gdb minimal_symbol values
>>
>>   KERNEL: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmlinux
>> DUMPFILE: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmcore  [PARTIAL DUMP]
>> CPUS: 128
>> DATE: Thu Sep 19 08:31:18 2019
>>   UPTIME: 00:01:21
>> LOAD AVERAGE: 0.16, 0.07, 0.02
>>TASKS: 1343
>> NODENAME: amd-ethanol
>>  RELEASE: 5.3.0-rc7+
>>  VERSION: #4 SMP Thu Sep 19 08:14:00 EDT 2019
>>  MACHINE: x86_64  (2195 Mhz)
>>   MEMORY: 127.9 GB
>>PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>  PID: 9789
>>  COMMAND: "bash"
>> TASK: "89711894ae80  [THREAD_INFO: 89711894ae80]"
>>  CPU: 83
>>STATE: TASK_RUNNING (PANIC)
>>
>> crash> kmem -s|grep -i invalid
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> crash>
> 
> I fail to see what that's trying to tell me? You have invalid pointers?
> 
Yes. When the vmcore is parsed with the crash tool, the above errors
occur and the crash tool gets invalid pointers.

>> BTW: I also tried to fix the above problem in purgatory(), but there
>> are too many restricts in purgatory() context, for example: i can't
>> allocate new memory to create the identity mapping page table for SME
>> situation.
> 
> This paragraph belongs under the "---" line below.
> 
OK. Thanks.

>> Currently, there are two places where the first 640k area is needed,
>> the first one is in the find_trampoline_placement(), another one is
>> in the reserve_real_mode(), and their content doesn't matter.
>>
>> To avoid the above error, when the crashkernel kernel command line
>> option is specified, lets reserve the remaining low 1MiB memory(
>> after reserving real mode memroy) so that the allocated memory does
>> not fall into the low 1MiB area, which makes us not to copy the first
>> 640k content to a backup region in purgatory(). This indicates that
>> it does not need to be included in crash dumps or used for anything
>> execept the processor trampolines that must live in the low 1MiB.
>>
>> In addition, also need to clean all the code related to the backup
>> region later.
> 
> Ditto.
> 
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  arch/x86/realmode/init.c | 11 +++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
>> index 7dce39c8c034..1f0492830f2c 100644
>> ---

Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-16 Thread lijiang

On 2019-10-15 19:04, Eric W. Biederman wrote:
> lijiang  writes:
> 
>> On 2019-10-13 11:54, Eric W. Biederman wrote:
>>> Dave Young  writes:
>>>
>>>> Hi Eric,
>>>>
>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>> Lianbo Jiang  writes:
>>>>>
>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>> 640k content to a backup region.
>>>>>>
>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>> so lets clean up.
>>>>>>
>>>>>> Signed-off-by: Lianbo Jiang 
>>>>>> ---
>>>>>
>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs 
>>>>>> *regs)
>>>>>>  
>>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>>  
>>>>>> -static unsigned long crash_zero_bytes;
>>>>>> -
>>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  unsigned int *nr_ranges = arg;
>>>>>> @@ -234,9 +232,15 @@ static int 
>>>>>> prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  struct crash_mem *cmem = arg;
>>>>>>  
>>>>>> -cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> -cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> -cmem->nr_ranges++;
>>>>>> +if (res->start >= SZ_1M) {
>>>>>> +cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> +cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +cmem->nr_ranges++;
>>>>>> +} else if (res->end > SZ_1M) {
>>>>>> +cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>> +cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +cmem->nr_ranges++;
>>>>>> +}
>>>>>
>>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>>> comment.
>>>>
>>>> Indeed it needs some code comment, this is based on some offline
>>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>>> mapping the system ram.
>>>>
>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>> kernel can use the low 1M memory because for example the trampoline
>>>> code.
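The clipping logic in the quoted hunk can be modelled in userspace as
follows. This is only a sketch under stated assumptions: struct mem_range
and clip_above_1m() are hypothetical names, not kernel APIs, and "end" is
treated as inclusive, as it is in struct resource. Ranges that lie
entirely in the low 1MiB are dropped, ranges that straddle 1MiB are
trimmed to start at 1MiB, and ranges above it are kept unchanged.

```c
#include <stddef.h>

#define SZ_1M 0x100000UL

/* Hypothetical stand-in for a RAM range; "end" is inclusive. */
struct mem_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Mirror the branch structure of the quoted hunk:
 *  - start >= 1MiB: keep the range as-is
 *  - end   >  1MiB: trim the range so that it starts at 1MiB
 *  - otherwise:     the range is entirely low memory, so drop it
 * Returns 1 if "out" was filled in, 0 if the range was dropped.
 */
static int clip_above_1m(const struct mem_range *in, struct mem_range *out)
{
	if (in->start >= SZ_1M) {
		*out = *in;
		return 1;
	}
	if (in->end > SZ_1M) {
		out->start = SZ_1M;
		out->end = in->end;
		return 1;
	}
	return 0;
}
```

For example, the range [0x1000, 0x9ffff] is dropped, while
[0x1000, 0x7fffffff] becomes [0x100000, 0x7fffffff]. Only the ranges
that survive would be described in the ELF headers of the vmcore; the
low 1MiB is still passed to the second kernel through e820 as described
above.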
>>>>
>>>>>
>>>>>>  
>>>>>>  return 0;
>>>>>>  }
>>>>>
>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage 
>>>>>> *image, struct boot_params *params)
>>>>>>  memset(, 0, sizeof(struct crash_memmap_data));
>>>>>>  cmd.params = params;
>>>>>>  
>>>>>> -/* Add first 640K segment */
>>>>>> -ei.addr = image->arch.backup_src_start;
>>>>>> -ei.size = image->arch.backup_src_sz;
>>>>>> +/*
>>>>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>>>>> + * the first zero page.
>>>>>> + */
>>>>>> +ei.addr = PAGE_SIZE;
>>>>>> +ei.size = SZ_1M - PAGE_SIZE;
>>>>>>  ei.type = E820_TYPE_RAM;
>>>>>>  add_e820_entry(params, );
>>>>>
>>>>> Likewise here.  Why do we need a special case?
>>>>> Why the magic with PAGE_SIZE?
>>>>
>>>> Good catch, the zero page part is useless, I think no ot

Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-16 Thread lijiang

On 2019-10-15 19:04, Eric W. Biederman wrote:
> lijiang  writes:
> 
>> On 2019-10-13 11:54, Eric W. Biederman wrote:
>>> Dave Young  writes:
>>>
>>>> Hi Eric,
>>>>
>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>> Lianbo Jiang  writes:
>>>>>
>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>> 640k content to a backup region.
>>>>>>
>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>> so lets clean up.
>>>>>>
>>>>>> Signed-off-by: Lianbo Jiang 
>>>>>> ---
>>>>>
>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs 
>>>>>> *regs)
>>>>>>  
>>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>>  
>>>>>> -static unsigned long crash_zero_bytes;
>>>>>> -
>>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  unsigned int *nr_ranges = arg;
>>>>>> @@ -234,9 +232,15 @@ static int 
>>>>>> prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  struct crash_mem *cmem = arg;
>>>>>>  
>>>>>> -cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> -cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> -cmem->nr_ranges++;
>>>>>> +if (res->start >= SZ_1M) {
>>>>>> +cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> +cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +cmem->nr_ranges++;
>>>>>> +} else if (res->end > SZ_1M) {
>>>>>> +cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>> +cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +cmem->nr_ranges++;
>>>>>> +}
>>>>>
>>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>>> comment.
>>>>
>>>> Indeed it needs some code comment, this is based on some offline
>>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>>> mapping the system ram.
>>>>
>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>> kernel can use the low 1M memory because for example the trampoline
>>>> code.
>>>>
>>>>>
>>>>>>  
>>>>>>  return 0;
>>>>>>  }
>>>>>
>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage 
>>>>>> *image, struct boot_params *params)
>>>>>>  memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>>  cmd.params = params;
>>>>>>  
>>>>>> -/* Add first 640K segment */
>>>>>> -ei.addr = image->arch.backup_src_start;
>>>>>> -ei.size = image->arch.backup_src_sz;
>>>>>> +/*
>>>>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>>>>> + * the first zero page.
>>>>>> + */
>>>>>> +ei.addr = PAGE_SIZE;
>>>>>> +ei.size = SZ_1M - PAGE_SIZE;
>>>>>>  ei.type = E820_TYPE_RAM;
>>>>>>  add_e820_entry(params, &ei);
>>>>>
>>>>> Likewise here.  Why do we need a special case?
>>>>> Why the magic with PAGE_SIZE?
>>>>
>>>> Good catch, the zero page part is useless, I think no ot

Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-15 Thread lijiang

On 2019-10-15 19:11, Eric W. Biederman wrote:
> lijiang  writes:
> 
>> On 2019-10-12 20:16, Dave Young wrote:
>>> Hi Eric,
>>>
>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>> Lianbo Jiang  writes:
>>>>
>>>>> When the crashkernel kernel command line option is specified, the
>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>> necessary to create a backup region and also no need to copy the first
>>>>> 640k content to a backup region.
>>>>>
>>>>> Currently, the code related to the backup region can be safely removed,
>>>>> so lets clean up.
>>>>>
>>>>> Signed-off-by: Lianbo Jiang 
>>>>> ---
>>>>
>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>> --- a/arch/x86/kernel/crash.c
>>>>> +++ b/arch/x86/kernel/crash.c
>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs 
>>>>> *regs)
>>>>>  
>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>  
>>>>> -static unsigned long crash_zero_bytes;
>>>>> -
>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>  {
>>>>>   unsigned int *nr_ranges = arg;
>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct 
>>>>> resource *res, void *arg)
>>>>>  {
>>>>>   struct crash_mem *cmem = arg;
>>>>>  
>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> - cmem->nr_ranges++;
>>>>> + if (res->start >= SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + } else if (res->end > SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + }
>>>>
>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>> comment.
>>>
>>> Indeed it needs some code comment, this is based on some offline
>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>> mapping the system ram.
>>>
>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>> kernel can use the low 1M memory because for example the trampoline
>>> code.
>>>
>> Thank you, Eric and Dave. I will add the code comment as below if it would 
>> be OK.
>>
>> @@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct 
>> resource *res, void *arg)
>>  {
>> struct crash_mem *cmem = arg;
>>  
>> -   cmem->ranges[cmem->nr_ranges].start = res->start;
>> -   cmem->ranges[cmem->nr_ranges].end = res->end;
>> -   cmem->nr_ranges++;
>> +   /*
>> +* Currently, pass the low 1MiB range to kdump kernel in e820
>> +* as system ram so that kdump kernel can also use the low 1MiB
>> +* memory due to the real mode trampoline code.
>> +* And later, the low 1MiB range will be excluded from elf header,
>> +* which will avoid remapping the 1MiB system ram when dumping
>> +* vmcore.
>> +*/
>> +   if (res->start >= SZ_1M) {
>> +   cmem->ranges[cmem->nr_ranges].start = res->start;
>> +   cmem->ranges[cmem->nr_ranges].end = res->end;
>> +   cmem->nr_ranges++;
>> +   } else if (res->end > SZ_1M) {
>> +   cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>> +   cmem->ranges[cmem->nr_ranges].end = res->end;
>> +   cmem->nr_ranges++;
>> +   }
>>  
>> return 0;
>>  }
> 
> I just read through the appropriate section of crash.c and the way
> things are structured doing this work in
> prepare_elf64_ram_headers_callback is wrong.
> 
> This can be done in a simpler manner in elf_header_exclude_ranges.
> Something like:
> 
Thank you, Eric. That does seem to be a more reasonable place; I will test it
and improve this in the next post.

Lianbo

>   /* The low 1MiB is always reserved */
>   ret = crash_exclude_mem_range(cmem, 0, 1024*1024);
>   if (ret)
>   return ret;
> 
> Eric
>
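
For readers following the thread, the exclusion Eric suggests can be sketched
in user space as below. This is only an illustration under stated assumptions,
not the kernel's crash_exclude_mem_range(): the names mem_range, mem_list,
push_range and exclude_range are made up for this sketch, and the end of a
range is assumed to be inclusive, so the low 1MiB is passed as [0, SZ_1M - 1].

```c
#include <stddef.h>
#include <stdint.h>

#define SZ_1M       0x100000ULL /* 1 MiB */
#define MAX_RANGES  16          /* arbitrary capacity for this sketch */

/* Hypothetical stand-ins for the kernel's crash_mem bookkeeping. */
struct mem_range {
	uint64_t start;
	uint64_t end;   /* inclusive (assumption of this sketch) */
};

struct mem_list {
	size_t nr;
	struct mem_range r[MAX_RANGES];
};

/* Append one range; returns -1 when the list is full. */
static int push_range(struct mem_list *m, uint64_t start, uint64_t end)
{
	if (m->nr >= MAX_RANGES)
		return -1;
	m->r[m->nr].start = start;
	m->r[m->nr].end = end;
	m->nr++;
	return 0;
}

/*
 * Remove [xs, xe] (inclusive) from every range in the list.
 * A range that straddles the excluded window is trimmed or split.
 * Returns 0 on success, -1 if the result does not fit; on failure
 * the input list is left untouched.
 */
static int exclude_range(struct mem_list *m, uint64_t xs, uint64_t xe)
{
	struct mem_list out = { 0 };
	size_t i;

	for (i = 0; i < m->nr; i++) {
		uint64_t s = m->r[i].start, e = m->r[i].end;

		if (xe < s || xs > e) {
			/* No overlap: keep the range as it is. */
			if (push_range(&out, s, e))
				return -1;
			continue;
		}
		/* Keep the part below the window, if any. */
		if (s < xs && push_range(&out, s, xs - 1))
			return -1;
		/* Keep the part above the window, if any. */
		if (e > xe && push_range(&out, xe + 1, e))
			return -1;
	}
	*m = out;
	return 0;
}
```

With this shape, dropping the low 1MiB is a single call,
exclude_range(&list, 0, SZ_1M - 1), made after all RAM ranges have been
collected, so the collecting callback needs no special case.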

Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-14 Thread lijiang

On 2019-10-12 20:16, Dave Young wrote:
> Hi Eric,
> 
> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>> Lianbo Jiang  writes:
>>
>>> When the crashkernel kernel command line option is specified, the
>>> low 1MiB memory will always be reserved, which makes that the memory
>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>> necessary to create a backup region and also no need to copy the first
>>> 640k content to a backup region.
>>>
>>> Currently, the code related to the backup region can be safely removed,
>>> so lets clean up.
>>>
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index eb651fbde92a..cc5774fc84c0 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>  
>>>  #ifdef CONFIG_KEXEC_FILE
>>>  
>>> -static unsigned long crash_zero_bytes;
>>> -
>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>  {
>>> unsigned int *nr_ranges = arg;
>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct 
>>> resource *res, void *arg)
>>>  {
>>> struct crash_mem *cmem = arg;
>>>  
>>> -   cmem->ranges[cmem->nr_ranges].start = res->start;
>>> -   cmem->ranges[cmem->nr_ranges].end = res->end;
>>> -   cmem->nr_ranges++;
>>> +   if (res->start >= SZ_1M) {
>>> +   cmem->ranges[cmem->nr_ranges].start = res->start;
>>> +   cmem->ranges[cmem->nr_ranges].end = res->end;
>>> +   cmem->nr_ranges++;
>>> +   } else if (res->end > SZ_1M) {
>>> +   cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>> +   cmem->ranges[cmem->nr_ranges].end = res->end;
>>> +   cmem->nr_ranges++;
>>> +   }
>>
>> What is going on with this chunk?  I can guess but this needs a clear
>> comment.
> 
> Indeed it needs some code comment, this is based on some offline
> discussion.  cat /proc/vmcore will give a warning because ioremap is
> mapping the system ram.
> 
> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
> kernel can use the low 1M memory because for example the trampoline
> code.
> 
Thank you, Eric and Dave. I will add a code comment as below, if that is OK.

@@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
 {
    struct crash_mem *cmem = arg;
 
-   cmem->ranges[cmem->nr_ranges].start = res->start;
-   cmem->ranges[cmem->nr_ranges].end = res->end;
-   cmem->nr_ranges++;
+   /*
+    * Currently, pass the low 1MiB range to kdump kernel in e820
+    * as system ram so that kdump kernel can also use the low 1MiB
+    * memory due to the real mode trampoline code.
+    * And later, the low 1MiB range will be excluded from elf header,
+    * which will avoid remapping the 1MiB system ram when dumping
+    * vmcore.
+    */
+   if (res->start >= SZ_1M) {
+       cmem->ranges[cmem->nr_ranges].start = res->start;
+       cmem->ranges[cmem->nr_ranges].end = res->end;
+       cmem->nr_ranges++;
+   } else if (res->end > SZ_1M) {
+       cmem->ranges[cmem->nr_ranges].start = SZ_1M;
+       cmem->ranges[cmem->nr_ranges].end = res->end;
+       cmem->nr_ranges++;
+   }
 
    return 0;
 }
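
To see what the two branches do with a few sample ranges, here is a
user-space sketch of the same filtering. The type ram_ranges and the function
add_above_1m are invented for illustration and are not kernel definitions;
only SZ_1M and the branch conditions are taken from the hunk above.

```c
#include <stddef.h>
#include <stdint.h>

#define SZ_1M 0x100000ULL /* 1 MiB */

/* Hypothetical stand-in for struct crash_mem; not the kernel definition. */
struct ram_ranges {
	size_t nr_ranges;
	struct {
		uint64_t start;
		uint64_t end;
	} ranges[8];
};

/*
 * Mirror of the branch logic in the hunk: record a RAM resource only
 * for the part that lies at or above 1 MiB.
 */
static void add_above_1m(struct ram_ranges *cmem, uint64_t start, uint64_t end)
{
	if (cmem->nr_ranges >= 8)
		return; /* sketch only: silently drop when full */

	if (start >= SZ_1M) {
		/* Entirely above 1 MiB: keep unchanged. */
		cmem->ranges[cmem->nr_ranges].start = start;
		cmem->ranges[cmem->nr_ranges].end = end;
		cmem->nr_ranges++;
	} else if (end > SZ_1M) {
		/* Straddles 1 MiB: keep only the upper part. */
		cmem->ranges[cmem->nr_ranges].start = SZ_1M;
		cmem->ranges[cmem->nr_ranges].end = end;
		cmem->nr_ranges++;
	}
	/* Resources that end at or below 1 MiB are skipped entirely. */
}
```

A resource such as [0x1000, 0x9ffff] is dropped, while [0x1000, 0x1fffff] is
recorded as [SZ_1M, 0x1fffff].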

>>
>>>  
>>> return 0;
>>>  }
>>
>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, 
>>> struct boot_params *params)
>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>> cmd.params = params;
>>>  
>>> -   /* Add first 640K segment */
>>> -   ei.addr = image->arch.backup_src_start;
>>> -   ei.size = image->arch.backup_src_sz;
>>> +   /*
>>> +* Add the low memory range[0x1000, SZ_1M], skip
>>> +* the first zero page.
>>> +*/
>>> +   ei.addr = PAGE_SIZE;
>>> +   ei.size = SZ_1M - PAGE_SIZE;
>>> ei.type = E820_TYPE_RAM;
>>> add_e820_entry(params, &ei);
>>
>> Likewise here.  Why do we need a special case?
>> Why the magic with PAGE_SIZE?
> 
> Good catch, the zero page part is useless, I think no other special
> reason, just assumed zero page is not usable, but it should be ok to
> remove the special handling, just pass 0 - 1M is good enough.
> 
> Thanks
> Dave
> 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-13 Thread lijiang

On 2019-10-13 11:54, Eric W. Biederman wrote:
> Dave Young  writes:
> 
>> Hi Eric,
>>
>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>> Lianbo Jiang  writes:
>>>
>>>> When the crashkernel kernel command line option is specified, the
>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>> necessary to create a backup region and also no need to copy the first
>>>> 640k content to a backup region.
>>>>
>>>> Currently, the code related to the backup region can be safely removed,
>>>> so lets clean up.
>>>>
>>>> Signed-off-by: Lianbo Jiang 
>>>> ---
>>>
>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>> --- a/arch/x86/kernel/crash.c
>>>> +++ b/arch/x86/kernel/crash.c
>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>  
>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>  
>>>> -static unsigned long crash_zero_bytes;
>>>> -
>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>  {
>>>>     unsigned int *nr_ranges = arg;
>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>  {
>>>>     struct crash_mem *cmem = arg;
>>>>  
>>>> -   cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> -   cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> -   cmem->nr_ranges++;
>>>> +   if (res->start >= SZ_1M) {
>>>> +       cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> +       cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> +       cmem->nr_ranges++;
>>>> +   } else if (res->end > SZ_1M) {
>>>> +       cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>> +       cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> +       cmem->nr_ranges++;
>>>> +   }
>>>
>>> What is going on with this chunk?  I can guess but this needs a clear
>>> comment.
>>
>> Indeed it needs some code comment, this is based on some offline
>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>> mapping the system ram.
>>
>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>> kernel can use the low 1M memory because for example the trampoline
>> code.
>>
>>>
>>>>  
>>>>     return 0;
>>>>  }
>>>
>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>     memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>     cmd.params = params;
>>>>  
>>>> -   /* Add first 640K segment */
>>>> -   ei.addr = image->arch.backup_src_start;
>>>> -   ei.size = image->arch.backup_src_sz;
>>>> +   /*
>>>> +    * Add the low memory range[0x1000, SZ_1M], skip
>>>> +    * the first zero page.
>>>> +    */
>>>> +   ei.addr = PAGE_SIZE;
>>>> +   ei.size = SZ_1M - PAGE_SIZE;
>>>>     ei.type = E820_TYPE_RAM;
>>>>     add_e820_entry(params, &ei);
>>>
>>> Likewise here.  Why do we need a special case?
>>> Why the magic with PAGE_SIZE?
>>
>> Good catch, the zero page part is useless, I think no other special
>> reason, just assumed zero page is not usable, but it should be ok to
>> remove the special handling, just pass 0 - 1M is good enough.
> 
> But if we have stopped special casing the low 1M.  Why do we need a
> special case here at all?
> 
We need to pass the low memory range to the kdump kernel here to guarantee
that low memory is available to it; otherwise, the kdump kernel won't use the
low memory region.

> If you need the special case it is almost certainly wrong to say you
> have ram above 640KiB and below 1MiB.  That is the legacy ROM and video
> MMIO area.
> 
> There is a reason the original code said 640KiB.
> 
Do you mean that the 640k region is good enough here instead of 1MiB?

Thanks.
Lianbo

> Eric
> 


Re: [PATCH v2] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-10-07 Thread lijiang

On 2019-10-08 01:12, Eric W. Biederman wrote:
> lijiang  writes:
> 
>> On 2019-10-07 17:33, Dave Young wrote:
>>> Hi Lianbo,
>>> On 10/07/19 at 03:08pm, Lianbo Jiang wrote:
>>>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>>>
>>>> Kdump kernel will reuse the first 640k region because of some reasons,
>>>> for example: the trampoline and conventional PC system BIOS region may
>>>> require to allocate memory in this area. Obviously, kdump kernel will
>>>> also overwrite the first 640k region, therefore, kernel has to copy
>>>> the contents of the first 640k area to a backup area, which is done in
>>>> purgatory(), because vmcore may need the old memory. When vmcore is
>>>> dumped, kdump kernel will read the old memory from the backup area of
>>>> the first 640k area.
>>>>
>>>> Basically, the main reason should be clear, kernel does not correctly
>>>> handle the first 640k region when SME is active, which causes that
>>>> kernel does not properly copy these old memory to the backup area in
>>>> purgatory(). Therefore, kdump kernel reads out the incorrect contents
>>>> from the backup area when dumping vmcore. Finally, the phenomenon is
>>>> as follow:
>>>>
>>>> [root linux]$ crash vmlinux 
>>>> /var/crash/127.0.0.1-2019-09-19-08\:31\:27/vmcore
>>>> WARNING: kernel relocated [240MB]: patching 97110 gdb minimal_symbol values
>>>>
>>>>   KERNEL: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmlinux
>>>> DUMPFILE: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmcore  [PARTIAL 
>>>> DUMP]
>>>> CPUS: 128
>>>> DATE: Thu Sep 19 08:31:18 2019
>>>>   UPTIME: 00:01:21
>>>> LOAD AVERAGE: 0.16, 0.07, 0.02
>>>>TASKS: 1343
>>>> NODENAME: amd-ethanol
>>>>  RELEASE: 5.3.0-rc7+
>>>>  VERSION: #4 SMP Thu Sep 19 08:14:00 EDT 2019
>>>>  MACHINE: x86_64  (2195 Mhz)
>>>>   MEMORY: 127.9 GB
>>>>PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>>>  PID: 9789
>>>>  COMMAND: "bash"
>>>> TASK: "89711894ae80  [THREAD_INFO: 89711894ae80]"
>>>>  CPU: 83
>>>>STATE: TASK_RUNNING (PANIC)
>>>>
>>>> crash> kmem -s|grep -i invalid
>>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid 
>>>> freepointer:a6086ac099f0c5a4
>>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid 
>>>> freepointer:a6086ac099f0c5a4
>>>> crash>
>>>>
>>>> BTW: I also tried to fix the above problem in purgatory(), but there
>>>> are too many restrictions in purgatory() context, for example: I can't
>>>> allocate new memory to create the identity mapping page table for SME
>>>> situation.
>>>>
>>>> Currently, there are two places where the first 640k area is needed,
>>>> the first one is in the find_trampoline_placement(), another one is
>>>> in the reserve_real_mode(), and their content doesn't matter. To avoid
>>>> the above error, let's occupy the remaining memory of the first 640k region
>>>> (except for the trampoline and real mode) so that the allocated memory
>>>> does not fall into the first 640k area when SME is active, which makes
>>>> us not to worry about whether kernel can correctly copy the contents of
>>>> the first 640k area to a backup region in the purgatory().
>>>>
>>>> Signed-off-by: Lianbo Jiang 
>>>> ---
>>>> Changes since v1:
>>>> 1. Improve patch log
>>>> 2. Change the checking condition from sme_active() to sme_active()
>>>>&& strstr(boot_command_line, "crashkernel=")
>>>>
>>>>  arch/x86/kernel/setup.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>>> index 77ea96b794bd..bdb1a02a84fd 100644
>>>> --- a/arch/x86/kernel/setup.c
>>>> +++ b/arch/x86/kernel/setup.c
>>>> @@ -1148,6 +1148,9 @@ void __init setup_arch(char **cmdline_p)
>>>>  
>>>>reserve_real_mode();
>>>>  
>>>> +  if (sme_active() && strstr(boot_command_line, "crashkernel="))
>>>> +  memblock_reserve(0, 64

Re: [PATCH v2] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-10-07 Thread lijiang

On 2019-10-07 17:33, Dave Young wrote:
> Hi Lianbo,
> On 10/07/19 at 03:08pm, Lianbo Jiang wrote:
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>
>> Kdump kernel will reuse the first 640k region because of some reasons,
>> for example: the trampoline and conventional PC system BIOS region may
>> require to allocate memory in this area. Obviously, kdump kernel will
>> also overwrite the first 640k region, therefore, kernel has to copy
>> the contents of the first 640k area to a backup area, which is done in
>> purgatory(), because vmcore may need the old memory. When vmcore is
>> dumped, kdump kernel will read the old memory from the backup area of
>> the first 640k area.
>>
>> Basically, the main reason should be clear, kernel does not correctly
>> handle the first 640k region when SME is active, which causes that
>> kernel does not properly copy these old memory to the backup area in
>> purgatory(). Therefore, kdump kernel reads out the incorrect contents
>> from the backup area when dumping vmcore. Finally, the phenomenon is
>> as follow:
>>
>> [root linux]$ crash vmlinux /var/crash/127.0.0.1-2019-09-19-08\:31\:27/vmcore
>> WARNING: kernel relocated [240MB]: patching 97110 gdb minimal_symbol values
>>
>>   KERNEL: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmlinux
>> DUMPFILE: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmcore  [PARTIAL DUMP]
>> CPUS: 128
>> DATE: Thu Sep 19 08:31:18 2019
>>   UPTIME: 00:01:21
>> LOAD AVERAGE: 0.16, 0.07, 0.02
>>TASKS: 1343
>> NODENAME: amd-ethanol
>>  RELEASE: 5.3.0-rc7+
>>  VERSION: #4 SMP Thu Sep 19 08:14:00 EDT 2019
>>  MACHINE: x86_64  (2195 Mhz)
>>   MEMORY: 127.9 GB
>>PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>  PID: 9789
>>  COMMAND: "bash"
>> TASK: "89711894ae80  [THREAD_INFO: 89711894ae80]"
>>  CPU: 83
>>STATE: TASK_RUNNING (PANIC)
>>
>> crash> kmem -s|grep -i invalid
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid 
>> freepointer:a6086ac099f0c5a4
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid 
>> freepointer:a6086ac099f0c5a4
>> crash>
>>
>> BTW: I also tried to fix the above problem in purgatory(), but there
>> are too many restrictions in purgatory() context, for example: I can't
>> allocate new memory to create the identity mapping page table for SME
>> situation.
>>
>> Currently, there are two places where the first 640k area is needed,
>> the first one is in the find_trampoline_placement(), another one is
>> in the reserve_real_mode(), and their content doesn't matter. To avoid
>> the above error, let's occupy the remaining memory of the first 640k region
>> (except for the trampoline and real mode) so that the allocated memory
>> does not fall into the first 640k area when SME is active, which makes
>> us not to worry about whether kernel can correctly copy the contents of
>> the first 640k area to a backup region in the purgatory().
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> Changes since v1:
>> 1. Improve patch log
>> 2. Change the checking condition from sme_active() to sme_active()
>>&& strstr(boot_command_line, "crashkernel=")
>>
>>  arch/x86/kernel/setup.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 77ea96b794bd..bdb1a02a84fd 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -1148,6 +1148,9 @@ void __init setup_arch(char **cmdline_p)
>>  
>>  reserve_real_mode();
>>  
>> +if (sme_active() && strstr(boot_command_line, "crashkernel="))
>> +memblock_reserve(0, 640*1024);
>> +
> 
> Seems you missed the comment about "unconditionally do it", only check
> crashkernel param looks better.
> 
If so, copying the first 640k to a backup region is no longer needed, and
I should post a patch series to remove copy_backup_region(). Any thoughts?

> Also I noticed reserve_crashkernel is called after initmem_init, I'm not
> sure if memblock_reserve is good enough in early code before
> initmem_init. 
>
The first zero page and the real-mode area are also reserved before
initmem_init(), and they seem to have worked well so far.

Thanks.
Lianbo

>>  trim_platform_memory_ranges();
>>  trim_low_memory_range();
>>  
>> -- 
>> 2.17.1
>>
> 
> Thanks
> Dave
> 
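
The two conditions under discussion differ only in whether SME is part of the
check. A minimal user-space sketch, assuming the command line is available as
a plain string: the function names below are hypothetical, and only the
strstr() test on "crashkernel=" and the 640*1024 size come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LOW_640K (640UL * 1024)

/* Condition used in v2 of the patch: SME active and crashkernel= given. */
static bool reserve_low_v2(bool sme_is_active, const char *cmdline)
{
	return sme_is_active && strstr(cmdline, "crashkernel=") != NULL;
}

/* Variant raised in review: depend on crashkernel= alone. */
static bool reserve_low_crashkernel_only(const char *cmdline)
{
	return strstr(cmdline, "crashkernel=") != NULL;
}

/* Bytes of low memory to reserve under the review variant. */
static size_t low_reserve_size(const char *cmdline)
{
	return reserve_low_crashkernel_only(cmdline) ? LOW_640K : 0;
}
```

Note that a bare substring match also fires on any option whose name merely
ends in "crashkernel="; the sketch keeps the patch's test as written rather
than parsing the command line properly.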


Re: [PATCH] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-10-05 Thread lijiang

On 2019-10-01 15:40, Baoquan He wrote:
> On 09/30/19 at 05:14am, Eric W. Biederman wrote:
>> Baoquan He  writes:
>>>> needs a little better description.  I know it is not a lot on modern
>>>> systems but reserving an extra 1M of memory to avoid having to special
>>>> case it later seems in need of calling out.
>>>>
>>>> I have an old system around that I think that 640K is about 25% of
>>>> memory.
>>>
>>> Understood. Basically 640K is wasted in this case. But we only do like
>>> this in SME case, a condition checking is added. And system with SME is
>>> pretty new model, it may not impact the old system.
>>
>> The conditional really should be based on if we are reserving memory
>> for a kdump kernel.  AKA if crash_kernel=XXX is specified on the kernel
>> command line.
>>
>> At which point I think it would be very reasonable to unconditionally
>> reserve the low 640k, and make the whole thing a non-issue.  This would
>> allow the kdump code to just not do anything special for any of the
>> weird special case.
>>
>> It isn't perfect because we need a page or so used in the first kernel
>> for bootstrapping the secondary cpus, but that seems like the least of
>> evils.  Especially as no one will DMA to that memory.
>>
>> So please let's just change what memory we reserve when crash_kernel is
>> specified.
> 
> Yes, makes sense, thanks for pointing it out.
> 

Sorry for the delay, and thanks for your comments, Eric, Baoquan and Dave Young.

I will improve the patch log and add the extra crashkernel= condition, and will
post v2 later.

Thanks.
Lianbo

>>
>>>> How we interact with BIOS tables in the first 640k needs some
>>>> explanation.  Both in the first kernel and in the crash kernel.
>>>
>>> Yes, totally agree.
>>>
>>> Those BIOS tables have been reserved as e820 reserved regions and will
>>> be passed to kdump kernel for reusing. Memblock reserved 640K doesn't
>>> mean it will cover the whole [0, 640K) region, it only searches for
>>> available system RAM from memblock allocator.
>>
>> Careful with that assumption.  My memory is that the e820 memory map
>> frequently fails to cover areas like the real mode interrupt descriptor
>> table at address 0.
> 
> OK, will think more about this. Thanks.
>

Re: [PATCH 1/2] vmcore-dmesg/vmcore-dmesg.c: Fix shifting error reported by cppcheck

2019-09-10 Thread lijiang

On 2019-09-10 18:21, Bhupesh Sharma wrote:
> Running 'cppcheck' static code analyzer (see cppcheck(1))
>  on 'vmcore-dmesg/vmcore-dmesg.c' shows the following
> shifting error:
> 
> $ cppcheck  --enable=all  vmcore-dmesg/vmcore-dmesg.c
> Checking vmcore-dmesg/vmcore-dmesg.c ...
> [vmcore-dmesg/vmcore-dmesg.c:17]: (error) Shifting signed 32-bit value by 31 
> bits is undefined behaviour
> 
> Fix the same via this patch.
> 
> Cc: Lianbo Jiang 
> Cc: Simon Horman 
> Signed-off-by: Bhupesh Sharma 
> ---
>  vmcore-dmesg/vmcore-dmesg.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/vmcore-dmesg/vmcore-dmesg.c b/vmcore-dmesg/vmcore-dmesg.c
> index 81c2a58..122e536 100644
> --- a/vmcore-dmesg/vmcore-dmesg.c
> +++ b/vmcore-dmesg/vmcore-dmesg.c
> @@ -6,7 +6,7 @@ typedef Elf32_Nhdr Elf_Nhdr;
>  extern const char *fname;
>  
>  /* stole this macro from kernel printk.c */
> -#define LOG_BUF_LEN_MAX (uint32_t)(1 << 31)
> +#define LOG_BUF_LEN_MAX (uint32_t)(1U << 31)
> 

This looks better. Thank you, Bhupesh.

>  static void write_to_stdout(char *buf, unsigned int nr)
>  {
> 
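
For context on why the one-character change matters: the literal 1 has type
int, so with a 32-bit int the expression 1 << 31 shifts into the sign bit,
which is what cppcheck reports. Making the operand unsigned avoids that. A
small sketch, assuming unsigned int is at least 32 bits wide; the helper
log_buf_len_max() and the extra outer parentheses are additions for this
example and are not part of the patch.

```c
#include <stdint.h>

/*
 * Signed form flagged by cppcheck (left here only as a comment):
 *   #define LOG_BUF_LEN_MAX (uint32_t)(1 << 31)
 *
 * Unsigned form from the patch, with outer parentheses added so the
 * macro expands safely inside larger expressions.
 */
#define LOG_BUF_LEN_MAX ((uint32_t)(1U << 31))

/* Hypothetical helper that just exposes the constant. */
static uint32_t log_buf_len_max(void)
{
	return LOG_BUF_LEN_MAX;
}
```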


Re: [PATCH 0/4 v2] Limit the size of vmcore-dmesg.txt to 2G

2019-09-08 Thread lijiang

On 2019-09-08 20:40, Simon Horman wrote:
> On Wed, Sep 04, 2019 at 09:29:20PM +0800, lijiang wrote:
>> On 2019-09-03 22:37, Simon Horman wrote:
>>> On Wed, Aug 28, 2019 at 05:18:58PM +0800, lijiang wrote:
>>>> Hi, Simon and other reviewers, any comment about v2?
>>>
>>> Hi,
>>>
>>> sorry for the extended delay.
>>> I will look over this.
>>>
>> Never mind. Any suggestions will be appreciated.
>>
>> Thank you in advance.
> 
> Sorry once again for the delay.
> 
> I have applied this series.
> 
OK, thank you, Simon.


Re: [PATCH 0/4 v2] Limit the size of vmcore-dmesg.txt to 2G

2019-09-04 Thread lijiang

On 2019-09-03 22:37, Simon Horman wrote:
> On Wed, Aug 28, 2019 at 05:18:58PM +0800, lijiang wrote:
>> Hi, Simon and other reviewers, any comment about v2?
> 
> Hi,
> 
> sorry for the extended delay.
> I will look over this.
> 
Never mind. Any suggestions will be appreciated.

Thank you in advance.

Lianbo


Re: crash: `kmem -s` reported "kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e" on a dumped vmcore

2019-08-30 Thread lijiang

On 2019-08-17 15:23, lijiang wrote:
> On 2019-08-11 10:29, lijiang wrote:
>> On 2019-08-09 06:37, Lendacky, Thomas wrote:
>>> On 8/1/19 8:05 PM, Dave Young wrote:
>>>> Add kexec cc list.
>>>> On 08/01/19 at 11:02pm, lijiang wrote:
>>>>> Hi, Tom
>>>>>
>>>>> Recently, i ran into a problem about SME and used crash tool to check the 
>>>>> vmcore as follow:
>>>>>
>>>>> crash> kmem -s | grep -i invalid
>>>>> kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e
>>>>> kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e
>>>>>
>>>>> And the crash tool reported the above error, probably, the main reason is 
>>>>> that kernel does not
>>>>> correctly handle the first 640k region when SME is enabled.
>>>>>
>>>>> When SME is enabled, the kernel and initramfs images are loaded into the 
>>>>> decrypted memory, and
>>>>> the backup area(first 640k) is also mapped as decrypted, but the first 
>>>>> 640k data is copied to
>>>>> the backup area in purgatory(). Please refer to this file: 
>>>>> arch/x86/purgatory/purgatory.c
>>>>> ..
>>>>> static int copy_backup_region(void)
>>>>> {
>>>>>  if (purgatory_backup_dest) {
>>>>>  memcpy((void *)purgatory_backup_dest,
>>>>> (void *)purgatory_backup_src, 
>>>>> purgatory_backup_sz);
>>>>>  }
>>>>>  return 0;
>>>>> }
>>>>> ..
>>>>>
>>>>> arch/x86/kernel/machine_kexec_64.c
>>>>> ..
>>>>> machine_kexec_prepare()->
>>>>> arch_update_purgatory()->
>>>>> .
>>>>>
>>>>> Actually, the first 640k area is encrypted in the first kernel when SME is
>>>>> enabled. The kernel copies the first 640k of data to the backup area in
>>>>> purgatory(); because the backup area is mapped as decrypted, this copy
>>>>> causes the first 640k of data to be decrypted (decoded) and saved to the
>>>>> backup area. However, the kernel is probably not aware of SME in
>>>>> purgatory(), which causes it to read out the first 640k incorrectly.
>>>>>
>>>>> In addition, i hacked kernel code as follow:
>>>>>
>>>>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>>>>> index 7bcc92add72c..a51631d36a7a 100644
>>>>> --- a/fs/proc/vmcore.c
>>>>> +++ b/fs/proc/vmcore.c
>>>>> @@ -377,6 +378,16 @@ static ssize_t __read_vmcore(char *buffer, size_t 
>>>>> buflen, loff_t *fpos,
>>>>>  m->offset + m->size - *fpos,
>>>>>  buflen);
>>>>>  start = m->paddr + *fpos - m->offset;
>>>>> +   if (m->paddr == 0x73f6) {//the backup area's start address:0x73f6
>>>>> +   tmp = read_from_oldmem(buffer, tsz, &start,
>>>>> +   userbuf, false);
>>>>> +   } else
>>>>>  tmp = read_from_oldmem(buffer, tsz, &start,
>>>>> userbuf, mem_encrypt_active());
>>>>>  if (tmp < 0)
>>>>>
>>>>> Here, i used the crash tool to check the vmcore, i can see that the 
>>>>> backup area is decrypted,
>>>>> except for the dma-kmalloc-512. So i suspect that kernel did not 
>>>>> correctly read out the first
>>>>> 640k data to backup area. Do you happen to know how to deal with the 
>>>>> first 640k area in purgatory()
>>>>> when SME is enabled? Any idea?
>>>
>>> I'm not all that familiar with kexec and purgatory, etc., but I think
>>> that you want to setup the page table that is active when purgatory runs
>>> so that the src and dest both have the SME encryption mask set in their
>>>

Re: [PATCH 0/4 v2] Limit the size of vmcore-dmesg.txt to 2G

2019-08-28 Thread lijiang

Hi, Simon and other reviewers, any comments on v2?

Thanks.
Lianbo


> [PATCH 1/4] Cleanup: remove the read_elf_kcore()
> Here, no need to wrap the read_elf() again, lets invoke it directly.
> So remove the read_elf_kcore() and clean up redundant code.
> 
> [PATCH 2/4] Fix an error definition about the variable 'fname'
> The variable 'fname' is mistakenly defined twice: the first definition
> is in vmcore-dmesg.c, and the second is in elf_info.c. That is confusing
> and incorrect even though it is static, because the value of 'fname' is
> never assigned in elf_info.c, so it will always be 'null' when an error
> message is printed.
> 
> [PATCH 3/4] Cleanup: move it back from util_lib/elf_info.c
> Some code related to vmcore-dmesg.c is put into the util_lib, which
> is not very reasonable, so lets move it back and tidy up those code.
> In addition, that will also help to limit the size of vmcore-dmesg.txt.
> 
> [PATCH 4/4] Limit the size of vmcore-dmesg.txt to 2G
> With some corrupted vmcore files, the vmcore-dmesg.txt file may
> grow forever till the kdump disk becomes full. Lets limit the
> size of vmcore-dmesg.txt to avoid such problems.
> 
> BTW: I tested this patch series on x86 64 and arm64, it also worked well.
> 
> Changes since v1:
> [1] split them([patch 1/4] and [patch 2/4]) into a separate patch.
> [2] remove a typedef definition for handler.
> [3] remove some changes of variable 'fname' and fix its error.
> 
> Lianbo Jiang (4):
>   Cleanup: remove the read_elf_kcore()
>   Fix an error definition about the variable 'fname'
>   Cleanup: move it back from util_lib/elf_info.c
>   Limit the size of vmcore-dmesg.txt to 2G
> 
>  kexec/arch/arm64/kexec-arm64.c |  2 +-
>  util_lib/elf_info.c            | 65 --
>  util_lib/include/elf_info.h    |  4 +--
>  vmcore-dmesg/vmcore-dmesg.c    | 42 --
>  4 files changed, 57 insertions(+), 56 deletions(-)
> 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] cleanup: move it back from util_lib/elf_info.c

2019-08-23 Thread lijiang

On 2019-08-23 13:24, lijiang wrote:
> On 2019-08-22 16:51, Simon Horman wrote:
>> Hi Lianbo,
>>
>> I like where this patch is going but I would like to request a few changes.
>> Please see comments inline.
>>
> 
> Thanks for your comment, Simon.
> 
>> On Thu, Aug 15, 2019 at 11:37:55AM +0800, Lianbo Jiang wrote:
>>> Some code related to vmcore-dmesg.c is put into the util_lib, which
>>> is not very reasonable, so lets move it back and tidy up those code.
>>>
>>> In addition, that will also help to limit the size of vmcore-dmesg.txt.
>>>
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>>  kexec/arch/arm64/kexec-arm64.c |  2 +-
>>>  util_lib/elf_info.c            | 73 --
>>>  util_lib/include/elf_info.h    |  8 +++-
>>>  vmcore-dmesg/vmcore-dmesg.c    | 44 +---
>>>  4 files changed, 61 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
>>> index eb3a3a37307c..6ad3b0a134b3 100644
>>> --- a/kexec/arch/arm64/kexec-arm64.c
>>> +++ b/kexec/arch/arm64/kexec-arm64.c
>>> @@ -889,7 +889,7 @@ int get_phys_base_from_pt_load(unsigned long *phys_offset)
>>> return EFAILED;
>>> }
>>>  
>>> -   read_elf_kcore(fd);
>>> +   read_elf(fd);
>>>  
>>> for (i = 0; get_pt_load(i,
>>> &phys_start, NULL, &virt_start, NULL);
>>> diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
>>> index 90a3b21662e7..2f254e972721 100644
>>> --- a/util_lib/elf_info.c
>>> +++ b/util_lib/elf_info.c
>>> @@ -20,7 +20,6 @@
>>>  /* The 32bit and 64bit note headers make it clear we don't care */
>>>  typedef Elf32_Nhdr Elf_Nhdr;
>>>  
>>> -static const char *fname;
>>>  static Elf64_Ehdr ehdr;
>>>  static Elf64_Phdr *phdr;
>>>  static int num_pt_loads;
>>> @@ -120,8 +119,8 @@ void read_elf32(int fd)
>>>  
>>> ret = pread(fd, &ehdr32, sizeof(ehdr32), 0);
>>> if (ret != sizeof(ehdr32)) {
>>> -   fprintf(stderr, "Read of Elf header from %s failed: %s\n",
>>> -   fname, strerror(errno));
>>> +   fprintf(stderr, "Read of Elf header failed in %s: %s\n",
>>> +   __func__, strerror(errno));
>>
>> I'm not sure of the merit of changing the logging output.
> 
> The variable 'fname' is defined twice: the first definition is in
> vmcore-dmesg.c, and the second is in elf_info.c. That is confusing even
> though both are static, because I do not see where the value of 'fname'
> is set in elf_info.c. So I guess that it was meant to be the same
> variable as the one in vmcore-dmesg.c, and it also needs cleaning up.
> 
BTW: I guess the original definition of 'fname' was meant to look like this:

diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
index d9397ecd8626..5d0efaafab53 100644
--- a/util_lib/elf_info.c
+++ b/util_lib/elf_info.c
@@ -20,7 +20,7 @@
 /* The 32bit and 64bit note headers make it clear we don't care */
 typedef Elf32_Nhdr Elf_Nhdr;
 
-static const char *fname;
+const char *fname;
 static Elf64_Ehdr ehdr;
 static Elf64_Phdr *phdr;
 static int num_pt_loads;
diff --git a/vmcore-dmesg/vmcore-dmesg.c b/vmcore-dmesg/vmcore-dmesg.c
index 7a386b380291..bebc348a657e 100644
--- a/vmcore-dmesg/vmcore-dmesg.c
+++ b/vmcore-dmesg/vmcore-dmesg.c
@@ -3,7 +3,7 @@
 /* The 32bit and 64bit note headers make it clear we don't care */
 typedef Elf32_Nhdr Elf_Nhdr;
 
-static const char *fname;
+extern const char *fname;
 
 int main(int argc, char **argv)
 {
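
For background on why the two definitions never meet: a file-scope 'static'
variable in C has internal linkage, so the 'fname' in elf_info.c and the
'fname' in vmcore-dmesg.c are two separate objects, and the one that is never
assigned stays NULL. Besides the extern approach in the diff above, the name
could also be handed over through a setter. The following is only a minimal
sketch of that idea; elf_fname, elf_info_set_fname() and format_read_error()
are hypothetical names, not part of kexec-tools.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the file-scope pointer in util_lib/elf_info.c. */
static const char *elf_fname;

/* Hypothetical setter; the caller (e.g. vmcore-dmesg) passes its file name in. */
void elf_info_set_fname(const char *name)
{
	elf_fname = name;
}

/*
 * Build the read-failure message. If no name was ever set, fall back to a
 * placeholder instead of handing a NULL pointer to "%s".
 */
int format_read_error(char *out, size_t len, const char *reason)
{
	assert(out != NULL && len > 0);
	return snprintf(out, len, "Read of Elf header from %s failed: %s",
			elf_fname ? elf_fname : "(unknown file)", reason);
}
```

With this shape the library keeps a single private pointer, and no second
definition of 'fname' is needed in the caller.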

>> And moreover I don't think it belongs in this patch
>> as it doesn't seem related to the other changes.
>>
> 
> Good question. I will consider how to clean this up. It should probably be a
> separate patch.
> 
>>> exit(10);
>>> }
>>>  
>>> @@ -193,8 +192,8 @@ void read_elf64(int fd)
>>>  
>>> ret = pread(fd, &ehdr64, sizeof(ehdr64), 0);
>>> if (ret < 0 || (size_t)ret != sizeof(ehdr)) {
>>> -   fprintf(stderr, "Read of Elf header from %s failed: %s\n",
>>> -   fname, strerror(errno));
>>> +   fprintf(stderr, "Read of Elf header failed in %s: %s\n",
>>> +   __func__, strerror(errno));
>>> exit(10);
>>> }
>>>  
>>> @@ -531,19 +530,7 @@ static int32_t read_file_s32(int fd, uint64_t addr)
>>> return

Re: [PATCH 2/2] Limit the size of vmcore-dmesg.txt to 2G

2019-08-22 Thread lijiang

On 2019-08-22 16:52, Simon Horman wrote:
> On Thu, Aug 15, 2019 at 11:37:56AM +0800, Lianbo Jiang wrote:
>> With some corrupted vmcore files, the vmcore-dmesg.txt file may grow
>> forever until the kdump disk becomes full, and this probably also causes
>> disk error messages such as the following:
>> ...
>> sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> sd 0:0:0:0: [sda] tag#6 CDB: Read(10) 28 00 08 06 4c 98 00 00 08 00
>> blk_update_request: I/O error, dev sda, sector 134630552
>> sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> sd 0:0:0:0: [sda] tag#7 CDB: Read(10) 28 00 08 06 4c 98 00 00 08 00
>> blk_update_request: I/O error, dev sda, sector 134630552
>> ...
>>
>> If vmcore-dmesg.txt occupies the whole disk, the vmcore cannot be
>> saved, which is also a problem.
>>
>> Let's limit the size of vmcore-dmesg.txt to avoid such problems.
>>
>> Signed-off-by: Lianbo Jiang 
> 
> Thanks, this looks good to me.
> 
> Please repost this patch with an updated version of Patch 1/2.
> 

OK, thank you, Simon. I will improve them and post again.

>> ---
>>  vmcore-dmesg/vmcore-dmesg.c | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/vmcore-dmesg/vmcore-dmesg.c b/vmcore-dmesg/vmcore-dmesg.c
>> index ff0d540c9130..5ada3566972b 100644
>> --- a/vmcore-dmesg/vmcore-dmesg.c
>> +++ b/vmcore-dmesg/vmcore-dmesg.c
>> @@ -1,8 +1,18 @@
>>  #include 
>>  
>> +/* stole this macro from kernel printk.c */
>> +#define LOG_BUF_LEN_MAX (uint32_t)(1 << 31)
>> +
>>  static void write_to_stdout(char *buf, unsigned int nr)
>>  {
>>  ssize_t ret;
>> +static uint32_t n_bytes = 0;
>> +
>> +n_bytes += nr;
>> +if (n_bytes > LOG_BUF_LEN_MAX) {
>> +fprintf(stderr, "The vmcore-dmesg.txt over 2G in size is not supported.\n");
>> +exit(55);
>> +}
>>  
>>  ret = write(STDOUT_FILENO, buf, nr);
>>  if (ret != nr) {
>> -- 
>> 2.17.1
>>
>>
>>
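
The byte accounting in the patch above can also be sketched with a 64-bit
running total, which cannot wrap around at 4G and lets the caller decide how
to fail. This is a minimal sketch under those assumptions, not the code that
was posted; DMESG_OUT_MAX, dmesg_bytes_out and dmesg_account() are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* 2 GiB cap, written as an unsigned 64-bit constant so no signed shift is involved. */
#define DMESG_OUT_MAX (UINT64_C(1) << 31)

/* Hypothetical running total of bytes accepted so far. */
static uint64_t dmesg_bytes_out;

/*
 * Account for nr more bytes. Returns 0 if the total stays within the cap,
 * or -1 (leaving the total unchanged) if writing nr bytes would exceed it.
 */
int dmesg_account(unsigned int nr)
{
	assert(dmesg_bytes_out <= DMESG_OUT_MAX);
	if (nr > DMESG_OUT_MAX - dmesg_bytes_out)
		return -1;
	dmesg_bytes_out += nr;
	return 0;
}
```

A write handler would call dmesg_account(nr) before write() and print the
error and exit when it returns -1, which mirrors the exit(55) path in the
patch.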

Re: [PATCH 1/2] cleanup: move it back from util_lib/elf_info.c

2019-08-22 Thread lijiang

On 2019-08-22 16:51, Simon Horman wrote:
> Hi Lianbo,
> 
> I like where this patch is going but I would like to request a few changes.
> Please see comments inline.
> 

Thanks for your comment, Simon.

> On Thu, Aug 15, 2019 at 11:37:55AM +0800, Lianbo Jiang wrote:
>> Some code related to vmcore-dmesg.c is put into the util_lib, which
>> is not very reasonable, so lets move it back and tidy up those code.
>>
>> In addition, that will also help to limit the size of vmcore-dmesg.txt.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  kexec/arch/arm64/kexec-arm64.c |  2 +-
>>  util_lib/elf_info.c            | 73 --
>>  util_lib/include/elf_info.h    |  8 +++-
>>  vmcore-dmesg/vmcore-dmesg.c    | 44 +---
>>  4 files changed, 61 insertions(+), 66 deletions(-)
>>
>> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
>> index eb3a3a37307c..6ad3b0a134b3 100644
>> --- a/kexec/arch/arm64/kexec-arm64.c
>> +++ b/kexec/arch/arm64/kexec-arm64.c
>> @@ -889,7 +889,7 @@ int get_phys_base_from_pt_load(unsigned long *phys_offset)
>>  return EFAILED;
>>  }
>>  
>> -read_elf_kcore(fd);
>> +read_elf(fd);
>>  
>>  for (i = 0; get_pt_load(i,
>>  &phys_start, NULL, &virt_start, NULL);
>> diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
>> index 90a3b21662e7..2f254e972721 100644
>> --- a/util_lib/elf_info.c
>> +++ b/util_lib/elf_info.c
>> @@ -20,7 +20,6 @@
>>  /* The 32bit and 64bit note headers make it clear we don't care */
>>  typedef Elf32_Nhdr Elf_Nhdr;
>>  
>> -static const char *fname;
>>  static Elf64_Ehdr ehdr;
>>  static Elf64_Phdr *phdr;
>>  static int num_pt_loads;
>> @@ -120,8 +119,8 @@ void read_elf32(int fd)
>>  
>>  ret = pread(fd, &ehdr32, sizeof(ehdr32), 0);
>>  if (ret != sizeof(ehdr32)) {
>> -fprintf(stderr, "Read of Elf header from %s failed: %s\n",
>> -fname, strerror(errno));
>> +fprintf(stderr, "Read of Elf header failed in %s: %s\n",
>> +__func__, strerror(errno));
> 
> I'm not sure of the merit of changing the logging output.

The variable 'fname' is defined twice: the first definition is in
vmcore-dmesg.c, and the second is in elf_info.c. That is confusing even
though both are static, because I do not see where the value of 'fname'
is set in elf_info.c. So I guess that it was meant to be the same
variable as the one in vmcore-dmesg.c, and it also needs cleaning up.

> And moreover I don't think it belongs in this patch
> as it doesn't seem related to the other changes.
> 

Good question. I will consider how to clean this up. It should probably be a
separate patch.

>>  exit(10);
>>  }
>>  
>> @@ -193,8 +192,8 @@ void read_elf64(int fd)
>>  
>>  ret = pread(fd, &ehdr64, sizeof(ehdr64), 0);
>>  if (ret < 0 || (size_t)ret != sizeof(ehdr)) {
>> -fprintf(stderr, "Read of Elf header from %s failed: %s\n",
>> -fname, strerror(errno));
>> +fprintf(stderr, "Read of Elf header failed in %s: %s\n",
>> +__func__, strerror(errno));
>>  exit(10);
>>  }
>>  
>> @@ -531,19 +530,7 @@ static int32_t read_file_s32(int fd, uint64_t addr)
>>  return read_file_u32(fd, addr);
>>  }
>>  
>> -static void write_to_stdout(char *buf, unsigned int nr)
>> -{
>> -ssize_t ret;
>> -
>> -ret = write(STDOUT_FILENO, buf, nr);
>> -if (ret != nr) {
>> -fprintf(stderr, "Failed to write out the dmesg log buffer!:"
>> -" %s\n", strerror(errno));
>> -exit(54);
>> -}
>> -}
>> -
>> -static void dump_dmesg_legacy(int fd)
>> +void dump_dmesg_legacy(int fd, handler_t handler)
>>  {
>>  uint64_t log_buf, log_buf_offset;
>>  unsigned log_end, logged_chars, log_end_wrapped;
>> @@ -604,7 +591,7 @@ static void dump_dmesg_legacy(int fd)
>>   */
>>  logged_chars = log_end < log_buf_len ? log_end : log_buf_len;
>>  
>> -write_to_stdout(buf + (log_buf_len - logged_chars), logged_chars);
>> +handler(buf + (log_buf_len - logged_chars), logged_chars);
>>  }
>>  
>>  static inline uint16_t struct_val_u16(char *ptr, unsigned int offset)
>> @@ -623,7 +610,7 @@ static inline uint64_t struct_val_u64(char *ptr, unsigned int offset)
>>  }
>>  
>>  /* Read headers of log records and dump accordingly */
>> -static void dump_dmesg_structured(int fd)
>> +void dump_dmesg_structured(int fd, handler_t handler)
>>  {
>>  #define OUT_BUF_SIZE 4096
>>  uint64_t log_buf, log_buf_offset, ts_nsec;
>> @@ -733,7 +720,7 @@ static void dump_dmesg_structured(int fd)
>>  out_buf[len++] = c;
>>  
>>  if (len >= OUT_BUF_SIZE - 64) {
>> -write_to_stdout(out_buf, len);
>> +handler(out_buf, len);
>>  len = 0;
>>  }
>>  }
>>
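
The refactoring quoted above replaces the hard-wired write_to_stdout() call
with a caller-supplied handler. A minimal sketch of that callback pattern
follows. The handler_t typedef shown here is an assumption (the header change
is not visible in the quoted hunks), and dump_in_chunks() and count_handler()
are hypothetical stand-ins for the real dump functions and output handler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed shape of the callback type; the real typedef is not shown in the diff. */
typedef void (*handler_t)(char *buf, unsigned int nr);

/* Hypothetical stand-in for a dump function: emit text in fixed-size chunks. */
void dump_in_chunks(char *text, unsigned int chunk, handler_t handler)
{
	size_t left = strlen(text);

	assert(chunk > 0 && handler != NULL);
	while (left > 0) {
		unsigned int nr = left < chunk ? (unsigned int)left : chunk;

		handler(text, nr);	/* the library no longer knows where output goes */
		text += nr;
		left -= nr;
	}
}

/* Example handler: count calls and bytes instead of writing to stdout. */
static unsigned int calls, bytes;

void count_handler(char *buf, unsigned int nr)
{
	(void)buf;
	calls++;
	bytes += nr;
}
```

Because the output policy lives in the handler, a size limit such as the 2G
cap can be enforced in vmcore-dmesg.c without touching util_lib.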

Re: crash: `kmem -s` reported "kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e" on a dumped vmcore

2019-08-17 Thread lijiang

On 2019-08-11 10:29, lijiang wrote:
> On 2019-08-09 06:37, Lendacky, Thomas wrote:
>> On 8/1/19 8:05 PM, Dave Young wrote:
>>> Add kexec cc list.
>>> On 08/01/19 at 11:02pm, lijiang wrote:
>>>> Hi, Tom
>>>>
>>>> Recently, I ran into a problem with SME and used the crash tool to check the
>>>> vmcore as follows:
>>>>
>>>> crash> kmem -s | grep -i invalid
>>>> kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e
>>>> kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e
>>>>
>>>> The crash tool reported the above error. The main reason is probably that the
>>>> kernel does not correctly handle the first 640k region when SME is enabled.
>>>>
>>>> When SME is enabled, the kernel and initramfs images are loaded into
>>>> decrypted memory, and the backup area (for the first 640k) is also mapped as
>>>> decrypted, but the first 640k of data is copied to the backup area in
>>>> purgatory(). Please refer to this file: arch/x86/purgatory/purgatory.c
>>>> ..
>>>> static int copy_backup_region(void)
>>>> {
>>>>  if (purgatory_backup_dest) {
>>>>  memcpy((void *)purgatory_backup_dest,
>>>> (void *)purgatory_backup_src, purgatory_backup_sz);
>>>>  }
>>>>  return 0;
>>>> }
>>>> ..
>>>>
>>>> arch/x86/kernel/machine_kexec_64.c
>>>> ..
>>>> machine_kexec_prepare()->
>>>> arch_update_purgatory()->
>>>> .
>>>>
>>>> Actually, the first 640k area is encrypted in the first kernel when SME is
>>>> enabled. Here the kernel copies the first 640k of data to the backup area in
>>>> purgatory(). Because the backup area is mapped as decrypted, this copy
>>>> operation causes the first 640k of data to be decrypted (decoded) and saved
>>>> to the backup area, but the kernel is probably not aware of SME in
>>>> purgatory(), which causes it to read out the first 640k incorrectly.
>>>>
>>>> In addition, I hacked the kernel code as follows:
>>>>
>>>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>>>> index 7bcc92add72c..a51631d36a7a 100644
>>>> --- a/fs/proc/vmcore.c
>>>> +++ b/fs/proc/vmcore.c
>>>> @@ -377,6 +378,16 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
>>>>  m->offset + m->size - *fpos,
>>>>  buflen);
>>>>  start = m->paddr + *fpos - m->offset;
>>>> +   if (m->paddr == 0x73f6) {//the backup area's start address:0x73f6
>>>> +   tmp = read_from_oldmem(buffer, tsz, &start,
>>>> +   userbuf, false);
>>>> +   } else
>>>>  tmp = read_from_oldmem(buffer, tsz, &start,
>>>> userbuf, mem_encrypt_active());
>>>>  if (tmp < 0)
>>>>
>>>> Here, I used the crash tool to check the vmcore, and I can see that the
>>>> backup area is decrypted, except for dma-kmalloc-512. So I suspect that the
>>>> kernel did not correctly read out the first 640k of data to the backup
>>>> area. Do you happen to know how to deal with the first 640k area in
>>>> purgatory() when SME is enabled? Any ideas?
>>
>> I'm not all that familiar with kexec and purgatory, etc., but I think
>> that you want to setup the page table that is active when purgatory runs
>> so that the src and dest both have the SME encryption mask set in their
>> respective page table entries. This way, when the copy is performed,
>> everything is copied correctly. 
> 
> Exactly. That's just what I was thinking.
> 

I tried to set up the 1:1 mapping in init_pgtable() with the memory encryption
mask, but purgatory() still could not access the encrypted memory correctly.
I'm not sure whether I missed anything else; I'm still digging into it.

Re: crash: `kmem -s` reported "kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e" on a dumped vmcore

2019-08-10 Thread lijiang

On 2019-08-09 06:37, Lendacky, Thomas wrote:
> On 8/1/19 8:05 PM, Dave Young wrote:
>> Add kexec cc list.
>> On 08/01/19 at 11:02pm, lijiang wrote:
>>> Hi, Tom
>>>
>>> Recently, I ran into a problem with SME and used the crash tool to check the
>>> vmcore as follows:
>>>
>>> crash> kmem -s | grep -i invalid
>>> kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e
>>> kmem: dma-kmalloc-512: slab: ffffe192c0001000 invalid freepointer: e5ffef4e9a040b7e
>>>
>>> The crash tool reported the above error. The main reason is probably that the
>>> kernel does not correctly handle the first 640k region when SME is enabled.
>>>
>>> When SME is enabled, the kernel and initramfs images are loaded into
>>> decrypted memory, and the backup area (for the first 640k) is also mapped as
>>> decrypted, but the first 640k of data is copied to the backup area in
>>> purgatory(). Please refer to this file: arch/x86/purgatory/purgatory.c
>>> ..
>>> static int copy_backup_region(void)
>>> {
>>>  if (purgatory_backup_dest) {
>>>  memcpy((void *)purgatory_backup_dest,
>>> (void *)purgatory_backup_src, purgatory_backup_sz);
>>>  }
>>>  return 0;
>>> }
>>> ..
>>>
>>> arch/x86/kernel/machine_kexec_64.c
>>> ..
>>> machine_kexec_prepare()->
>>> arch_update_purgatory()->
>>> .
>>>
>>> Actually, the first 640k area is encrypted in the first kernel when SME is
>>> enabled. Here the kernel copies the first 640k of data to the backup area in
>>> purgatory(). Because the backup area is mapped as decrypted, this copy
>>> operation causes the first 640k of data to be decrypted (decoded) and saved
>>> to the backup area, but the kernel is probably not aware of SME in
>>> purgatory(), which causes it to read out the first 640k incorrectly.
>>>
>>> In addition, I hacked the kernel code as follows:
>>>
>>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>>> index 7bcc92add72c..a51631d36a7a 100644
>>> --- a/fs/proc/vmcore.c
>>> +++ b/fs/proc/vmcore.c
>>> @@ -377,6 +378,16 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
>>>  m->offset + m->size - *fpos,
>>>  buflen);
>>>  start = m->paddr + *fpos - m->offset;
>>> +   if (m->paddr == 0x73f6) {//the backup area's start address:0x73f6
>>> +   tmp = read_from_oldmem(buffer, tsz, &start,
>>> +   userbuf, false);
>>> +   } else
>>>  tmp = read_from_oldmem(buffer, tsz, &start,
>>> userbuf, mem_encrypt_active());
>>>  if (tmp < 0)
>>>
>>> Here, I used the crash tool to check the vmcore, and I can see that the
>>> backup area is decrypted, except for dma-kmalloc-512. So I suspect that the
>>> kernel did not correctly read out the first 640k of data to the backup
>>> area. Do you happen to know how to deal with the first 640k area in
>>> purgatory() when SME is enabled? Any ideas?
> 
> I'm not all that familiar with kexec and purgatory, etc., but I think
> that you want to setup the page table that is active when purgatory runs
> so that the src and dest both have the SME encryption mask set in their
> respective page table entries. This way, when the copy is performed,
> everything is copied correctly. 

Exactly. That's just what I was thinking.

> Remember, encrypted data from one page
> cannot be directly copied as unencrypted data and decrypted properly in
> the new location (e.g. a page of zeroes encrypted at one address will not
> appear the same as a page of zeroes encrypted at a different address).

Yes, that's right. Thank you, Tom.

I'm considering how to solve it, and I guess that this problem probably needs
to be dealt with properly in purgatory().
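
Tom's point about address-dependent encryption can be illustrated with a toy
model. This is only a sketch: the XOR "cipher" below is not how SME works in
hardware, and tweak(), toy_store(), toy_load(), toy_fill(), copy_raw() and
copy_encrypted() are hypothetical names. It shows that moving raw ciphertext
to a different address does not decrypt to the original data, while a copy
that decrypts on load and re-encrypts on store does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy address-dependent key byte; stands in for the hardware's address tweak. */
static uint8_t tweak(uint64_t addr)
{
	return (uint8_t)(addr ^ (addr >> 8) ^ 0x5a);
}

/* Store one plaintext byte through an "encrypted" mapping. */
void toy_store(uint8_t *ram, uint64_t addr, uint8_t plain)
{
	ram[addr] = (uint8_t)(plain ^ tweak(addr));
}

/* Load one byte through an "encrypted" mapping. */
uint8_t toy_load(const uint8_t *ram, uint64_t addr)
{
	return (uint8_t)(ram[addr] ^ tweak(addr));
}

/* Fill n bytes with the same plaintext value through the encrypted mapping. */
void toy_fill(uint8_t *ram, uint64_t addr, size_t n, uint8_t plain)
{
	for (size_t i = 0; i < n; i++)
		toy_store(ram, addr + i, plain);
}

/* Copy without the encryption mask: raw ciphertext bytes move unchanged. */
void copy_raw(uint8_t *ram, uint64_t dst, uint64_t src, size_t n)
{
	assert(ram != NULL);
	for (size_t i = 0; i < n; i++)
		ram[dst + i] = ram[src + i];
}

/* Copy with the mask set on both sides: decrypt on load, re-encrypt on store. */
void copy_encrypted(uint8_t *ram, uint64_t dst, uint64_t src, size_t n)
{
	assert(ram != NULL);
	for (size_t i = 0; i < n; i++)
		toy_store(ram, dst + i, toy_load(ram, src + i));
}
```

In this model copy_raw() plays the role of a memcpy() run with the wrong page
table attributes, and copy_encrypted() plays the role of the same memcpy()
with the encryption mask set for both source and destination.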

Thanks.
Lianbo

> 
> Thanks,
> Tom
> 
>>>
>>> BTW: I'm curious why the address of dma-kmalloc-512 always falls into the
>>> first 640k region, and I did not see the same issue on another machine.
>>>
>>> Machine:
>>> Serial Number   diesel-sys9079-0001
>>> Model   AMD Diesel (A0C)
>>> CPU AMD EPYC 7601 32-Core Processor
>>>
>>>
>>> Background:
>>> On x86_64, the first 640k region is special for historical reasons, and the
>>> kdump kernel will reuse it. So the kernel backs up (copies) the first 640k
>>> region to a backup area in purgatory(), so that the kdump kernel does not
>>> overwrite the old 640k region. This makes sure that kdump can read out the
>>> old memory from the vmcore.
>>>
>>>
>>> Thanks.
>>> Lianbo

Re: [PATCH 2/3 v3] x86/kexec: Set the C-bit in the identity map page table when SEV is active

2019-06-11 Thread lijiang

On 2019-05-16 16:15, Boris Petkov wrote:
> On May 16, 2019 3:12:26 AM GMT+02:00, lijiang  wrote:
>> OK, i will modify it according to your suggestion and post again.
> 
> No need - i fixed it up already. 
> 

Hi, so far I haven't seen the upstream branch pick up this patch series.
Any updates?

Thanks.
Lianbo

Re: [PATCH 0/3 v11] add reserved e820 ranges to the kdump kernel e820 table

2019-06-11 Thread lijiang

On 2019-06-10 19:37, Borislav Petkov wrote:
> On Sat, Jun 08, 2019 at 06:26:59PM +0800, Baoquan He wrote:
>> OK, I see. Then it should be the issue we have met and talked about with
>> Tom.
>> https://lkml.kernel.org/r/20190604134952.GC26891@MiWiFi-R3L-srv
>>
>> You can apply Tom's patch as below. I tested it, it can make kexec
>> kernel succeed to boot, but failed for kdump kernel booting. The kdump
>> kernel can boot till the end of kernel initialization, then hang with a
>> call trace. I have pasted the log in the above thread. Haven't got the
>> reason.
>> http://lkml.kernel.org/r/508c2853-dc4f-70a6-6fa8-97c950dc3...@amd.com
> 
> I can confirm the same observation.
> 
I haven't seen any updates yet, so I'm not sure whether this patch
passed your test.

Thanks.
Lianbo

> Thx.
>

Re: The current SME implementation fails kexec/kdump kernel booting.

2019-06-11 Thread lijiang

On 2019-06-09 11:45, lijiang wrote:
> On 2019-06-06 00:04, Lendacky, Thomas wrote:
>> On 6/4/19 7:56 PM, Baoquan He wrote:
>>> On 06/04/19 at 03:56pm, Lendacky, Thomas wrote:
>>>> On 6/4/19 8:49 AM, Baoquan He wrote:
>>>>> Hi Tom,
>>>>>
>>>>> Lianbo reported kdump kernel can't boot well with 'nokaslr' added, and
>>>>> have to enable KASLR in kdump kernel to make it boot successfully. This
>>>>> blocked his work on enabling sme for kexec/kdump. And on some machines
>>>>> SME kernel can't boot in 1st kernel.
>>>>>
>>>>> I checked code of SME implementation, and found out the root cause. The
>>>>> above failures are caused by SME code, sme_encrypt_kernel(). In
>>>>> sme_encrypt_kernel(), you get a 2M of encryption work area as intermediate
>>>>> buffer to encrypt kernel in-place. And the work area is just after _end of
>>>>> kernel.
>>>>
>>>> I remember worrying about something like this back when I was testing the
>>>> kexec support. I had come up with a patch to address it, but never got the
>>>> time to test and submit it.  I've included it here if you'd like to test
>>>> it (I haven't done run this patch in quite some time). If it works, we can
>>>> think about submitting it.
>>>
>>> Thanks for your quick response and making this patch, Tom.
>>>
>>> Tested on a speedway machine, it entered into kernel, but failed in
>>> below stage. Tested two times, always happened.
>>
>> Is this the initial kernel boot or the kexec kernel boot?
>>
>> It looks like this is related to the initrd/initramfs decryption. Not
>> sure what could be happening there. I just tried the patch on my Naples
>> system and a 5.2.0-rc3 kernel and have been able to repeatedly kexec boot
>> a number of times so far.
>>
> 
> I used the hacked kexec-tools (by Baoquan) to test it, and the kexec-d kernel
> and kdump kernel worked well. But Tom's patch only worked for the kexec-d
> kernel; the kdump kernel did not work (it could not boot successfully).
> What's the difference between them?
> 

After applying Tom's patch, I changed the memory reserved for the crash kernel
to more than 256M, such as crashkernel=320M, 384M or 512M, and the kdump kernel
worked and successfully dumped the vmcore.

But with 256M or less, the kdump kernel always panicked or could not boot
successfully. On the HP machine, I noticed that it printed OOM messages: the
kdump kernel had too little memory. But I never saw the OOM on the speedway
machine (probably because earlyprintk doesn't work there and many logs are
lost).

After removing the option 'CONFIG_DEBUG_INFO' from .config, I tested again. The
kdump kernel did not panic with crashkernel=256M, and it worked and succeeded
in dumping the vmcore on both the HP machine and the speedway machine.

It seems that too little memory caused the previous failures in the kdump
kernel. I would suggest posting this patch upstream. What's your opinion, Tom,
Baoquan and others? Or do you have any comments?

Thanks.
Lianbo

> Thanks
> Lianbo
> 
>> Thanks,
>> Tom
>>
>>>
>>>
>>> [4.978521] Freeing unused decrypted memory: 2040K
>>> [4.983800] Freeing unused kernel image memory: 2344K
>>> [4.988943] Write protecting the kernel read-only data: 18432k
>>> [4.995306] Freeing unused kernel image memory: 2012K
>>> [5.000488] Freeing unused kernel image memory: 256K
>>> [5.005540] Run /init as init process
>>> [5.009443] Kernel panic - not syncing: Attempted to kill init! exitcode=0x7f00
>>> [5.017230] CPU: 0 PID: 1 Comm: init Not tainted 5.2.0-rc2+ #38
>>> [5.023251] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1004B 10/18/2017
>>> [5.031299] Call Trace:
>>> [5.033793]  dump_stack+0x46/0x60
>>> [5.037169]  panic+0xfb/0x2cb
>>> [5.040191]  do_exit.cold.21+0x59/0x81
>>> [5.044004]  do_group_exit+0x3a/0xa0
>>> [5.047640]  __x64_sys_exit_group+0x14/0x20
>>> [5.051899]  do_syscall_64+0x55/0x1c0
>>> [5.055627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [5.060764] RIP: 0033:0x7fa1b1fc9e2e
>>> [5.064404] Code: Bad RIP value.
>>> [5.067687] RSP: 002b:7fffc5abb778 EFLAGS: 0202 ORIG_RAX: 00e7
>>> [5.075296] RAX: ffda RBX: 7fa1b1fd2528 RCX: 7fa1b1fc9e2e
>>> [5.082625] RDX: 007f RSI: 003c RDI: 007f
>>> [5.089879] RBP: 7fa1b21d8d00 R08: 00e7 R09: 7fffc5abb688
>>> [5.097134] R10:  R11: 0202 R12: 0002
>>> [5.104386] R13: 0001 R14: 7fa1b21d8d40 R15: 7fa1b21d8d30
>>> [5.111645] Kernel Offset: disabled
>>> [5.423002] Rebooting in 10 seconds..
>>> [   15.429641] ACPI MEMORY or I/O RESET_REG.
>>>

Re: [PATCH 0/3 v11] add reserved e820 ranges to the kdump kernel e820 table

2019-06-08 Thread lijiang

On 2019-06-08 01:42, Borislav Petkov wrote:
> On Tue, May 28, 2019 at 03:30:21PM +0800, lijiang wrote:
>> Hi, Boris and Thomas
>>
>> Could you give me any suggestions about this patch series? Other reviewers?
> 
> So I'm testing this on a box with SME enabled but after loading the
> crash kernel, it freezes instead of rebooting. My cmdline is:
> 
>  kexec -s -p /boot/vmlinuz-5.2.0-rc3+ --initrd=/boot/initrd.img-5.2.0-rc3+ 
> --command-line="maxcpus=1 root=/dev/sda5 ro debug ignore_loglevel 
> log_buf_len=16M no_console_suspend net.ifnames=0 systemd.log_target=null 
> mem_encrypt=on kvm_amd.sev=1 nr_cpus=1 irqpoll reset_devices vga=normal 
> LANG=en_US.UTF-8 earlyprintk=serial cgroup_disable=memory mce=off numa=off 
> udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug 
> transparent_hugepage=never disable_cpu_apicid=0"
> 
> and the reserved range is:
> 
> [0.00] Reserving 256MB of memory at 3392MB for crashkernel (System 
> RAM: 16271MB)
> 
> I'm wondering if it is related to
> 
> https://lkml.kernel.org/r/20190604134952.GC26891@MiWiFi-R3L-srv
> 
Yes. It should be an SME issue.

Thanks.
Lianbo

> Thx.
> 

Re: The current SME implementation fails kexec/kdump kernel booting.

2019-06-08 Thread lijiang

On 2019-06-06 00:04, Lendacky, Thomas wrote:
> On 6/4/19 7:56 PM, Baoquan He wrote:
>> On 06/04/19 at 03:56pm, Lendacky, Thomas wrote:
>>> On 6/4/19 8:49 AM, Baoquan He wrote:
>>>> Hi Tom,
>>>>
>>>> Lianbo reported kdump kernel can't boot well with 'nokaslr' added, and
>>>> have to enable KASLR in kdump kernel to make it boot successfully. This
>>>> blocked his work on enabling sme for kexec/kdump. And on some machines
>>>> SME kernel can't boot in 1st kernel.
>>>>
>>>> I checked code of SME implementation, and found out the root cause. The
>>>> above failures are caused by SME code, sme_encrypt_kernel(). In
>>>> sme_encrypt_kernel(), you get a 2M of encryption work area as intermediate
>>>> buffer to encrypt kernel in-place. And the work area is just after _end of
>>>> kernel.
>>>
>>> I remember worrying about something like this back when I was testing the
>>> kexec support. I had come up with a patch to address it, but never got the
>>> time to test and submit it.  I've included it here if you'd like to test
>>> it (I haven't done run this patch in quite some time). If it works, we can
>>> think about submitting it.
>>
>> Thanks for your quick response and making this patch, Tom.
>>
>> Tested on a speedway machine, it entered into kernel, but failed in
>> below stage. Tested two times, always happened.
> 
> Is this the initial kernel boot or the kexec kernel boot?
> 
> It looks like this is related to the initrd/initramfs decryption. Not
> sure what could be happening there. I just tried the patch on my Naples
> system and a 5.2.0-rc3 kernel and have been able to repeatedly kexec boot
> a number of times so far.
> 

I used the hacked kexec-tools (by Baoquan) to test it, and the kexec-d kernel
and kdump kernel worked well. But Tom's patch only worked for the kexec-d
kernel; the kdump kernel did not work (it could not boot successfully).
What's the difference between them?

Thanks
Lianbo

> Thanks,
> Tom
> 
>>
>>
>> [4.978521] Freeing unused decrypted memory: 2040K
>> [4.983800] Freeing unused kernel image memory: 2344K
>> [4.988943] Write protecting the kernel read-only data: 18432k
>> [4.995306] Freeing unused kernel image memory: 2012K
>> [5.000488] Freeing unused kernel image memory: 256K
>> [5.005540] Run /init as init process
>> [5.009443] Kernel panic - not syncing: Attempted to kill init! exitcode=0x7f00
>> [5.017230] CPU: 0 PID: 1 Comm: init Not tainted 5.2.0-rc2+ #38
>> [5.023251] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1004B 10/18/2017
>> [5.031299] Call Trace:
>> [5.033793]  dump_stack+0x46/0x60
>> [5.037169]  panic+0xfb/0x2cb
>> [5.040191]  do_exit.cold.21+0x59/0x81
>> [5.044004]  do_group_exit+0x3a/0xa0
>> [5.047640]  __x64_sys_exit_group+0x14/0x20
>> [5.051899]  do_syscall_64+0x55/0x1c0
>> [5.055627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [5.060764] RIP: 0033:0x7fa1b1fc9e2e
>> [5.064404] Code: Bad RIP value.
>> [5.067687] RSP: 002b:7fffc5abb778 EFLAGS: 0202 ORIG_RAX: 00e7
>> [5.075296] RAX: ffda RBX: 7fa1b1fd2528 RCX: 7fa1b1fc9e2e
>> [5.082625] RDX: 007f RSI: 003c RDI: 007f
>> [5.089879] RBP: 7fa1b21d8d00 R08: 00e7 R09: 7fffc5abb688
>> [5.097134] R10:  R11: 0202 R12: 0002
>> [5.104386] R13: 0001 R14: 7fa1b21d8d40 R15: 7fa1b21d8d30
>> [5.111645] Kernel Offset: disabled
>> [5.423002] Rebooting in 10 seconds..
>> [   15.429641] ACPI MEMORY or I/O RESET_REG.
>>

Re: [PATCH 0/3 v11] add reserved e820 ranges to the kdump kernel e820 table

2019-05-28 Thread lijiang

Hi, Boris and Thomas

Could you give me any suggestions about this patch series? Other reviewers?

Thanks.
Lianbo

On 2019-04-23 09:30, Lianbo Jiang wrote:
> This patchset did three things:
> 
> a). x86/e820, resource: add a new I/O resource descriptor
> 'IORES_DESC_RESERVED'
> 
> b). x86/mm: change the check condition in SEV because a new descriptor is
> introduced
> 
> c). x86/kexec_file: add reserved e820 ranges to kdump kernel e820 table
> 
> Changes since v1:
> 1. Modified the value of flags to "0", when walking through the whole
> tree for e820 reserved ranges.
> 
> Changes since v2:
> 1. Modified the value of flags to "0", when walking through the whole
> tree for e820 reserved ranges.
> 2. Modified the invalid SOB chain issue.
> 
> Changes since v3:
> 1. Dropped [PATCH 1/3 v3] resource: fix an error which walks through iomem
>resources. Please refer to this commit <010a93bf97c7> "resource: Fix
>find_next_iomem_res() iteration issue"
> 
> Changes since v4:
> 1. Improve the patch log, and add kernel log.
> 
> Changes since v5:
> 1. Rewrite these patches log.
> 
> Changes since v6:
> 1. Modify the [PATCH 1/2], and add the new I/O resource descriptor
>'IORES_DESC_RESERVED' for the iomem resources search interfaces,
>and also update the code related to 'IORES_DESC_NONE'.
> 2. Modify the [PATCH 2/2], and walk through io resource based on the
>new descriptor 'IORES_DESC_RESERVED'.
> 3. Update patch log.
> 
> Changes since v7:
> 1. Improve patch log.
> 2. Improve this function __ioremap_check_desc_other().
> 3. Modify code comment in the __ioremap_check_desc_other()
> 
> Changes since v8:
> 1. Get rid of all changes about ia64.(Borislav's suggestion)
> 2. Change the examination condition to the 'IORES_DESC_ACPI_*'.
> 3. Modify the signature. This patch(add the new I/O resource
>descriptor 'IORES_DESC_RESERVED') was suggested by Boris.
> 
> Changes since v9:
> 1. Improve patch log.
> 2. No need to modify the kernel/resource.c, so correct them.
> 3. Change the name of the __ioremap_check_desc_other() to
>__ioremap_check_desc_none_and_reserved(), and modify the
>check condition, add comment above it.
> 
> Changes since v10:
> 1. Split them into three patches, the second patch is currently added.
> 2. Change struct ioremap_mem_flags to struct ioremap_desc and redefine
> it.
> 3. Change the name of the __ioremap_check_desc_other() to
> __ioremap_check_desc().
> 4. Change the check condition in SEV and also improve them.
> 5. Modify the return value for some functions.
> 
> Lianbo Jiang (3):
>   x86/e820, resource: add a new I/O resource descriptor
> 'IORES_DESC_RESERVED'
>   x86/mm: change the check condition in SEV because a new descriptor is
> introduced
>   x86/kexec_file: add reserved e820 ranges to kdump kernel e820 table
> 
>  arch/x86/kernel/crash.c |  6 +
>  arch/x86/kernel/e820.c  |  2 +-
>  arch/x86/mm/ioremap.c   | 59 ++---
>  include/linux/ioport.h  | 10 +++
>  4 files changed, 54 insertions(+), 23 deletions(-)
>

Re: [PATCH 2/3 v3] x86/kexec: Set the C-bit in the identity map page table when SEV is active

2019-05-16 Thread lijiang

On 2019-05-16 16:15, Boris Petkov wrote:
> On May 16, 2019 3:12:26 AM GMT+02:00, lijiang wrote:
>> OK, I will modify it according to your suggestion and post again.
> 
> No need - I fixed it up already.
> 
OK, thank you very much.

Lianbo

Re: [PATCH 2/3 v3] x86/kexec: Set the C-bit in the identity map page table when SEV is active

2019-05-15 Thread lijiang

On 2019-05-15 21:30, Borislav Petkov wrote:
> On Tue, Apr 30, 2019 at 03:44:20PM +0800, Lianbo Jiang wrote:
>> When SEV is active, the second kernel image is loaded into the
>> encrypted memory. Let's make sure that when kexec builds the
>> identity mapping page table it adds the memory encryption mask(C-bit).
>>
>> Co-developed-by: Brijesh Singh 
>> Signed-off-by: Brijesh Singh 
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  arch/x86/kernel/machine_kexec_64.c | 12 +++-
>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> index f60611531d17..11fe352f7344 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -56,6 +56,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>>  pte_t *pte;
>>  unsigned long vaddr, paddr;
>>  int result = -ENOMEM;
>> +pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
>>  
>>  vaddr = (unsigned long)relocate_kernel;
>>  paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
>> @@ -92,7 +93,11 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>>  set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
>>  }
>>  pte = pte_offset_kernel(pmd, vaddr);
>> -set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL_EXEC_NOENC));
>> +
>> +if (sev_active())
>> +prot = PAGE_KERNEL_EXEC;
>> +
>> +set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
>>  return 0;
>>  err:
>>  return result;
>> @@ -129,6 +134,11 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
>>  level4p = (pgd_t *)__va(start_pgtable);
>>  clear_page(level4p);
>>  
>> +if (sev_active()) {
>> +info.page_flag |= _PAGE_ENC;
>> +info.kernpg_flag = _KERNPG_TABLE;
> 
> kernpg_flag above is initialized to _KERNPG_TABLE_NOENC so you can do here
> 
>   info.kernpg_flag |= _PAGE_ENC;
> 
> too, to make it even more clear what this does, right?
> 
OK, I will modify it according to your suggestion and post again.

Thanks.
Lianbo

> IOW:
> 
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 783ce5184405..16c37fe489bc 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -135,8 +135,8 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
> clear_page(level4p);
>  
> if (sev_active()) {
> -   info.page_flag |= _PAGE_ENC;
> -   info.kernpg_flag = _KERNPG_TABLE;
> +   info.page_flag   |= _PAGE_ENC;
> +   info.kernpg_flag |= _PAGE_ENC;
> }
>  
> if (direct_gbpages)
> 
> 
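The review suggestion above relies on _KERNPG_TABLE being _KERNPG_TABLE_NOENC
with the encryption bit added, so OR-ing _PAGE_ENC into a field that starts as
_KERNPG_TABLE_NOENC should give the same result as assigning _KERNPG_TABLE. A
minimal user-space sketch of that equivalence follows. Everything prefixed
SKETCH_/sketch_ is hypothetical, and the bit values (including C-bit position
47) are placeholders chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins only; these are NOT the kernel's real values. */
#define SKETCH_PAGE_ENC           (UINT64_C(1) << 47) /* assumed C-bit position */
#define SKETCH_KERNPG_TABLE_NOENC UINT64_C(0x63)      /* present|rw|accessed|dirty */
#define SKETCH_KERNPG_TABLE       (SKETCH_KERNPG_TABLE_NOENC | SKETCH_PAGE_ENC)

/* Hypothetical reduction of the mapping info to the two fields discussed. */
struct sketch_mapping_info {
	uint64_t page_flag;
	uint64_t kernpg_flag;
};

/* Variant from the v3 patch: overwrite kernpg_flag with the encrypted value. */
static void sketch_set_enc_assign(struct sketch_mapping_info *info)
{
	info->page_flag  |= SKETCH_PAGE_ENC;
	info->kernpg_flag = SKETCH_KERNPG_TABLE;
}

/* Variant suggested in review: OR the C-bit into both fields. */
static void sketch_set_enc_or(struct sketch_mapping_info *info)
{
	info->page_flag   |= SKETCH_PAGE_ENC;
	info->kernpg_flag |= SKETCH_PAGE_ENC;
}
```

Both variants agree only as long as kernpg_flag really starts out as the
NOENC value; the OR form simply makes the intent ("add the C-bit") visible.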


Re: [PATCH] kexec/x86: Unconditionally add the acpi_rsdp command line

2019-05-13 Thread lijiang

On 2019-05-13 14:40, Kairui Song wrote:
> On Fri, Mar 15, 2019 at 5:36 PM Lianbo Jiang  wrote:
>>
>> The Linux kernel commit 3a63f70bf4c3 introduces the early parsing
>> of the RSDP. This means that boot loader must either set the
>> boot_params.acpi_rsdp_addr or pass a command line 'acpi_rsdp=xxx'
>> to tell the RSDP physical address.
>>
>> Currently, kexec neither sets the boot_params.acpi_rsdp_addr nor passes
>> the acpi_rsdp command line if it sees that the first kernel supports EFI
>> runtime. This is causing the second kernel boot failure.
>> The EFI runtime is not available so early in the boot process so
>> unconditionally pass the 'acpi_rsdp=xxx' to the second kernel.
>>
>> Signed-off-by: Lianbo Jiang 
>> Signed-off-by: Brijesh Singh 
>> ---
>>  kexec/arch/i386/crashdump-x86.c | 17 +
>>  1 file changed, 1 insertion(+), 16 deletions(-)
>>
>> diff --git a/kexec/arch/i386/crashdump-x86.c b/kexec/arch/i386/crashdump-x86.c
>> index 140f45b..a29b15b 100644
>> --- a/kexec/arch/i386/crashdump-x86.c
>> +++ b/kexec/arch/i386/crashdump-x86.c
>> @@ -35,7 +35,6 @@
>>  #include 
>>  #include 
>>  #include 
>> -#include 
>>  #include "../../kexec.h"
>>  #include "../../kexec-elf.h"
>>  #include "../../kexec-syscall.h"
>> @@ -772,18 +771,6 @@ static enum coretype get_core_type(struct crash_elf_info *elf_info,
>> }
>>  }
>>
>> -static int sysfs_efi_runtime_map_exist(void)
>> -{
>> -   DIR *dir;
>> -
>> -   dir = opendir("/sys/firmware/efi/runtime-map");
>> -   if (!dir)
>> -   return 0;
>> -
>> -   closedir(dir);
>> -   return 1;
>> -}
>> -
>>  /* Appends 'acpi_rsdp=' commandline for efi boot crash dump */
>>  static void cmdline_add_efi(char *cmdline)
>>  {
>> @@ -978,9 +965,7 @@ int load_crashdump_segments(struct kexec_info *info, char* mod_cmdline,
>> dbgprintf("Created elf header segment at 0x%lx\n", elfcorehdr);
>> if (delete_memmap(memmap_p, _memmap, elfcorehdr, memsz) < 0)
>> return -1;
>> -   if (!bzImage_support_efi_boot || arch_options.noefi ||
>> -   !sysfs_efi_runtime_map_exist())
>> -   cmdline_add_efi(mod_cmdline);
>> +   cmdline_add_efi(mod_cmdline);
>> cmdline_add_elfcorehdr(mod_cmdline, elfcorehdr);
>>
>> /* Inform second kernel about the presence of ACPI tables. */
>> --
>> 2.17.1
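For readers unfamiliar with what "passing 'acpi_rsdp=xxx'" amounts to, the
following is a rough user-space sketch of appending such a parameter to a
command-line buffer with a bounds check. It is not the kexec-tools
implementation: sketch_cmdline_add_rsdp() and SKETCH_CMDLINE_SIZE are
hypothetical names, and the RSDP address is simply taken as an argument here
rather than discovered from firmware tables.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_CMDLINE_SIZE 2048 /* assumed buffer limit for this sketch */

/*
 * Append " acpi_rsdp=0x<addr>" to the NUL-terminated string in cmdline.
 * Returns 0 on success, or -1 (leaving cmdline untouched) if the result
 * would not fit into a buffer of `size` bytes.
 */
static int sketch_cmdline_add_rsdp(char *cmdline, size_t size,
				   unsigned long long rsdp_addr)
{
	char param[64];
	size_t used = strlen(cmdline);
	int n;

	n = snprintf(param, sizeof(param), " acpi_rsdp=0x%llx", rsdp_addr);
	if (n < 0 || (size_t)n >= sizeof(param))
		return -1;

	/* Need room for the existing text, the parameter and the final NUL. */
	if (used + (size_t)n + 1 > size)
		return -1;

	memcpy(cmdline + used, param, (size_t)n + 1);
	return 0;
}
```

The patch above is about when this step runs (unconditionally, instead of
only when EFI boot support is missing), not about how the string is built.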
>>
>>
> 
> Hi Lianbo,
> 
> I've sent another patch similar to yours:
> [PATCH] x86: Always try to fill acpi_rsdp_addr in boot params
> 
> I'll update V2 and your use case should also be covered in that patch,
> as we discussed on IRC previously, thanks!

OK. I noticed that RSDP parsing was disabled in the upstream kernel. Please
refer to the commit with the following subject:

"x86/boot: Disable RSDP parsing temporarily"

So this patch is no longer needed in this case. Please ignore it.

Thanks.
Lianbo
> 
> --
> Best Regards,
> Kairui Song
> 


Re: [PATCH 1/3 v2] x86/kexec: Do not map the kexec area as decrypted when SEV is active

2019-04-27 Thread lijiang

On 2019-04-26 21:02, Borislav Petkov wrote:
> On Fri, Apr 26, 2019 at 09:59:54AM +0800, lijiang wrote:
>> Hope this helps. Thanks.
> 
> It does help, yes. When this explanation above is part of the commit
> message, it helps immensely!

OK, I will add it to the commit message and post again.

> 
> :-)
> 
>> So sorry for the delay, I am trying my best to explain it in detail.
> 
> I don't care about the delay as long as the commit messages properly
> explain why the change is needed.
> 
> So thanks for doing that.
> 

It's my pleasure. Thanks.

Lianbo


Re: [PATCH 1/3 v2] x86/kexec: Do not map the kexec area as decrypted when SEV is active

2019-04-25 Thread lijiang

On 2019-04-02 18:32, Borislav Petkov wrote:
> On Wed, Mar 27, 2019 at 01:36:27PM +0800, Lianbo Jiang wrote:
>> Currently, the arch_kexec_post_{alloc,free}_pages() unconditionally
>> maps the kexec area as decrypted. This works fine when SME is active.
>> Because in SME, the first kernel is loaded in decrypted area by the
>> BIOS, so the second kernel must be also loaded into the decrypted
>> memory.
>>
>> When SEV is active, the first kernel is loaded into the encrypted
>> area, so the second kernel must be also loaded into the encrypted
>> memory. Let's make sure that arch_kexec_post_{alloc,free}_pages()
>> does not clear the memory encryption mask from the kexec area when
>> SEV is active.
> 
> This commit message still doesn't explain the big picture why you want
> this change.
> 

When a virtual machine panics, we also need to dump its memory for analysis.
But for an SEV virtual machine, the memory is encrypted. These changes are
necessary to support SEV kdump; otherwise, it will not work.

Let's consider the following situations:

[1] How to load the images (kernel and initrd) when SEV is enabled in the
first kernel?

Based on amd-memory-encryption.txt and SEV's patch series, the boot
images must be encrypted before the guest (VM) can be booted (please see Secure
Encrypted Virtualization Key Management, 'Launching a guest (usage flow)').
Naturally, a similar way is used to load the images (kernel and initrd) into
the crash reserved areas, and these areas are encrypted when SEV is active.

That is to say, when SEV is active in the first kernel, the kernel and
initrd need to be loaded into the encrypted areas, so I made the following
changes:

[a] Do not map the kexec area as decrypted when SEV is active.
Currently, arch_kexec_post_{alloc,free}_pages() unconditionally
maps the kexec areas as decrypted. Obviously, this cannot work for the
SEV case and needs to be improved. Please refer to the first patch
in this patch series.

[b] Set the C-bit in the identity map page table when SEV is active.
Because the second kernel images (kernel and initrd) are loaded into the
encrypted areas, the C-bit needs to be set in the identity mapping page
table when kexec builds it, in order to access these encrypted memory
pages correctly.

[2] How to dump the old memory in the second kernel?

Here, it is similar to SME kdump: if SEV was enabled in the first kernel,
the old memory is also encrypted, so the old memory has to be remapped with
the memory encryption mask in order to access it properly.

[a] ioremap_encrypted() is still necessary.
It is used to remap the old memory with the memory encryption mask.

[b] Enable dumping encrypted memory when SEV was active.
Because the whole memory is encrypted in the first kernel when SEV is
enabled, the notes and elfcorehdr are also encrypted and saved to the
encrypted memory. Following commit 992b649a3f01 ("kdump, proc/vmcore:
Enable kdumping encrypted memory with SME enabled"), both the SME and SEV
cases need to be considered and modified correctly. Please refer to the
third patch in this patch series.
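The cases above can be condensed into a small decision table. The sketch
below is only a summary of this explanation in code form, under the
assumption that the three questions (decrypt the kexec area? set the C-bit
in the identity map? remap old memory as encrypted?) are independent
booleans. All sketch_* names are hypothetical and none of this is kernel
code.

```c
#include <assert.h>
#include <stdbool.h>

/* Which memory encryption feature is active in the first kernel. */
enum sketch_mem_enc { SKETCH_ENC_NONE, SKETCH_ENC_SME, SKETCH_ENC_SEV };

struct sketch_kdump_plan {
	bool decrypt_kexec_area;     /* clear the C-bit on the kexec area     */
	bool cbit_in_identity_map;   /* set the C-bit in the identity mapping */
	bool remap_oldmem_encrypted; /* second kernel reads old memory through
				      * a mapping that carries the C-bit       */
};

static struct sketch_kdump_plan sketch_kdump_plan_for(enum sketch_mem_enc mode)
{
	struct sketch_kdump_plan plan = { false, false, false };

	switch (mode) {
	case SKETCH_ENC_SME:
		/* Second kernel is loaded into decrypted memory. */
		plan.decrypt_kexec_area = true;
		plan.remap_oldmem_encrypted = true;
		break;
	case SKETCH_ENC_SEV:
		/* Images stay in encrypted memory, so the mapping needs the C-bit. */
		plan.cbit_in_identity_map = true;
		plan.remap_oldmem_encrypted = true;
		break;
	case SKETCH_ENC_NONE:
	default:
		break;
	}
	return plan;
}
```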

Hope this helps. Thanks.

> And it must explain it because it might be all clear in your head now
> but months from now, you, we, all would've forgotten why this change was
> needed.
> 
> So pls add blurb that this whole effort is being done so that SEV VMs
> can kdump too. I.e., the 1ft picture.
> 
> Anyone must be able to figure out *why* a change has been done just by
> doing git archeology. So make sure you explain it properly.
> 
> If unsure, try to put yourself in the shoes of some future kernel
> developer who is trying to find out why this change has been done. Now
> read the commit message you've written. Does it make any sense to him? I
> think not.
> 
> Do you catch my drift?
> 

Yes, understood, thank you.

So sorry for the delay, I am trying my best to explain it in detail.

Thanks.
Lianbo


Re: [PATCH 1/2 RESEND v10] x86/mm, resource: add a new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-04-17 Thread lijiang

On 2019-04-15 23:41, Borislav Petkov wrote:
> On Mon, Apr 15, 2019 at 08:22:22PM +0800, lijiang wrote:
>> They are different problems.
> 
> Aha, so we're getting closer. You should've led with that!
> 
>> The first problem is that passes the e820 reserved ranges to the second kernel,
> 
> Passes or *doesn't* pass?
> 
> Because from all the staring, it wants to pass the reserved ranges.
> 
>> for this case, it is good enough to use the IORES_DESC_RESERVED, which
>> can ensure that exactly matches the reserved resource ranges when
>> walking through iomem resources.
> 
> Ok.
> 
>> The second problem is about the SEV case. Now, the IORES_DESC_RESERVED has
>> been created for the reserved areas, therefore the check needs to be
>> expanded so that these areas are not mapped encrypted when using ioremap().
>>
>> +static int __ioremap_check_desc_none_and_reserved(struct resource *res)
> 
> That name is crap. If you need to add another desc type, it becomes
> wrong again. And that whole code around flags->desc_other is just silly:
> 
> Make that machinery around it something like this:
> 
> struct ioremap_desc {
> u64 flags;
> };
> 
> instead of "struct ioremap_mem_flags" and that struct ioremap_desc is an
> ioremap descriptor which will carry all kinds of settings. system_ram
> can then be a simple flag too.
> 
> __ioremap_caller() will hand it down to __ioremap_check_mem() etc
> and there it will set flags like IOREMAP_DESC_MAP_ENCRYPTED or
> IOREMAP_DESC_MAP_DECRYPTED and this way you'll have it explicit and
> clear in __ioremap_caller():
> 
> if ((sev_active() &&
>   (io_desc.flags & IOREMAP_DESC_MAP_ENCRYPTED)) ||
>   encrypted)
> prot = pgprot_encrypted(prot);
> 
> But that would need a pre-patch which does that conversion.
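The descriptor-with-flags idea proposed above can be tried out in user space.
The following is a minimal sketch under stated assumptions: the resource
model is reduced to a descriptor enum plus a RAM marker, the flag names are
modelled on the ones proposed in the review, and all SKETCH_/sketch_
identifiers are hypothetical. It is not the code that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flags carried by the ioremap descriptor (names modelled on the proposal). */
#define SKETCH_DESC_MAP_ENCRYPTED (UINT64_C(1) << 0)
#define SKETCH_DESC_SYSTEM_RAM    (UINT64_C(1) << 1)

/* Reduced stand-in for the I/O resource descriptor types. */
enum sketch_iores_desc {
	SKETCH_IORES_DESC_NONE,
	SKETCH_IORES_DESC_ACPI_TABLES,
	SKETCH_IORES_DESC_RESERVED,
};

struct sketch_resource {
	enum sketch_iores_desc desc;
	bool is_system_ram;
};

struct sketch_ioremap_desc {
	uint64_t flags;
};

/* Accumulate flags for one resource overlapping the range being remapped. */
static void sketch_res_check(const struct sketch_resource *res,
			     struct sketch_ioremap_desc *desc)
{
	if (res->is_system_ram)
		desc->flags |= SKETCH_DESC_SYSTEM_RAM;

	/* Neither "none" nor e820-reserved areas are mapped encrypted. */
	if (res->desc != SKETCH_IORES_DESC_NONE &&
	    res->desc != SKETCH_IORES_DESC_RESERVED)
		desc->flags |= SKETCH_DESC_MAP_ENCRYPTED;
}

/* The explicit decision the caller makes once the flags are collected. */
static bool sketch_map_encrypted(bool sev_active,
				 const struct sketch_ioremap_desc *desc,
				 bool encrypted)
{
	return (sev_active && (desc->flags & SKETCH_DESC_MAP_ENCRYPTED)) ||
	       encrypted;
}
```

With this shape, adding another descriptor type later only changes
sketch_res_check(); the condition in the caller stays the same.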
> 
Thanks for your comment.

Based on the above description, I made a draft patch; please take a look. It
seems, however, that the code has changed a lot.

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 0029604af8a4..04217b61635e 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -27,9 +27,8 @@
 
 #include "physaddr.h"
 
-struct ioremap_mem_flags {
-   bool system_ram;
-   bool desc_other;
+struct ioremap_desc {
+   u64 flags;
 };
 
 /*
@@ -61,13 +60,13 @@ int ioremap_change_attr(unsigned long vaddr, unsigned long size,
return err;
 }
 
-static bool __ioremap_check_ram(struct resource *res)
+static unsigned long __ioremap_check_ram(struct resource *res)
 {
unsigned long start_pfn, stop_pfn;
unsigned long i;
 
if ((res->flags & IORESOURCE_SYSTEM_RAM) != IORESOURCE_SYSTEM_RAM)
-   return false;
+   return IOREMAP_DESC_MAP_NONE;
 
start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
stop_pfn = (res->end + 1) >> PAGE_SHIFT;
@@ -75,28 +74,44 @@ static bool __ioremap_check_ram(struct resource *res)
for (i = 0; i < (stop_pfn - start_pfn); ++i)
if (pfn_valid(start_pfn + i) &&
!PageReserved(pfn_to_page(start_pfn + i)))
-   return true;
+   return IOREMAP_DESC_MAP_SYSTEM_RAM_USING;
}
 
-   return false;
+   return IOREMAP_DESC_MAP_NONE;
 }
 
-static int __ioremap_check_desc_other(struct resource *res)
+/*
+ * Originally, these areas described as IORES_DESC_NONE are not mapped
+ * as encrypted when using ioremap(), for example, E820_TYPE_{RESERVED,
+ * RESERVED_KERN,RAM,UNUSABLE}, etc. It checks for a resource that is
+ * not described as IORES_DESC_NONE, which can make sure the reserved
+ * areas are not mapped as encrypted when using ioremap().
+ *
+ * Now IORES_DESC_RESERVED has been created for the reserved areas so
+ * the check needs to be expanded so that these areas are not mapped
+ * encrypted when using ioremap().
+ */
+static unsigned long __ioremap_check_desc(struct resource *res)
 {
-   return (res->desc != IORES_DESC_NONE);
+   if ((res->desc != IORES_DESC_NONE) &&
+   (res->desc != IORES_DESC_RESERVED))
+   return IOREMAP_DESC_MAP_ENCRYPTED;
+
+   return IOREMAP_DESC_MAP_NONE;
 }
 
 static int __ioremap_res_check(struct resource *res, void *arg)
 {
-   struct ioremap_mem_flags *flags = arg;
+   struct ioremap_desc *desc = arg;
 
-   if (!flags->system_ram)
-   flags->system_ram = __ioremap_check_ram(res);
+   if (!(desc->flags & IOREMAP_DESC_MAP_SYSTEM_RAM_USING))
+   desc->flags |= __ioremap_check_ram(res);
 
-   if (!flags->desc_other)
-   flags->desc_other =

Re: [PATCH 1/2 RESEND v10] x86/mm, resource: add a new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-04-15 Thread lijiang

On 2019-04-02 20:43, Borislav Petkov wrote:
> On Tue, Apr 02, 2019 at 08:02:04PM +0800, lijiang wrote:
>> These regions(E820_TYPE_{RESERVED_KERN,RAM,UNUSABLE}) are still marked as
>> IORES_DESC_NONE and should not be mapped encrypted when using ioremap().
> 
> Seems to me like we're going in circles. You said here:
> 
> https://lkml.kernel.org/r/9eb61523-7a08-24c4-ac15-050537bd9...@redhat.com
> 
> that the kernel doesn't pass the e820 reserved ranges to the second
> kernel.
> 
> I suggested to use a special IORES descriptor for them -
> IORES_DESC_RESERVED.
> 
> Now you say that that is not enough and some of those you want passed,
> are still marked as IORES_DESC_NONE.
> 
Sorry for the delay.

They are different problems.

The first problem is that passes the e820 reserved ranges to the second kernel,
for this case, it is good enough to use the IORES_DESC_RESERVED, which can
ensure that exactly matches the reserved resource ranges when walking through
iomem resources.

The second problem is about the SEV case. Now, the IORES_DESC_RESERVED has been
created for the reserved areas, therefore the check needs to be expanded so that
these areas are not mapped encrypted when using ioremap().

+static int __ioremap_check_desc_none_and_reserved(struct resource *res)
 {
-   return (res->desc != IORES_DESC_NONE);
+   return ((res->desc != IORES_DESC_NONE) &&
+   (res->desc != IORES_DESC_RESERVED));
 }

Maybe I should split it into two patches. The change to
__ioremap_check_desc_none_and_reserved() should be a separate patch. Any ideas?

Thanks.
Lianbo

> Sounds to me like you need to try again.
>


Re: [PATCH 1/2 RESEND v10] x86/mm, resource: add a new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-04-02 Thread lijiang

On 2019-04-02 17:06, Borislav Petkov wrote:
> On Fri, Mar 29, 2019 at 08:39:13PM +0800, Lianbo Jiang wrote:
>> -static int __ioremap_check_desc_other(struct resource *res)
>> +/*
>> + * Originally, these areas described as IORES_DESC_NONE are not mapped
>> + * as encrypted when using ioremap(), for example, E820_TYPE_{RESERVED,
>> + * RESERVED_KERN,RAM,UNUSABLE}, etc. It checks for a resource that is
>> + * not described as IORES_DESC_NONE, which can make sure the reserved
>> + * areas are not mapped as encrypted when using ioremap().
>> + *
>> + * Now IORES_DESC_RESERVED has been created for the reserved areas so
>> + * the check needs to be expanded so that these areas are not mapped
>> + * encrypted when using ioremap().
>> + */
>> +static int __ioremap_check_desc_none_and_reserved(struct resource *res)
>>  {
>> -return (res->desc != IORES_DESC_NONE);
>> +return ((res->desc != IORES_DESC_NONE) &&
> 
> Why is this still checking IORES_DESC_NONE when the idea is to have this
> specific IORES_DESC_RESERVED for all marked as *reserved* regions in
> e820 which should not be mapped encrypted?
> 
> IOW, which regions are still marked as IORES_DESC_NONE and should not be
> mapped encrypted?
> 
Thanks for your comment.

These regions(E820_TYPE_{RESERVED_KERN,RAM,UNUSABLE}) are still marked as
IORES_DESC_NONE and should not be mapped encrypted when using ioremap().
Please refer to the following function.

static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
{
	switch (entry->type) {
	case E820_TYPE_ACPI:		return IORES_DESC_ACPI_TABLES;
	case E820_TYPE_NVS:		return IORES_DESC_ACPI_NV_STORAGE;
	case E820_TYPE_PMEM:		return IORES_DESC_PERSISTENT_MEMORY;
	case E820_TYPE_PRAM:		return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
	case E820_TYPE_RESERVED:	return IORES_DESC_RESERVED;
	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
	case E820_TYPE_RAM:		/* Fall-through: */
	case E820_TYPE_UNUSABLE:	/* Fall-through: */
	default:			return IORES_DESC_NONE;
	}
}
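To make the ambiguity concrete, here is a small user-space model of this
mapping before and after the patch. It counts how many e820 types end up
under a given descriptor: before the patch, searching by "none" matches four
types, so the reserved ranges cannot be singled out; afterwards "reserved"
matches exactly one. All SKETCH_/sketch_ names are hypothetical stand-ins
for the kernel enums.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_e820_type {
	SKETCH_E820_ACPI,
	SKETCH_E820_NVS,
	SKETCH_E820_PMEM,
	SKETCH_E820_PRAM,
	SKETCH_E820_RESERVED,
	SKETCH_E820_RESERVED_KERN,
	SKETCH_E820_RAM,
	SKETCH_E820_UNUSABLE,
	SKETCH_E820_TYPE_COUNT
};

enum sketch_desc {
	SKETCH_DESC_NONE,
	SKETCH_DESC_ACPI_TABLES,
	SKETCH_DESC_ACPI_NV_STORAGE,
	SKETCH_DESC_PERSISTENT_MEMORY,
	SKETCH_DESC_PERSISTENT_MEMORY_LEGACY,
	SKETCH_DESC_RESERVED
};

/* Model of the type-to-descriptor mapping; `patched` selects the new rule. */
static enum sketch_desc sketch_type_to_desc(enum sketch_e820_type type,
					    bool patched)
{
	switch (type) {
	case SKETCH_E820_ACPI:     return SKETCH_DESC_ACPI_TABLES;
	case SKETCH_E820_NVS:      return SKETCH_DESC_ACPI_NV_STORAGE;
	case SKETCH_E820_PMEM:     return SKETCH_DESC_PERSISTENT_MEMORY;
	case SKETCH_E820_PRAM:     return SKETCH_DESC_PERSISTENT_MEMORY_LEGACY;
	case SKETCH_E820_RESERVED:
		return patched ? SKETCH_DESC_RESERVED : SKETCH_DESC_NONE;
	default:                   return SKETCH_DESC_NONE;
	}
}

/* How many e820 types would a search for `want` match? */
static int sketch_count_matches(enum sketch_desc want, bool patched)
{
	int type, count = 0;

	for (type = 0; type < SKETCH_E820_TYPE_COUNT; type++)
		if (sketch_type_to_desc((enum sketch_e820_type)type, patched) == want)
			count++;
	return count;
}
```

This also shows why the ioremap check has to mention both descriptors after
the patch: the reserved type has moved out of the "none" bucket.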


Thanks.
Lianbo


Re: [PATCH 1/2 v10] x86/mm, resource: add a new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-03-29 Thread lijiang

On 2019-03-29 18:39, Borislav Petkov wrote:
> On Fri, Mar 29, 2019 at 02:56:48PM +0800, Lianbo Jiang wrote:
>> When doing kexec_file_load(), the first kernel needs to pass the e820
>> reserved ranges to the second kernel, because some devices may use it
>> in kdump kernel, such as PCI devices.
>>
>> But, the kernel can not exactly match the e820 reserved ranges when
>> walking through the iomem resources via the 'IORES_DESC_NONE', because
>> there are several types of e820 that are described as the 'IORES_DESC_NONE'
>> type. Please refer to the e820_type_to_iores_desc().
>>
>> Therefore, add a new I/O resource descriptor 'IORES_DESC_RESERVED' for
>> the iomem resources search interfaces. It is helpful to exactly match
>> the reserved resource ranges when walking through iomem resources.
>>
>> In addition, since the new descriptor 'IORES_DESC_RESERVED' has been
>> created for the reserved areas, the code originally related to the
>> descriptor 'IORES_DESC_NONE' also need to be updated.
>>
>> Suggested-by: Borislav Petkov 
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  arch/x86/kernel/e820.c |  2 +-
>>  arch/x86/mm/ioremap.c  | 16 ++--
>>  include/linux/ioport.h |  1 +
>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> index 2879e234e193..16fcde196243 100644
>> --- a/arch/x86/kernel/e820.c
>> +++ b/arch/x86/kernel/e820.c
>> @@ -1050,10 +1050,10 @@ static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
>>  case E820_TYPE_NVS: return IORES_DESC_ACPI_NV_STORAGE;
>>  case E820_TYPE_PMEM:return IORES_DESC_PERSISTENT_MEMORY;
>>  case E820_TYPE_PRAM:return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
>> +case E820_TYPE_RESERVED:return IORES_DESC_RESERVED;
>>  case E820_TYPE_RESERVED_KERN:   /* Fall-through: */
>>  case E820_TYPE_RAM: /* Fall-through: */
>>  case E820_TYPE_UNUSABLE:/* Fall-through: */
>> -case E820_TYPE_RESERVED:/* Fall-through: */
>>  default:return IORES_DESC_NONE;
>>  }
>>  }
>> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
>> index 0029604af8a4..5671ec24df49 100644
>> --- a/arch/x86/mm/ioremap.c
>> +++ b/arch/x86/mm/ioremap.c
>> @@ -81,9 +81,21 @@ static bool __ioremap_check_ram(struct resource *res)
>>  return false;
>>  }
>>  
>> -static int __ioremap_check_desc_other(struct resource *res)
> 
> I can see this patch doesn't build even without applying and building
> it.
> 
> How about you build-test your stuff before submitting?
> 
Oh, my God. I made a mistake when I copied the code from another machine.
I will correct this issue and resend the patch v10.

Thanks.
Lianbo


Re: [PATCH 2/3 v9] resource: add the new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-03-28 Thread lijiang

On 2019-03-25 20:24, Borislav Petkov wrote:
> On Mon, Mar 25, 2019 at 02:53:02PM +0800, lijiang wrote:
>> In this function, i printed its values, and only got the value of reserved
>> type, so i changed the IORES_DESC_NONE to the IORES_DESC_RESERVED.
>>
>> In addition, after the new descriptor 'IORES_DESC_RESERVED' is introduced,
>> the IORES_DESC_NONE does not include the IORES_DESC_RESERVED any more, it
>> could miss to handle the value of the reserved type.
> 
> Yes, IORES_DESC_RESERVED is supposed to denote the e820 reserved type.
> Why should IORES_DESC_NONE include it ?!?!
> 
> IORES_DESC_NONE is, well, an invalid, i.e., "none" type:

Yes, I see. That indicates an empty area, or "void" type.

> 
> /*
>  * I/O Resource Descriptors
>  *
>  * Descriptors are used by walk_iomem_res_desc() and region_intersects()
>  * for searching a specific resource range in the iomem table.  Assign
>  * a new descriptor when a resource range supports the search interfaces.
>  * Otherwise, resource.desc must be set to IORES_DESC_NONE (0).
>  */
> 
>> Do you mean i should never touch the three chunks? If i made a mistake, i
>> will remove this changes next post.
> 
> I'm looking at the hunks below and you're changing ->desc assignments in
> some random function which doesn't look like you know what you're doing.
> Maybe it gets you what you want but it sure as hell doesn't look right
> to me.
> 
I have realized that there could be a problem with these changes. Indeed, here
it denotes an empty area instead of a reserved area.

BTW: It looks like the name of this function (__reserve_region_with_split) is a
bit misleading.

Thank you for pointing out this problem, I will correct it in the next post.

Lianbo


Re: [PATCH 1/2 v8] resource: add the new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-03-26 Thread lijiang

On 2019-03-26 03:34, Lendacky, Thomas wrote:
> On 3/16/19 2:31 AM, lijiang wrote:
>>
>>
>> On 2018-12-05 05:33, Lendacky, Thomas wrote:
>>> On 11/29/2018 09:37 PM, Dave Young wrote:
>>>> + more people
>>>>
>>>> On 11/29/18 at 04:09pm, Lianbo Jiang wrote:
>>>>> When doing kexec_file_load, the first kernel needs to pass the e820
>>>>> reserved ranges to the second kernel. But kernel can not exactly
>>>>> match the e820 reserved ranges when walking through the iomem resources
>>>>> with the descriptor 'IORES_DESC_NONE', because several e820 types(
>>>>> e.g. E820_TYPE_RESERVED_KERN/E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820
>>>>> _TYPE_RESERVED) are converted to the descriptor 'IORES_DESC_NONE'. It
>>>>> may pass these four types to the kdump kernel, that is not desired result.
>>>>>
>>>>> So, this patch adds a new I/O resource descriptor 'IORES_DESC_RESERVED'
>>>>> for the iomem resources search interfaces. It is helpful to exactly
>>>>> match the reserved resource ranges when walking through iomem resources.
>>>>>
>>>>> In addition, since the new descriptor 'IORES_DESC_RESERVED' is introduced,
>>>>> these code originally related to the descriptor 'IORES_DESC_NONE' need to
>>>>> be updated. Otherwise, it will be easily confused and also cause some
>>>>> errors. Because the 'E820_TYPE_RESERVED' type is converted to the new
>>>>> descriptor 'IORES_DESC_RESERVED' instead of 'IORES_DESC_NONE', it has been
>>>>> changed.
>>>>>
>>>>> Suggested-by: Dave Young 
>>>>> Signed-off-by: Lianbo Jiang 
>>>>> ---
>>>>>  arch/ia64/kernel/efi.c |  4 
>>>>>  arch/x86/kernel/e820.c |  2 +-
>>>>>  arch/x86/mm/ioremap.c  | 13 -
>>>>>  include/linux/ioport.h |  1 +
>>>>>  kernel/resource.c  |  6 +++---
>>>>>  5 files changed, 21 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
>>>>> index 8f106638913c..1841e9b4db30 100644
>>>>> --- a/arch/ia64/kernel/efi.c
>>>>> +++ b/arch/ia64/kernel/efi.c
>>>>> @@ -1231,6 +1231,10 @@ efi_initialize_iomem_resources(struct resource *code_resource,
>>>>>   break;
>>>>>  
>>>>>   case EFI_RESERVED_TYPE:
>>>>> + name = "reserved";
>>>>
>>>> Ingo updated X86 code to use "Reserved",  I think it would be good to do
>>>> same for this case as well
>>>>
>>>>> + desc = IORES_DESC_RESERVED;
>>>>> + break;
>>>>> +
>>>>>   case EFI_RUNTIME_SERVICES_CODE:
>>>>>   case EFI_RUNTIME_SERVICES_DATA:
>>>>>   case EFI_ACPI_RECLAIM_MEMORY:
>>>>
>>>> Originally, above 3 are all "reserved", so probably they all should be
>>>> IORES_DESC_RESERVED.
>>>>
>>>> Can any IA64 people review this?
>>>>
>>>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>>>> index 50895c2f937d..57fafdafb860 100644
>>>>> --- a/arch/x86/kernel/e820.c
>>>>> +++ b/arch/x86/kernel/e820.c
>>>>> @@ -1048,10 +1048,10 @@ static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
>>>>>   case E820_TYPE_NVS: return IORES_DESC_ACPI_NV_STORAGE;
>>>>>   case E820_TYPE_PMEM:return IORES_DESC_PERSISTENT_MEMORY;
>>>>>   case E820_TYPE_PRAM:return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
>>>>> + case E820_TYPE_RESERVED:return IORES_DESC_RESERVED;
>>>>>   case E820_TYPE_RESERVED_KERN:   /* Fall-through: */
>>>>>   case E820_TYPE_RAM: /* Fall-through: */
>>>>>   case E820_TYPE_UNUSABLE:/* Fall-through: */
>>>>> - case E820_TYPE_RESERVED:/* Fall-through: */
>>>>>   default:return IORES_DESC_NONE;
>>>>>   }
>>>>>  }
>>>>> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
>>>>> index 5378d10f1d31..fea2ef99415d 100644

Re: [PATCH 1/3] kexec: Do not map the kexec area as decrypted when SEV is active

2019-03-25 Thread lijiang

On 2019-03-26 01:32, Borislav Petkov wrote:
> On Mon, Mar 25, 2019 at 05:17:55PM +0000, Singh, Brijesh wrote:
>> By default all the memory regions are mapped encrypted. The
>> set_memory_{encrypt,decrypt}() is a generic function which can be
>> called explicitly to clear/set the encryption mask from the existing
>> memory mapping. The mem_encrypt_active() returns true if either SEV or
>> SME is active. So the __set_memory_enc_dec() uses the
>> mem_encrypt_active() check to ensure that the function is no-op when
>> SME/SEV are not active.
>>
>> Currently, the arch_kexec_post_alloc_pages() unconditionally clears the
>> encryption mask from the kexec area. In case of SEV, we should not clear
>> the encryption mask.
> 
> Brijesh, I know all that.
> 
> Please read what I said here at the end:
> 
> https://lkml.kernel.org/r/20190324150034.gh23...@zn.tnic
> 
> With this change, the code looks like this:
> 
> +   if (sme_active())
> +   return set_memory_decrypted((unsigned long)vaddr, pages);
> 
> now in __set_memory_enc_dec via set_memory_decrypted():
> 
> /* Nothing to do if memory encryption is not active */
> if (!mem_encrypt_active())
> return 0;
> 
> 
> so you have:
> 
>   if (sme_active())
> 
>   ...
> 
>   if (!mem_encrypt_active())
> 
> 
> now maybe this is all clear to you and Tom but I betcha others will get
> confused. Probably something like "well, what should be active now, SME,
> SEV or memory encryption in general"?
> 
> I hope you're catching my drift.
> 
> So if you want to *not* decrypt memory in the SEV case, then doing something
> like this should make it a bit more clear:
> 
> 
>   if (sev_active())
>   return;
> 
>   return set_memory_decrypted((unsigned long)vaddr, pages);
> 
> along with a comment *why* we're checking here.
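The early-return shape suggested above can be exercised outside the kernel.
The sketch below is a hedged illustration only: sketch_set_memory_decrypted()
is a counting stub standing in for the real helper, sev_active is passed in
as a parameter instead of being queried, and returning 0 in the SEV branch
is this sketch's assumption about what "nothing to do" should report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counts how often the (stubbed) decrypt helper is invoked. */
static int sketch_decrypt_calls;

/* Stub for the real helper: records the call and reports success. */
static int sketch_set_memory_decrypted(uintptr_t vaddr, unsigned int pages)
{
	(void)vaddr;
	(void)pages;
	sketch_decrypt_calls++;
	return 0;
}

static int sketch_kexec_post_alloc_pages(bool sev_active, void *vaddr,
					 unsigned int pages)
{
	/*
	 * Under SEV the second kernel is loaded into encrypted memory, so
	 * the kexec area must keep its encryption mask: nothing to do.
	 */
	if (sev_active)
		return 0;

	/* Otherwise (e.g. SME) map the kexec area as decrypted. */
	return sketch_set_memory_decrypted((uintptr_t)vaddr, pages);
}
```

Keeping the SEV check next to the comment answers the "which of SME, SEV or
memory encryption in general is meant here" question at the call site.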
It looks good to me. I will improve this in the next post.

Thank you, everyone.

Lianbo

> 
> But actually, I'd prefer if you had separate wrappers which are called
> for SME and for SEV.
> 
> I'll let Tom chime in too.
> 
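The control flow suggested in the review above can be sketched in user space. This is only an illustration under stated assumptions: sev_enabled, decrypt_calls and the stubbed set_memory_decrypted() are hypothetical stand-ins, the gfp argument of the real arch_kexec_post_alloc_pages() is dropped, and the early return yields 0 because the function returns int.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real kernel symbols. */
bool sev_enabled;   /* true when running as an SEV guest */
int decrypt_calls;  /* how often the stub below was invoked */

bool sev_active(void)
{
	return sev_enabled;
}

/* Stub: only records the call instead of touching page tables. */
int set_memory_decrypted(unsigned long addr, int numpages)
{
	(void)addr;
	(void)numpages;
	decrypt_calls++;
	return 0;
}

/*
 * Simplified version of the hook (the gfp argument is omitted here).
 * Under SEV the kexec area must stay encrypted, so return early and
 * leave the mapping alone; otherwise clear the encryption mask.
 */
int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages)
{
	if (sev_active())
		return 0;

	return set_memory_decrypted((unsigned long)vaddr, (int)pages);
}

/* Usage: the decrypt path is skipped only when SEV is active. */
void example_usage(void)
{
	sev_enabled = true;
	decrypt_calls = 0;
	arch_kexec_post_alloc_pages((void *)0x1000, 4);
	assert(decrypt_calls == 0);

	sev_enabled = false;
	arch_kexec_post_alloc_pages((void *)0x1000, 4);
	assert(decrypt_calls == 1);
}
```

With a single sev_active() check at the call site, the reader no longer has to reason about sme_active() here and mem_encrypt_active() further down the call chain.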

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/3 v9] x86/mm: Change the examination condition to avoid confusion

2019-03-25 Thread lijiang

On 2019-03-25 14:40, Borislav Petkov wrote:
> On Mon, Mar 25, 2019 at 11:11:45AM +0800, lijiang wrote:
>> I mean it needs to find all the value of the 'IORES_DESC_ACPI_*' type.
> 
> A function called __ioremap_check_desc_other() needs to find
> IORES_DESC_ACPI_* types...
> 
> No, still don't know what you're trying to do.

Let's look at the discussion of patch v8; please refer to this link:
https://lkml.org/lkml/2019/3/16/15

I ran a test following Tom's reply, and it showed that his suggestion was
correct: we should change this to check for IORES_DESC_ACPI_* values.

> 
>> As above mentioned, it needs to find all the value of the 'IORES_DESC_ACPI_*'
>> type, so we should explicitly use the 'IORES_DESC_ACPI_*' type as the check
>> condition instead of the 'IORES_DESC_NONE'.
> 
> And now the same question I'm asking you each time: WHY does it need to find
> the ACPI types?
> 

When SEV is enabled and the page being mapped is in memory, we need to ensure
that the memory encryption attribute is also enabled in the resulting mapping.

I believe Tom knows better than me. :-)

Thanks.
Lianbo
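To make the difference between the two check conditions concrete, here is a small user-space sketch. It is an assumption-laden illustration, not the kernel code: the values 5 to 8 follow the ioport.h hunk quoted in this series, the values 0 to 4 are my recollection of the same enum, and the function names desc_other_not_none() and desc_other_acpi() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Descriptor values; 0..4 are assumed, 5..8 follow the quoted hunk. */
enum iores_desc {
	IORES_DESC_NONE                     = 0,
	IORES_DESC_CRASH_KERNEL             = 1,
	IORES_DESC_ACPI_TABLES              = 2,
	IORES_DESC_ACPI_NV_STORAGE          = 3,
	IORES_DESC_PERSISTENT_MEMORY        = 4,
	IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
	IORES_DESC_DEVICE_PRIVATE_MEMORY    = 6,
	IORES_DESC_DEVICE_PUBLIC_MEMORY     = 7,
	IORES_DESC_RESERVED                 = 8,
};

/* Existing condition: anything that is not NONE counts as "other". */
bool desc_other_not_none(enum iores_desc desc)
{
	return desc != IORES_DESC_NONE;
}

/* Explicit condition: only the ACPI descriptors count. */
bool desc_other_acpi(enum iores_desc desc)
{
	return desc == IORES_DESC_ACPI_TABLES ||
	       desc == IORES_DESC_ACPI_NV_STORAGE;
}

/* Usage: a new descriptor flips the first check but not the second. */
void example_usage(void)
{
	assert(desc_other_not_none(IORES_DESC_RESERVED));
	assert(!desc_other_acpi(IORES_DESC_RESERVED));
	assert(desc_other_acpi(IORES_DESC_ACPI_TABLES));
}
```

The point of the explicit form is that adding a descriptor such as IORES_DESC_RESERVED later does not silently change the result for reserved ranges.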


Re: [PATCH 2/3 v9] resource: add the new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-03-25 Thread lijiang

On 2019-03-23 03:28, Borislav Petkov wrote:
> On Thu, Mar 21, 2019 at 06:33:08PM +0800, Lianbo Jiang wrote:
>> When doing kexec_file_load, the first kernel needs to pass the e820
> 
> Please end function names with parentheses.
> 
>> reserved ranges to the second kernel.
> 
> ... because... ?
> 
>> But kernel can not exactly match the e820 reserved ranges
>  ^
>  the
> 
>> when walking through the iomem resources with the descriptor
>> 'IORES_DESC_NONE', because several e820 types( e.g.
>> E820_TYPE_RESERVED_KERN/E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820
>> _TYPE_RESERVED) are converted to the descriptor 'IORES_DESC_NONE'.
>> It may pass these four types to the kdump kernel, that is not desired result.
> 
> Rewrite that sentence.
> 
>> So, this patch adds a new I/O resource descriptor 'IORES_DESC_RESERVED'
> 
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
> 
> Also, do
> 
> $ git grep 'This patch' Documentation/process
> 
> for more details.
> 
>> for the iomem resources search interfaces. It is helpful to exactly
>> match the reserved resource ranges when walking through iomem resources.
>>
>> In addition, since the new descriptor 'IORES_DESC_RESERVED' is introduced,
>> these code originally related to the descriptor 'IORES_DESC_NONE' need to
> 
> "the code"
> 
>> be updated.
> 
>> Otherwise, it will be easily confused and also cause some errors.
> 
> What errors?
> 
>> Because the 'E820_TYPE_RESERVED' type is converted to the new
>> descriptor 'IORES_DESC_RESERVED' instead of 'IORES_DESC_NONE', it has been
>> changed.
> 
> That sentence I cannot parse.

Thanks for your comments. I will improve the patch log in the next post.

> 
>> Suggested-by: Borislav Petkov 
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  arch/x86/kernel/e820.c | 2 +-
>>  include/linux/ioport.h | 1 +
>>  kernel/resource.c  | 6 +++---
>>  3 files changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> index 2879e234e193..16fcde196243 100644
>> --- a/arch/x86/kernel/e820.c
>> +++ b/arch/x86/kernel/e820.c
>> @@ -1050,10 +1050,10 @@ static unsigned long __init 
>> e820_type_to_iores_desc(struct e820_entry *entry)
>>  case E820_TYPE_NVS: return IORES_DESC_ACPI_NV_STORAGE;
>>  case E820_TYPE_PMEM:return IORES_DESC_PERSISTENT_MEMORY;
>>  case E820_TYPE_PRAM:return 
>> IORES_DESC_PERSISTENT_MEMORY_LEGACY;
>> +case E820_TYPE_RESERVED:return IORES_DESC_RESERVED;
>>  case E820_TYPE_RESERVED_KERN:   /* Fall-through: */
>>  case E820_TYPE_RAM: /* Fall-through: */
>>  case E820_TYPE_UNUSABLE:/* Fall-through: */
>> -case E820_TYPE_RESERVED:/* Fall-through: */
>>  default:return IORES_DESC_NONE;
>>  }
>>  }
>> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
>> index da0ebaec25f0..6ed59de48bd5 100644
>> --- a/include/linux/ioport.h
>> +++ b/include/linux/ioport.h
>> @@ -133,6 +133,7 @@ enum {
>>  IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
>>  IORES_DESC_DEVICE_PRIVATE_MEMORY= 6,
>>  IORES_DESC_DEVICE_PUBLIC_MEMORY = 7,
>> +IORES_DESC_RESERVED = 8,
>>  };
>>  
>>  /* helpers to define resources */
> 
> IORES_DESC_RESERVED is supposed to represent E820_TYPE_RESERVED. And if
> that is the case, then all three hunks below look wrong to me. If you
> want to pass E820_TYPE_RESERVED ranges, then do that explicitly.

In this function, I printed the values and only got the reserved type, so I
changed IORES_DESC_NONE to IORES_DESC_RESERVED.

In addition, now that the new descriptor 'IORES_DESC_RESERVED' is introduced,
IORES_DESC_NONE no longer covers the reserved type, so code that only checks
IORES_DESC_NONE could fail to handle reserved ranges.

Do you mean I should not touch these three hunks? If I made a mistake, I
will remove these changes in the next post.

Thanks.
Lianbo
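The effect of the e820.c hunk can be shown with a small user-space sketch. The e820 type values are what I recall from the x86 headers and should be treated as assumptions; desc_before(), desc_after() and count_matches() are hypothetical helper names, and only the four types discussed in the commit message are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values of the four e820 types named in the commit message. */
enum e820_type {
	E820_TYPE_RAM           = 1,
	E820_TYPE_RESERVED      = 2,
	E820_TYPE_UNUSABLE      = 5,
	E820_TYPE_RESERVED_KERN = 128,
};

enum { IORES_DESC_NONE = 0, IORES_DESC_RESERVED = 8 };

const enum e820_type sample_types[] = {
	E820_TYPE_RAM, E820_TYPE_RESERVED,
	E820_TYPE_UNUSABLE, E820_TYPE_RESERVED_KERN,
};

/* Before the patch: all four types collapse into IORES_DESC_NONE. */
int desc_before(enum e820_type type)
{
	(void)type;
	return IORES_DESC_NONE;
}

/* After the patch: the reserved type gets its own descriptor. */
int desc_after(enum e820_type type)
{
	return type == E820_TYPE_RESERVED ? IORES_DESC_RESERVED
					  : IORES_DESC_NONE;
}

/* Count how many sample types a walk for 'wanted' would match. */
int count_matches(int (*to_desc)(enum e820_type), int wanted)
{
	int n = 0;

	for (size_t i = 0; i < sizeof(sample_types) / sizeof(sample_types[0]); i++)
		if (to_desc(sample_types[i]) == wanted)
			n++;
	return n;
}

/* Usage: a walk by descriptor is ambiguous before, exact after. */
void example_usage(void)
{
	assert(count_matches(desc_before, IORES_DESC_NONE) == 4);
	assert(count_matches(desc_after, IORES_DESC_RESERVED) == 1);
}
```

Before the change a walk keyed on IORES_DESC_NONE cannot tell the four types apart; afterwards a walk keyed on IORES_DESC_RESERVED matches only the reserved ranges.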

> 
>> diff --git a/kernel/resource.c b/kernel/resource.c
>> index e81b17b53fa5..ee7348761858 100644
>> --- a/kernel/resource.c
>> +++ b/kernel/resource.c
>> @@ -990,7 +990,7 @@ __reserve_region_with_split(struct resource *root, 
>> resource_size_t start,
>>  res->start = start;
>>  res->end = end;
>>  res->flags = type | IORESOURCE_BUSY;
>> -res->desc = IORES_DESC_NONE;
>> +res->desc = IORES_DESC_RESERVED;
>>  
>>  while (1) {
>>  
>> @@ -1025,7 +1025,7 @@ __reserve_region_with_split(struct resource *root, 
>> resource_size_t start,
>>  next_res->start = conflict->end + 1;
>>  next_res->end = end;
>>  next_res->flags = type | IORESOURCE_BUSY;
>> -next_res->desc = IORES_DESC_NONE;
>> +next_res->desc = IORES_DESC_RESERVED;
>>  }
>>  } else {
>>  res->start = conflict->end

Re: [PATCH 1/3 v9] x86/mm: Change the examination condition to avoid confusion

2019-03-24 Thread lijiang

On 2019-03-23 01:51, Borislav Petkov wrote:
> On Thu, Mar 21, 2019 at 06:33:07PM +0800, Lianbo Jiang wrote:
>> Following the commit <0e4c12b45aa8> ("x86/mm, resource: Use
>> PAGE_KERNEL protection for ioremap of memory pages"),
> 
> The proper commit quotation format is done by adding this to your
> .gitconfig:
> 
> [core]
> abbrev = 12
> [alias]
> one = show -s --pretty='format:%h (\"%s\")'
> 
> and then doing:
> 
> $ git one 
> 
> which will give you
> 
> 0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of 
> memory pages")

Nice. I added them to my .gitconfig. It works. Thank you very much.

> 
>> here it is really checking for the 'IORES_DESC_ACPI_*' values.
> 
> Well, it is not really checking that.

I mean it needs to find all the values of the 'IORES_DESC_ACPI_*' types.

> 
>> Therefore, it is necessary to change the examination condition
>> to avoid confusion.
> 
> What confusion?

As mentioned above, it needs to find all the values of the 'IORES_DESC_ACPI_*'
types, so we should explicitly use the 'IORES_DESC_ACPI_*' types as the check
condition instead of 'IORES_DESC_NONE'.

Thanks.
Lianbo

> 
> The justification for that change sounds really fishy.
> 


Re: [PATCH 1/3] kexec: Do not map the kexec area as decrypted when SEV is active

2019-03-24 Thread lijiang

On 2019-03-24 23:00, Borislav Petkov wrote:
>> Subject: Re: [PATCH 1/3] kexec: Do not map the kexec area as decrypted when 
>> SEV is active
> 
> The tip tree preferred format for patch subject prefixes is
> 'subsys/component:', e.g. 'x86/apic:', 'x86/mm/fault:', 'sched/fair:',
> 'genirq/core:'. Please do not use file names or complete file paths as
> prefix. 'git log path/to/file' should give you a reasonable hint in most
> cases.

Fine, thanks for your advice.

> 
> On Fri, Mar 15, 2019 at 06:32:01PM +0800, Lianbo Jiang wrote:
>> Currently, the arch_kexec_post_{alloc,free}_pages unconditionally
> 
> Please end function names with parentheses.

OK, I will fix them in the next post.

> 
>> maps the kexec area as decrypted. This works fine when SME is active.
>> Because in SME, the first kernel is loaded in decrypted area by the
>> BIOS, so the second kernel must be also loaded into the decrypted
>> memory.
>>
>> When SEV is active, the first kernel is loaded into the encrypted
>> area, so the second kernel must be also loaded into the encrypted
>> memory. Lets make sure that arch_kexec_post_{alloc,free}_pages does
>> not clear the memory encryption mask from the kexec area when SEV
>> is active.
> 
> Hold on, wait a minute!
> 
> Why do we even need this? As usual, you guys never explain what the big
> picture is. So you mention SEV, which sounds to me like you want to be
> able to kexec the SEV *guest*. Yes?

Yes. Just as physical machines support kdump, virtual machines also need
kdump. When a virtual machine panics, we also need to dump its memory for
analysis.

> 
> First of all, why?

For the SEV virtual machine, the memory is also encrypted. When SEV is enabled,
the first kernel is loaded into the encrypted area. With SME, by contrast, the
first kernel is loaded into the decrypted area.

Because of this difference between SME and SEV, we need to properly map the
kexec memory area in order to access it correctly.

> 
> Then, if so...
> 
>> Co-developed-by: Brijesh Singh 
>> Signed-off-by: Brijesh Singh 
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  arch/x86/kernel/machine_kexec_64.c | 8 ++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/machine_kexec_64.c 
>> b/arch/x86/kernel/machine_kexec_64.c
>> index ceba408ea982..bcebf4993da4 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -566,7 +566,10 @@ int arch_kexec_post_alloc_pages(void *vaddr, unsigned 
>> int pages, gfp_t gfp)
>>   * not encrypted because when we boot to the new kernel the
>>   * pages won't be accessed encrypted (initially).
>>   */
>> -return set_memory_decrypted((unsigned long)vaddr, pages);
>> +if (sme_active())
>> +return set_memory_decrypted((unsigned long)vaddr, pages);
> 
> ... then this looks yucky. Because, you're adding an sme_active() check here
> but then __set_memory_enc_dec() checks

For the SEV virtual machine, the kexec memory area is already mapped as
encrypted, so there is no need to invoke this function to change anything.


> 
>   if (!mem_encrypt_active())
> 
> and heads will spin from all the checking of memory encryption aspects.
> 
> So this would need a rework so that there are no multiple confusing
> checks.

Regarding the three functions, here is the comment copied from
arch/x86/mm/mem_encrypt.c; please refer to it.

/*
 * SME and SEV are very similar but they are not the same, so there are
 * times that the kernel will need to distinguish between SME and SEV. The
 * sme_active() and sev_active() functions are used for this.  When a
 * distinction isn't needed, the mem_encrypt_active() function can be used.
 *


Thanks.
Lianbo

> 
> Thx.
> 
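The relationship between the three predicates named in that comment can be sketched as follows. This is modeled on how I understand the kernel of that period defined them and should be read as an assumption, not a quotation; sme_me_mask and sev_enabled are plain globals here so the sketch runs in user space, and the C-bit position used in the usage function is a made-up value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-ins for the kernel variables of the same names. */
uint64_t sme_me_mask;  /* non-zero when memory encryption is in use */
bool sev_enabled;      /* true when running as an SEV guest */

/* Host-side encryption: the mask is set and this is not an SEV guest. */
bool sme_active(void)
{
	return sme_me_mask != 0 && !sev_enabled;
}

/* Guest-side encryption: the mask is set and this is an SEV guest. */
bool sev_active(void)
{
	return sme_me_mask != 0 && sev_enabled;
}

/* Either flavour: true whenever the mask is set at all. */
bool mem_encrypt_active(void)
{
	return sme_me_mask != 0;
}

/* Usage: the two specific predicates are mutually exclusive. */
void example_usage(void)
{
	sme_me_mask = 1ULL << 47;  /* hypothetical C-bit position */
	sev_enabled = true;
	assert(sev_active() && !sme_active() && mem_encrypt_active());

	sev_enabled = false;
	assert(sme_active() && !sev_active() && mem_encrypt_active());
}
```

Read this way, checking sme_active() at the call site and mem_encrypt_active() deeper down is consistent, but it forces the reader to combine two conditions, which is the confusion raised in the review.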


Re: [PATCH 1/2 v8] resource: add the new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-03-16 Thread lijiang



On 2018-12-05 05:33, Lendacky, Thomas wrote:
> On 11/29/2018 09:37 PM, Dave Young wrote:
>> + more people
>>
>> On 11/29/18 at 04:09pm, Lianbo Jiang wrote:
>>> When doing kexec_file_load, the first kernel needs to pass the e820
>>> reserved ranges to the second kernel. But kernel can not exactly
>>> match the e820 reserved ranges when walking through the iomem resources
>>> with the descriptor 'IORES_DESC_NONE', because several e820 types(
>>> e.g. E820_TYPE_RESERVED_KERN/E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820
>>> _TYPE_RESERVED) are converted to the descriptor 'IORES_DESC_NONE'. It
>>> may pass these four types to the kdump kernel, that is not desired result.
>>>
>>> So, this patch adds a new I/O resource descriptor 'IORES_DESC_RESERVED'
>>> for the iomem resources search interfaces. It is helpful to exactly
>>> match the reserved resource ranges when walking through iomem resources.
>>>
>>> In addition, since the new descriptor 'IORES_DESC_RESERVED' is introduced,
>>> these code originally related to the descriptor 'IORES_DESC_NONE' need to
>>> be updated. Otherwise, it will be easily confused and also cause some
>>> errors. Because the 'E820_TYPE_RESERVED' type is converted to the new
>>> descriptor 'IORES_DESC_RESERVED' instead of 'IORES_DESC_NONE', it has been
>>> changed.
>>>
>>> Suggested-by: Dave Young 
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>>  arch/ia64/kernel/efi.c |  4 
>>>  arch/x86/kernel/e820.c |  2 +-
>>>  arch/x86/mm/ioremap.c  | 13 -
>>>  include/linux/ioport.h |  1 +
>>>  kernel/resource.c  |  6 +++---
>>>  5 files changed, 21 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
>>> index 8f106638913c..1841e9b4db30 100644
>>> --- a/arch/ia64/kernel/efi.c
>>> +++ b/arch/ia64/kernel/efi.c
>>> @@ -1231,6 +1231,10 @@ efi_initialize_iomem_resources(struct resource 
>>> *code_resource,
>>> break;
>>>  
>>> case EFI_RESERVED_TYPE:
>>> +   name = "reserved";
>>
>> Ingo updated X86 code to use "Reserved",  I think it would be good to do
>> same for this case as well
>>
>>> +   desc = IORES_DESC_RESERVED;
>>> +   break;
>>> +
>>> case EFI_RUNTIME_SERVICES_CODE:
>>> case EFI_RUNTIME_SERVICES_DATA:
>>> case EFI_ACPI_RECLAIM_MEMORY:
>>
>> Originally, above 3 are all "reserved", so probably they all should be
>> IORES_DESC_RESERVED.
>>
>> Can any IA64 people to review this?
>>
>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>> index 50895c2f937d..57fafdafb860 100644
>>> --- a/arch/x86/kernel/e820.c
>>> +++ b/arch/x86/kernel/e820.c
>>> @@ -1048,10 +1048,10 @@ static unsigned long __init 
>>> e820_type_to_iores_desc(struct e820_entry *entry)
>>> case E820_TYPE_NVS: return IORES_DESC_ACPI_NV_STORAGE;
>>> case E820_TYPE_PMEM:return IORES_DESC_PERSISTENT_MEMORY;
>>> case E820_TYPE_PRAM:return 
>>> IORES_DESC_PERSISTENT_MEMORY_LEGACY;
>>> +   case E820_TYPE_RESERVED:return IORES_DESC_RESERVED;
>>> case E820_TYPE_RESERVED_KERN:   /* Fall-through: */
>>> case E820_TYPE_RAM: /* Fall-through: */
>>> case E820_TYPE_UNUSABLE:/* Fall-through: */
>>> -   case E820_TYPE_RESERVED:/* Fall-through: */
>>> default:return IORES_DESC_NONE;
>>> }
>>>  }
>>> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
>>> index 5378d10f1d31..fea2ef99415d 100644
>>> --- a/arch/x86/mm/ioremap.c
>>> +++ b/arch/x86/mm/ioremap.c
>>> @@ -83,7 +83,18 @@ static bool __ioremap_check_ram(struct resource *res)
>>>  
>>>  static int __ioremap_check_desc_other(struct resource *res)
>>>  {
>>> -   return (res->desc != IORES_DESC_NONE);
>>> +   /*
>>> +* But now, the 'E820_TYPE_RESERVED' type is converted to the new
>>> +* descriptor 'IORES_DESC_RESERVED' instead of 'IORES_DESC_NONE',
>>> +* it has been changed. And the value of 'mem_flags.desc_other'
>>> +* is equal to 'true' if we don't strengthen the condition in this
>>> +* function, that is wrong. Because originally it is equal to
>>> +* 'false' for the same reserved type.
>>> +*
>>> +* So, that would be nice to keep it the same as before.
>>> +*/
>>> +   return ((res->desc != IORES_DESC_NONE) &&
>>> +   (res->desc != IORES_DESC_RESERVED));
>>>  }
>>
>> Added Tom since he added the check function.  Is it possible to only
>> check explict valid desc types instead of exclude IORES_DESC_NONE?
> 
> Sorry for the delay...
> 
> The original intent of the check was to map most memory as encrypted under
> SEV if it was marked with a specific descriptor, since it was likely to
> not be MMIO. I tried converting most things that mapped memory to memremap
> vs ioremap, but ACPI was one area that I left alone and this check catches
> the mapping of the ACPI tables. I

Re: [PATCH 0/3] Add kdump support for the SEV enabled guest

2019-03-15 Thread lijiang

On 2019-03-15 18:32, Lianbo Jiang wrote:
> For the AMD SEV machines, add kdump support when the SEV is enabled.
> 
> Test tools:
> makedumpfile[v1.6.5]:
> git://git.code.sf.net/p/makedumpfile/code
> commit  ("Add support for AMD Secure Memory Encryption")
> Note: This patch was merged into the devel branch.
> 
> crash-7.2.5: https://github.com/crash-utility/crash.git

commit <942d813cda35> ("Fix for the "kmem -i" option on Linux 5.0 and later 
kernels")

> 
> kexec-tools-2.0.19:
> git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
> commit <942d813cda35> ("Fix for the kmem '-i' option on Linux 5.0")
> http://lists.infradead.org/pipermail/kexec/2019-March/022576.html
> Note: The second kernel cannot boot without this patch.
> 
> kernel:
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> commit  ("Merge branch 'akpm' (patches from Andrew)")
> 
> Test steps:
> [1] load the vmlinux and initrd for kdump
> # kexec -p /boot/vmlinuz-5.0.0+ --initrd=/boot/initramfs-5.0.0+kdump.img 
> --command-line="BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.0.0+ ro 
> resume=UUID=126c5e95-fc8b-48d6-a23b-28409198a52e console=ttyS0,115200 
> earlyprintk=serial irqpoll nr_cpus=1 reset_devices cgroup_disable=memory 
> mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail 
> acpi_no_memhotplug transparent_hugepage=never disable_cpu_apicid=0"
> 
> [2] trigger panic
> # echo 1 > /proc/sys/kernel/sysrq
> # echo c > /proc/sysrq-trigger
> 
> [3] check and parse the vmcore
> # crash vmlinux /var/crash/127.0.0.1-2019-03-15-05\:03\:42/vmcore
> 
> Lianbo Jiang (3):
>   kexec: Do not map the kexec area as decrypted when SEV is active
>   kexec: Set the C-bit in the identity map page table when SEV is active
>   kdump,proc/vmcore: Enable kdumping encrypted memory when SEV was
> active
> 
>  arch/x86/kernel/machine_kexec_64.c | 20 +---
>  fs/proc/vmcore.c   |  6 +++---
>  2 files changed, 20 insertions(+), 6 deletions(-)
> 


Re: [PATCH v3] Remove the memory encryption mask to obtain the true physical address

2019-03-11 Thread lijiang

On 2019-03-12 03:43, Kazuhito Hagio wrote:
> -----Original Message-----
>>>> [PATCH v3] Remove the memory encryption mask to obtain the true physical
>>>> address
>>>
>>> I forgot to comment on the subject and the commit log..
>>> I'll change this to
>>>
>>>   x86_64: Add support for AMD Secure Memory Encryption
>>>
>>> On 1/29/2019 9:48 PM, Lianbo Jiang wrote:
>>>> For AMD machine with SME feature, if SME is enabled in the first
>>>> kernel, the crashed kernel's page table(pgd/pud/pmd/pte) contains
>>>> the memory encryption mask, so makedumpfile needs to remove the
>>>> memory encryption mask to obtain the true physical address.
>>>
>>> I added a few official words from some documents:
>>> ---
>>> On AMD machine with Secure Memory Encryption (SME) feature, if SME is
>>> enabled, page tables contain a specific attribute bit (C-bit) in their
>>> entries to indicate whether a page is encrypted or unencrypted.
>>>
>>> So get NUMBER(sme_mask) from vmcoreinfo, which stores the value of
>>> the C-bit position, and drop it to obtain the true physical address.
>>> ---
>>>
>>> If these are OK, I'll modify them when merging, so you don't need
>>> to repost.
>>>
>>
>> It's fine to me. Thank you, Kazu.
>>
>> Regards,
>> Lianbo
>>
>>> And, I'm thinking to merge this after the kernel patch gets merged
>>> into the mainline.
> 
> Hi Lianbo,
> 
> I found your patch upstream. Applied to the devel branch.
> 

Thank you, Kazu.

Regards,
Lianbo

> Thank you!
> Kazu
> 
> 
>>>
>>> Thanks for your work.
>>> Kazu
>>>

>>>> Signed-off-by: Lianbo Jiang 
>>>> ---
>>>> Changes since v1:
>>>> 1. Merge them into a patch.
>>>> 2. The sme_mask is not an enum number, remove it.
>>>> 3. Sanity check whether the sme_mask is in vmcoreinfo.
>>>> 4. Deal with the huge pages case.
>>>> 5. Cover the 5-level path.
>>>>
>>>> Changes since v2:
>>>> 1. Change the sme_me_mask to entry_mask.
>>>> 2. No need to remove the mask when makedumpfile prints out the
>>>>    value of the entry.
>>>> 3. Remove the sme mask from the pte at the end of the __vtop4_x86_64().
>>>> 4. Also need to remove the sme mask from page table entry in
>>>>    find_vmemmap_x86_64()
>>>>
>>>>  arch/x86_64.c  | 30 +++---
>>>>  makedumpfile.c |  4 
>>>>  makedumpfile.h |  1 +
>>>>  3 files changed, 24 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/arch/x86_64.c b/arch/x86_64.c
>>>> index 537fb78..9977466 100644
>>>> --- a/arch/x86_64.c
>>>> +++ b/arch/x86_64.c
>>>> @@ -291,6 +291,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>> pagetable)
>>>>  unsigned long page_dir, pgd, pud_paddr, pud_pte, pmd_paddr, pmd_pte;
>>>>  unsigned long pte_paddr, pte;
>>>>  unsigned long p4d_paddr, p4d_pte;
>>>> +unsigned long entry_mask = ENTRY_MASK;
>>>>
>>>>  /*
>>>>   * Get PGD.
>>>> @@ -302,6 +303,9 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>> pagetable)
>>>>  return NOT_PADDR;
>>>>  }
>>>>
>>>> +if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>>>> +entry_mask &= ~(NUMBER(sme_mask));
>>>> +
>>>>  if (check_5level_paging()) {
>>>>  page_dir += pgd5_index(vaddr) * sizeof(unsigned long);
>>>>  if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>>>> @@ -318,7 +322,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>> pagetable)
>>>>  /*
>>>>   * Get P4D.
>>>>   */
>>>> -p4d_paddr  = pgd & ENTRY_MASK;
>>>> +p4d_paddr  = pgd & entry_mask;
>>>>  p4d_paddr += p4d_index(vaddr) * sizeof(unsigned long);
>>>>  if (!readmem(PADDR, p4d_paddr, &p4d_pte, sizeof p4d_pte)) {
>>>>  ERRMSG("Can't get p4d_pte (p4d_paddr:%lx).\n", 
>>>> p4d_paddr);
>>>> @@ -331,7 +335,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>> pagetable)
>>>>  ERRMSG("Can't get a valid p4d_pte.\n");
>>>>  return NOT_PADDR;
>>>>  }
>>>> -pud_paddr  = p4d_pte & ENTRY_MASK;
>>>> +pud_paddr  = p4d_pte & entry_mask;
>>>>  }else {
>>>>  page_dir += pgd_index(vaddr) * sizeof(unsigned long);
>>>>  if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>>>> @@ -345,7 +349,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>> pagetable)
>>>>  ERRMSG("Can't get a valid pgd.\n");
>>>>  return NOT_PADDR;
>>>>  }
>>>> -pud_paddr  = pgd & ENTRY_MASK;
>>>> +pud_paddr  = pgd & entry_mask;
>>>>  }
>>>>
>>>>  /*
>>>> @@ -364,13 +368,13 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>> pagetable)
>>>>  return NOT_PADDR;
>>>>  }
>>>>  if (pud_pte & _PAGE_PSE)/* 1GB pages */
>>>> -return (pud_pte & ENTRY_MASK & PUD_MASK) +
>>>> +return (pud_pte & entry_mask & PUD_MASK) +
>>>>  (vaddr & ~PUD_MASK);
>>>>
>>>>  /*
>>>>   * Get PMD.
>>>>   */
>>>> -pmd_paddr  = pud_pte & ENTRY_MASK;
>>>> +pmd_paddr  = pud_pte &

Re: [PATCH v3] Remove the memory encryption mask to obtain the true physical address

2019-02-09 Thread lijiang

On 2019-02-05 01:12, Kazuhito Hagio wrote:
>> [PATCH v3] Remove the memory encryption mask to obtain the true physical 
>> address
> 
> I forgot to comment on the subject and the commit log..
> I'll change this to
> 
>   x86_64: Add support for AMD Secure Memory Encryption
> 
> On 1/29/2019 9:48 PM, Lianbo Jiang wrote:
>> For AMD machine with SME feature, if SME is enabled in the first
>> kernel, the crashed kernel's page table(pgd/pud/pmd/pte) contains
>> the memory encryption mask, so makedumpfile needs to remove the
>> memory encryption mask to obtain the true physical address.
> 
> I added a few official words from some documents:
> ---
> On AMD machine with Secure Memory Encryption (SME) feature, if SME is
> enabled, page tables contain a specific attribute bit (C-bit) in their
> entries to indicate whether a page is encrypted or unencrypted.
> 
> So get NUMBER(sme_mask) from vmcoreinfo, which stores the value of
> the C-bit position, and drop it to obtain the true physical address.
> ---
> 
> If these are OK, I'll modify them when merging, so you don't need
> to repost.
> 

It's fine to me. Thank you, Kazu.

Regards,
Lianbo
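The masking described in the commit log above can be sketched in isolation. This is a minimal user-space illustration with assumptions stated plainly: the ENTRY_MASK value is what I recall from makedumpfile's x86_64 definitions, the C-bit position (bit 47) is a made-up example value, and make_entry_mask() and pte_to_paddr() are hypothetical helper names, not makedumpfile functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed: keeps the address bits of an x86_64 page table entry. */
#define ENTRY_MASK       (~0xfff0000000000fffULL)
#define PAGE_OFFSET_BITS 0xfffULL

/*
 * Build the effective mask once: start from ENTRY_MASK and, if the
 * dump provides an SME mask, also drop the encryption (C) bit.
 */
uint64_t make_entry_mask(uint64_t sme_mask, bool sme_mask_found)
{
	uint64_t entry_mask = ENTRY_MASK;

	if (sme_mask_found)
		entry_mask &= ~sme_mask;
	return entry_mask;
}

/* Final step of a 4KB translation: frame address plus page offset. */
uint64_t pte_to_paddr(uint64_t pte, uint64_t vaddr, uint64_t entry_mask)
{
	return (pte & entry_mask) + (vaddr & PAGE_OFFSET_BITS);
}

/* Usage: without removing the C-bit the result is not a real address. */
void example_usage(void)
{
	uint64_t sme_mask = 1ULL << 47;  /* hypothetical C-bit */
	uint64_t pte = sme_mask | 0x12345000ULL | 0x63ULL;
	uint64_t vaddr = 0xffff888000000678ULL;

	assert(pte_to_paddr(pte, vaddr, make_entry_mask(sme_mask, true))
	       == 0x12345678ULL);
	assert(pte_to_paddr(pte, vaddr, make_entry_mask(0, false))
	       == 0x800012345678ULL);
}
```

The second assertion shows the failure mode the patch addresses: with the C-bit left in place, the computed value points far outside the real physical address.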

> And, I'm thinking to merge this after the kernel patch gets merged
> into the mainline.
> 
> Thanks for your work.
> Kazu
> 
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> Changes since v1:
>> 1. Merge them into a patch.
>> 2. The sme_mask is not an enum number, remove it.
>> 3. Sanity check whether the sme_mask is in vmcoreinfo.
>> 4. Deal with the huge pages case.
>> 5. Cover the 5-level path.
>>
>> Changes since v2:
>> 1. Change the sme_me_mask to entry_mask.
>> 2. No need to remove the mask when makedumpfile prints out the
>>value of the entry.
>> 3. Remove the sme mask from the pte at the end of the __vtop4_x86_64().
>> 4. Also need to remove the sme mask from page table entry in
>>find_vmemmap_x86_64()
>>
>>  arch/x86_64.c  | 30 +++---
>>  makedumpfile.c |  4 
>>  makedumpfile.h |  1 +
>>  3 files changed, 24 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86_64.c b/arch/x86_64.c
>> index 537fb78..9977466 100644
>> --- a/arch/x86_64.c
>> +++ b/arch/x86_64.c
>> @@ -291,6 +291,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  unsigned long page_dir, pgd, pud_paddr, pud_pte, pmd_paddr, pmd_pte;
>>  unsigned long pte_paddr, pte;
>>  unsigned long p4d_paddr, p4d_pte;
>> +unsigned long entry_mask = ENTRY_MASK;
>>
>>  /*
>>   * Get PGD.
>> @@ -302,6 +303,9 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>  }
>>
>> +if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>> +entry_mask &= ~(NUMBER(sme_mask));
>> +
>>  if (check_5level_paging()) {
>>  page_dir += pgd5_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>> @@ -318,7 +322,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  /*
>>   * Get P4D.
>>   */
>> -p4d_paddr  = pgd & ENTRY_MASK;
>> +p4d_paddr  = pgd & entry_mask;
>>  p4d_paddr += p4d_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, p4d_paddr, &p4d_pte, sizeof p4d_pte)) {
>>  ERRMSG("Can't get p4d_pte (p4d_paddr:%lx).\n", 
>> p4d_paddr);
>> @@ -331,7 +335,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  ERRMSG("Can't get a valid p4d_pte.\n");
>>  return NOT_PADDR;
>>  }
>> -pud_paddr  = p4d_pte & ENTRY_MASK;
>> +pud_paddr  = p4d_pte & entry_mask;
>>  }else {
>>  page_dir += pgd_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>> @@ -345,7 +349,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  ERRMSG("Can't get a valid pgd.\n");
>>  return NOT_PADDR;
>>  }
>> -pud_paddr  = pgd & ENTRY_MASK;
>> +pud_paddr  = pgd & entry_mask;
>>  }
>>
>>  /*
>> @@ -364,13 +368,13 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>  }
>>  if (pud_pte & _PAGE_PSE)/* 1GB pages */
>> -return (pud_pte & ENTRY_MASK & PUD_MASK) +
>> +return (pud_pte & entry_mask & PUD_MASK) +
>>  (vaddr & ~PUD_MASK);
>>
>>  /*
>>   * Get PMD.
>>   */
>> -pmd_paddr  = pud_pte & ENTRY_MASK;
>> +pmd_paddr  = pud_pte & entry_mask;
>>  pmd_paddr += pmd_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, pmd_paddr, &pmd_pte, sizeof pmd_pte)) {
>>  ERRMSG("Can't get pmd_pte (pmd_paddr:%lx).\n", pmd_paddr);
>> @@ -384,13 +388,13 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>

Re: [PATCH v2] Remove the memory encryption mask to obtain the true physical address

2019-01-29 Thread lijiang

On 2019-01-28 22:24, Lendacky, Thomas wrote:
> On 1/27/19 11:46 PM, Lianbo Jiang wrote:
>> For AMD machine with SME feature, if SME is enabled in the first
>> kernel, the crashed kernel's page table(pgd/pud/pmd/pte) contains
>> the memory encryption mask, so makedumpfile needs to remove the
>> memory encryption mask to obtain the true physical address.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>> Changes since v1:
>> 1. Merge them into a patch.
>> 2. The sme_mask is not an enum number, remove it.
>> 3. Sanity check whether the sme_mask is in vmcoreinfo.
>> 4. Deal with the huge pages case.
>> 5. Cover the 5-level path.
>>
>>  arch/x86_64.c  | 30 +-
>>  makedumpfile.c |  4 
>>  makedumpfile.h |  1 +
>>  3 files changed, 22 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86_64.c b/arch/x86_64.c
>> index 537fb78..7b3ed10 100644
>> --- a/arch/x86_64.c
>> +++ b/arch/x86_64.c
>> @@ -291,6 +291,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  unsigned long page_dir, pgd, pud_paddr, pud_pte, pmd_paddr, pmd_pte;
>>  unsigned long pte_paddr, pte;
>>  unsigned long p4d_paddr, p4d_pte;
>> +unsigned long sme_me_mask = ~0UL;
>>  
>>  /*
>>   * Get PGD.
>> @@ -302,6 +303,9 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>  }
>>  
>> +if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>> +sme_me_mask = ~(NUMBER(sme_mask));
> 
> This is a bit confusing since this isn't the sme_me_mask any more, but the
> complement. Might want to somehow rename this so that it doesn't cause any
> confusion.
>
Thanks for your comment.

I will change the sme_me_mask to entry_mask in patch v3.
 
>> +
>>  if (check_5level_paging()) {
>>  page_dir += pgd5_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>> @@ -309,7 +313,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>  }
>>  if (info->vaddr_for_vtop == vaddr)
>> -MSG("  PGD : %16lx => %16lx\n", page_dir, pgd);
>> +MSG("  PGD : %16lx => %16lx\n", page_dir, (pgd & 
>> sme_me_mask));
> 
> No need to remove the mask here.  You're just printing out the value of
> the entry. It might be nice to know whether the encryption bit is set or
> not - after all, ENTRY_MASK is still part of this value.
> 

OK, I will remove it in patch v3.

>>  
>>  if (!(pgd & _PAGE_PRESENT)) {
>>  ERRMSG("Can't get a valid pgd.\n");
>> @@ -318,20 +322,20 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  /*
>>   * Get P4D.
>>   */
>> -p4d_paddr  = pgd & ENTRY_MASK;
>> +p4d_paddr  = pgd & ENTRY_MASK & sme_me_mask;
> 
> This goes back to my original comment that you should just make a local
> variable that is "ENTRY_MASK & ~(NUMBER(sme_mask))" since you are
> performing this ANDing everywhere ENTRY_MASK is used - except then you
> miss the one at the very end of this routine on the return statement.
> 

OK, I will fix them in patch v3.
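The suggestion to fold both masks into one local variable can be illustrated with the 1GB huge page case, where the translation ends at the PUD level. This is a sketch under assumptions: the ENTRY_MASK value and PUD_SHIFT of 30 are what I recall from makedumpfile's x86_64 definitions, the C-bit at bit 47 is a made-up example, and huge_1g_paddr() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 definitions, written out so the sketch is standalone. */
#define ENTRY_MASK (~0xfff0000000000fffULL)
#define PUD_SHIFT  30
#define PUD_MASK   (~((1ULL << PUD_SHIFT) - 1))

/*
 * 1GB page: take the frame from the PUD entry and the low 30 bits
 * from the virtual address. entry_mask is computed once by the caller
 * as ENTRY_MASK & ~sme_mask, so no site needs its own extra AND.
 */
uint64_t huge_1g_paddr(uint64_t pud_pte, uint64_t vaddr, uint64_t entry_mask)
{
	return (pud_pte & entry_mask & PUD_MASK) + (vaddr & ~PUD_MASK);
}

/* Usage: the C-bit in the PUD entry does not leak into the result. */
void example_usage(void)
{
	uint64_t sme_mask = 1ULL << 47;  /* hypothetical C-bit */
	uint64_t entry_mask = ENTRY_MASK & ~sme_mask;
	uint64_t pud_pte = sme_mask | 0x40000000ULL | 0xe3ULL;
	uint64_t vaddr = 0xffff888012345678ULL;

	assert(huge_1g_paddr(pud_pte, vaddr, entry_mask) == 0x52345678ULL);
}
```

Keeping a single entry_mask makes it harder to miss a site, such as the final return statement pointed out in the review.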

>>  p4d_paddr += p4d_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, p4d_paddr, &p4d_pte, sizeof p4d_pte)) {
>>  ERRMSG("Can't get p4d_pte (p4d_paddr:%lx).\n", 
>> p4d_paddr);
>>  return NOT_PADDR;
>>  }
>>  if (info->vaddr_for_vtop == vaddr)
>> -MSG("  P4D : %16lx => %16lx\n", p4d_paddr, p4d_pte);
>> +MSG("  P4D : %16lx => %16lx\n", p4d_paddr, (p4d_pte & 
>> sme_me_mask));
>>  
>>  if (!(p4d_pte & _PAGE_PRESENT)) {
>>  ERRMSG("Can't get a valid p4d_pte.\n");
>>  return NOT_PADDR;
>>  }
>> -pud_paddr  = p4d_pte & ENTRY_MASK;
>> +pud_paddr  = p4d_pte & ENTRY_MASK & sme_me_mask;
>>  }else {
>>  page_dir += pgd_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>> @@ -339,13 +343,13 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>  }
>>  if (info->vaddr_for_vtop == vaddr)
>> -MSG("  PGD : %16lx => %16lx\n", page_dir, pgd);
>> +MSG("  PGD : %16lx => %16lx\n", page_dir, (pgd & 
>> sme_me_mask));
>>  
>>  if (!(pgd & _PAGE_PRESENT)) {
>>  ERRMSG("Can't get a valid pgd.\n");
>>  return NOT_PADDR;
>>  }
>> -pud_paddr  = pgd & ENTRY_MASK;
>> +pud_paddr  = pgd & ENTRY_MASK & sme_me_mask;
>>  }
>>  
>>  /*
>> @@ -357,47 +361,47 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>

Re: [PATCH v2] Remove the memory encryption mask to obtain the true physical address

2019-01-29 Thread lijiang

On 2019-01-29 03:45, Kazuhito Hagio wrote:
> On 1/28/2019 9:24 AM, Lendacky, Thomas wrote:
>> On 1/27/19 11:46 PM, Lianbo Jiang wrote:
>>> For AMD machine with SME feature, if SME is enabled in the first
>>> kernel, the crashed kernel's page table(pgd/pud/pmd/pte) contains
>>> the memory encryption mask, so makedumpfile needs to remove the
>>> memory encryption mask to obtain the true physical address.
>>>
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>> Changes since v1:
>>> 1. Merge them into a patch.
>>> 2. The sme_mask is not an enum number, remove it.
>>> 3. Sanity check whether the sme_mask is in vmcoreinfo.
>>> 4. Deal with the huge pages case.
>>> 5. Cover the 5-level path.
>>>
>>>  arch/x86_64.c  | 30 +-
>>>  makedumpfile.c |  4 
>>>  makedumpfile.h |  1 +
>>>  3 files changed, 22 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/x86_64.c b/arch/x86_64.c
>>> index 537fb78..7b3ed10 100644
>>> --- a/arch/x86_64.c
>>> +++ b/arch/x86_64.c
>>> @@ -291,6 +291,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>> pagetable)
>>> unsigned long page_dir, pgd, pud_paddr, pud_pte, pmd_paddr, pmd_pte;
>>> unsigned long pte_paddr, pte;
>>> unsigned long p4d_paddr, p4d_pte;
>>> +   unsigned long sme_me_mask = ~0UL;
>>>
>>> /*
>>>  * Get PGD.
>>> @@ -302,6 +303,9 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>> pagetable)
>>> return NOT_PADDR;
>>> }
>>>
>>> +   if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>>> +   sme_me_mask = ~(NUMBER(sme_mask));
>>
>> This is a bit confusing since this isn't the sme_me_mask any more, but the
>> complement. Might want to somehow rename this so that it doesn't cause any
>> confusion.
>>
>>> +
>>> if (check_5level_paging()) {
>>> page_dir += pgd5_index(vaddr) * sizeof(unsigned long);
>>> if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>>> @@ -309,7 +313,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>> pagetable)
>>> return NOT_PADDR;
>>> }
>>> if (info->vaddr_for_vtop == vaddr)
>>> -   MSG("  PGD : %16lx => %16lx\n", page_dir, pgd);
>>> +   MSG("  PGD : %16lx => %16lx\n", page_dir, (pgd & 
>>> sme_me_mask));
>>
>> No need to remove the mask here.  You're just printing out the value of
>> the entry. It might be nice to know whether the encryption bit is set or
>> not - after all, ENTRY_MASK is still part of this value.
> 
> Agreed.
> 
>>
>>>
>>> if (!(pgd & _PAGE_PRESENT)) {
>>> ERRMSG("Can't get a valid pgd.\n");
>>> @@ -318,20 +322,20 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>> pagetable)
>>> /*
>>>  * Get P4D.
>>>  */
>>> -   p4d_paddr  = pgd & ENTRY_MASK;
>>> +   p4d_paddr  = pgd & ENTRY_MASK & sme_me_mask;
>>
>> This goes back to my original comment that you should just make a local
>> variable that is "ENTRY_MASK & ~(NUMBER(sme_mask))" since you are
>> performing this ANDing everywhere ENTRY_MASK is used - except then you
>> miss the one at the very end of this routine on the return statement.
> 
> This was my idea I said to Lianbo before seeing your comment, but
> yes, including ENTRY_MASK in a local variable is better than that.
> Thanks for your good suggestion.
> 
> As for the variable's name, I think that "entry_mask" is good enough,
> but any better name?
> 
>   unsigned long entry_mask = ENTRY_MASK;
> 
>   if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>   entry_mask &= ~(NUMBER(sme_mask));
>   ...
>   p4d_paddr = pgd & entry_mask;
> 
Ok. Thanks.

> And, I found that the find_vmemmap_x86_64() function also uses the
> page table for the -e option and looks to be affected by SME.
> Lianbo, would you fix the function, too?
> 

Yes, it's my pleasure. Thank you, Kazu.

I will fix this function in patch v3.

Regards,
Lianbo
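
For readers following the thread, the combined-mask idea discussed above can be sketched on its own. This is only an illustration under assumptions, not the v3 patch: NOT_FOUND_NUMBER_X and ENTRY_MASK_X are local stand-ins for makedumpfile's NOT_FOUND_NUMBER and ENTRY_MASK, the vmcoreinfo value is passed as a plain parameter instead of being read via NUMBER(sme_mask), and build_entry_mask() and next_table_paddr() are hypothetical helper names. A 64-bit unsigned long is assumed.

```c
/* Local stand-ins (assumptions); the real macros live in makedumpfile.h. */
#define NOT_FOUND_NUMBER_X (-1L)
#define ENTRY_MASK_X       0x000ffffffffff000UL  /* bits 12..51 of an entry */

/*
 * Hypothetical helper: fold the encryption mask into the entry mask once,
 * so that every later use of the mask strips the encryption bit too.
 */
unsigned long build_entry_mask(long sme_mask)
{
	unsigned long entry_mask = ENTRY_MASK_X;

	/* Only strip the bit when vmcoreinfo actually provided a value. */
	if (sme_mask != NOT_FOUND_NUMBER_X)
		entry_mask &= ~(unsigned long)sme_mask;

	return entry_mask;
}

/* Hypothetical helper: physical address of the next-level table. */
unsigned long next_table_paddr(unsigned long entry, unsigned long entry_mask)
{
	return entry & entry_mask;
}
```

With one mask built up front, the same value serves every level of the walk and the final return statement, which is the case Tom pointed out as easy to miss.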

> Thanks,
> Kazu
> 
>>
>>> p4d_paddr += p4d_index(vaddr) * sizeof(unsigned long);
>>> if (!readmem(PADDR, p4d_paddr, &p4d_pte, sizeof p4d_pte)) {
>>> ERRMSG("Can't get p4d_pte (p4d_paddr:%lx).\n", 
>>> p4d_paddr);
>>> return NOT_PADDR;
>>> }
>>> if (info->vaddr_for_vtop == vaddr)
>>> -   MSG("  P4D : %16lx => %16lx\n", p4d_paddr, p4d_pte);
>>> +   MSG("  P4D : %16lx => %16lx\n", p4d_paddr, (p4d_pte & 
>>> sme_me_mask));
>>>
>>> if (!(p4d_pte & _PAGE_PRESENT)) {
>>> ERRMSG("Can't get a valid p4d_pte.\n");
>>> return NOT_PADDR;
>>> }
>>> -   pud_paddr  = p4d_pte & ENTRY_MASK;
>>> +   pud_paddr  = p4d_pte & ENTRY_MASK & sme_me_mask;
>>> }else {
>>> page_dir += pgd_index(vaddr) * sizeof(unsigned long);
>>> if (!readmem(PADDR, page_dir, &pgd, sizeof pgd)) {
>>> @@ -339,13 +343,13 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>

Re: [PATCH 2/2] Remove the memory encryption mask to obtain the true physical address

2019-01-27 Thread lijiang

On 2019-01-28 09:55, lijiang wrote:
> On 2019-01-25 22:32, Lendacky, Thomas wrote:
>> On 1/24/19 9:55 PM, dyo...@redhat.com wrote:
>>> + Tom
>>> On 01/25/19 at 11:06am, lijiang wrote:
>>>> On 2019-01-24 06:16, Kazuhito Hagio wrote:
>>>>> On 1/22/2019 3:03 AM, Lianbo Jiang wrote:
>>>>>> For AMD machine with SME feature, if SME is enabled in the first
>>>>>> kernel, the crashed kernel's page table(pgd/pud/pmd/pte) contains
>>>>>> the memory encryption mask, so makedumpfile needs to remove the
>>>>>> memory encryption mask to obtain the true physical address.
>>>>>>
>>>>>> Signed-off-by: Lianbo Jiang 
>>>>>> ---
>>>>>>  arch/x86_64.c  | 3 +++
>>>>>>  makedumpfile.c | 1 +
>>>>>>  2 files changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86_64.c b/arch/x86_64.c
>>>>>> index 537fb78..7651d36 100644
>>>>>> --- a/arch/x86_64.c
>>>>>> +++ b/arch/x86_64.c
>>>>>> @@ -346,6 +346,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>>>> pagetable)
>>>>>>  return NOT_PADDR;
>>>>>>  }
>>>>>>  pud_paddr  = pgd & ENTRY_MASK;
>>>>>> +pud_paddr = pud_paddr & ~(NUMBER(sme_mask));
>>>>>>  }
>>>>>>
>>>>>>  /*
>>>>>> @@ -371,6 +372,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>>>> pagetable)
>>>>>>   * Get PMD.
>>>>>>   */
>>>>>>  pmd_paddr  = pud_pte & ENTRY_MASK;
>>>>>> +pmd_paddr = pmd_paddr & ~(NUMBER(sme_mask));
>>>>>>  pmd_paddr += pmd_index(vaddr) * sizeof(unsigned long);
>>>>>>  if (!readmem(PADDR, pmd_paddr, &pmd_pte, sizeof pmd_pte)) {
>>>>>>  ERRMSG("Can't get pmd_pte (pmd_paddr:%lx).\n", 
>>>>>> pmd_paddr);
>>>>>> @@ -391,6 +393,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>>>>> pagetable)
>>>>>>   * Get PTE.
>>>>>>   */
>>>>>>  pte_paddr  = pmd_pte & ENTRY_MASK;
>>>>>> +pte_paddr = pte_paddr & ~(NUMBER(sme_mask));
>>>>>>  pte_paddr += pte_index(vaddr) * sizeof(unsigned long);
>>>>>>  if (!readmem(PADDR, pte_paddr, &pte, sizeof pte)) {
>>>>>>  ERRMSG("Can't get pte (pte_paddr:%lx).\n", pte_paddr);
>>>>>> diff --git a/makedumpfile.c b/makedumpfile.c
>>>>>> index a03aaa1..81c7bb4 100644
>>>>>> --- a/makedumpfile.c
>>>>>> +++ b/makedumpfile.c
>>>>>> @@ -977,6 +977,7 @@ next_page:
>>>>>>  read_size = MIN(info->page_size - PAGEOFFSET(paddr), size);
>>>>>>
>>>>>>  pgaddr = PAGEBASE(paddr);
>>>>>> +pgaddr = pgaddr & ~(NUMBER(sme_mask));
>>>>>
>>>>> Since NUMBER(sme_mask) is initialized with -1 (NOT_FOUND_NUMBER),
>>>>> if the sme_mask is not in vmcoreinfo, ~(NUMBER(sme_mask)) will be 0.
>>>>> So the four lines added above need
>>>>>
>>>>>   if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>>>>> ...
>>>>>
>>>>
>>>> Thank you very much for pointing out my mistake.
>>>>
>>>> I will improve it and post again.
>>
>> Might be worth creating a local variable that includes ENTRY_MASK and
>> NUMBER(sme_mask) so that you make the check just once. Then use that
>> variable in place of ENTRY_MASK in the remainder of the function so
>> that the correct value is used throughout.
>>

Ok.

>> This would also cover the 5-level path which would make this future
>> proof should AMD someday support 5-level paging.
>>
> 
> Thank you, Tom. Makedumpfile will cover the 5-level path in next post,
> though AMD does not support 5-level paging yet.
> 

I mean that I will improve this patch and cover the 5-level path in patch v2.

Thanks.

> Thanks.
> Lianbo
> 
>>>>
>>>>> and, what I'm wondering is whether it doesn't need to take hugepages
>>>>> into account such as this
>>>>&

Re: [PATCH 1/2 v8] resource: add the new I/O resource descriptor 'IORES_DESC_RESERVED'

2019-01-25 Thread lijiang

On 2018-12-05 05:33, Lendacky, Thomas wrote:
> On 11/29/2018 09:37 PM, Dave Young wrote:
>> + more people
>>
>> On 11/29/18 at 04:09pm, Lianbo Jiang wrote:
>>> When doing kexec_file_load, the first kernel needs to pass the e820
>>> reserved ranges to the second kernel. But kernel can not exactly
>>> match the e820 reserved ranges when walking through the iomem resources
>>> with the descriptor 'IORES_DESC_NONE', because several e820 types(
>>> e.g. E820_TYPE_RESERVED_KERN/E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820
>>> _TYPE_RESERVED) are converted to the descriptor 'IORES_DESC_NONE'. It
>>> may pass these four types to the kdump kernel, that is not desired result.
>>>
>>> So, this patch adds a new I/O resource descriptor 'IORES_DESC_RESERVED'
>>> for the iomem resources search interfaces. It is helpful to exactly
>>> match the reserved resource ranges when walking through iomem resources.
>>>
>>> In addition, since the new descriptor 'IORES_DESC_RESERVED' is introduced,
>>> these code originally related to the descriptor 'IORES_DESC_NONE' need to
>>> be updated. Otherwise, it will be easily confused and also cause some
>>> errors. Because the 'E820_TYPE_RESERVED' type is converted to the new
>>> descriptor 'IORES_DESC_RESERVED' instead of 'IORES_DESC_NONE', it has been
>>> changed.
>>>
>>> Suggested-by: Dave Young 
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>>  arch/ia64/kernel/efi.c |  4 
>>>  arch/x86/kernel/e820.c |  2 +-
>>>  arch/x86/mm/ioremap.c  | 13 -
>>>  include/linux/ioport.h |  1 +
>>>  kernel/resource.c  |  6 +++---
>>>  5 files changed, 21 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
>>> index 8f106638913c..1841e9b4db30 100644
>>> --- a/arch/ia64/kernel/efi.c
>>> +++ b/arch/ia64/kernel/efi.c
>>> @@ -1231,6 +1231,10 @@ efi_initialize_iomem_resources(struct resource 
>>> *code_resource,
>>> break;
>>>  
>>> case EFI_RESERVED_TYPE:
>>> +   name = "reserved";
>>
>> Ingo updated X86 code to use "Reserved",  I think it would be good to do
>> same for this case as well
>>
>>> +   desc = IORES_DESC_RESERVED;
>>> +   break;
>>> +
>>> case EFI_RUNTIME_SERVICES_CODE:
>>> case EFI_RUNTIME_SERVICES_DATA:
>>> case EFI_ACPI_RECLAIM_MEMORY:
>>
>> Originally, above 3 are all "reserved", so probably they all should be
>> IORES_DESC_RESERVED.
>>
>> Can any IA64 people to review this?
>>
>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>> index 50895c2f937d..57fafdafb860 100644
>>> --- a/arch/x86/kernel/e820.c
>>> +++ b/arch/x86/kernel/e820.c
>>> @@ -1048,10 +1048,10 @@ static unsigned long __init 
>>> e820_type_to_iores_desc(struct e820_entry *entry)
>>> case E820_TYPE_NVS: return IORES_DESC_ACPI_NV_STORAGE;
>>> case E820_TYPE_PMEM:return IORES_DESC_PERSISTENT_MEMORY;
>>> case E820_TYPE_PRAM:return 
>>> IORES_DESC_PERSISTENT_MEMORY_LEGACY;
>>> +   case E820_TYPE_RESERVED:return IORES_DESC_RESERVED;
>>> case E820_TYPE_RESERVED_KERN:   /* Fall-through: */
>>> case E820_TYPE_RAM: /* Fall-through: */
>>> case E820_TYPE_UNUSABLE:/* Fall-through: */
>>> -   case E820_TYPE_RESERVED:/* Fall-through: */
>>> default:return IORES_DESC_NONE;
>>> }
>>>  }
>>> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
>>> index 5378d10f1d31..fea2ef99415d 100644
>>> --- a/arch/x86/mm/ioremap.c
>>> +++ b/arch/x86/mm/ioremap.c
>>> @@ -83,7 +83,18 @@ static bool __ioremap_check_ram(struct resource *res)
>>>  
>>>  static int __ioremap_check_desc_other(struct resource *res)
>>>  {
>>> -   return (res->desc != IORES_DESC_NONE);
>>> +   /*
>>> +* But now, the 'E820_TYPE_RESERVED' type is converted to the new
>>> +* descriptor 'IORES_DESC_RESERVED' instead of 'IORES_DESC_NONE',
>>> +* it has been changed. And the value of 'mem_flags.desc_other'
>>> +* is equal to 'true' if we don't strengthen the condition in this
>>> +* function, that is wrong. Because originally it is equal to
>>> +* 'false' for the same reserved type.
>>> +*
>>> +* So, that would be nice to keep it the same as before.
>>> +*/
>>> +   return ((res->desc != IORES_DESC_NONE) &&
>>> +   (res->desc != IORES_DESC_RESERVED));
>>>  }
>>
>> Added Tom since he added the check function.  Is it possible to only
>> check explict valid desc types instead of exclude IORES_DESC_NONE?
> 
> Sorry for the delay...
> 
> The original intent of the check was to map most memory as encrypted under
> SEV if it was marked with a specific descriptor, since it was likely to
> not be MMIO. I tried converting most things that mapped memory to memremap
> vs ioremap, but ACPI was one area that I left alone and this check catches
> the mapping of the ACPI tables. I

Re: [PATCH 2/2] Remove the memory encryption mask to obtain the true physical address

2019-01-24 Thread lijiang

On 2019-01-25 03:33, Kazuhito Hagio wrote:
> On 1/23/2019 5:16 PM, Kazuhito Hagio wrote:
>> On 1/22/2019 3:03 AM, Lianbo Jiang wrote:
>>> For AMD machine with SME feature, if SME is enabled in the first
>>> kernel, the crashed kernel's page table(pgd/pud/pmd/pte) contains
>>> the memory encryption mask, so makedumpfile needs to remove the
>>> memory encryption mask to obtain the true physical address.
>>>
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>>  arch/x86_64.c  | 3 +++
>>>  makedumpfile.c | 1 +
>>>  2 files changed, 4 insertions(+)
>>>
>>> diff --git a/arch/x86_64.c b/arch/x86_64.c
>>> index 537fb78..7651d36 100644
>>> --- a/arch/x86_64.c
>>> +++ b/arch/x86_64.c
>>> @@ -346,6 +346,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>> pagetable)
>>> return NOT_PADDR;
>>> }
>>> pud_paddr  = pgd & ENTRY_MASK;
>>> +   pud_paddr = pud_paddr & ~(NUMBER(sme_mask));
>>> }
>>>
>>> /*
>>> @@ -371,6 +372,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>> pagetable)
>>>  * Get PMD.
>>>  */
>>> pmd_paddr  = pud_pte & ENTRY_MASK;
>>> +   pmd_paddr = pmd_paddr & ~(NUMBER(sme_mask));
>>> pmd_paddr += pmd_index(vaddr) * sizeof(unsigned long);
>>> if (!readmem(PADDR, pmd_paddr, &pmd_pte, sizeof pmd_pte)) {
>>> ERRMSG("Can't get pmd_pte (pmd_paddr:%lx).\n", pmd_paddr);
>>> @@ -391,6 +393,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>>> pagetable)
>>>  * Get PTE.
>>>  */
>>> pte_paddr  = pmd_pte & ENTRY_MASK;
>>> +   pte_paddr = pte_paddr & ~(NUMBER(sme_mask));
>>> pte_paddr += pte_index(vaddr) * sizeof(unsigned long);
>>> if (!readmem(PADDR, pte_paddr, &pte, sizeof pte)) {
>>> ERRMSG("Can't get pte (pte_paddr:%lx).\n", pte_paddr);
>>> diff --git a/makedumpfile.c b/makedumpfile.c
>>> index a03aaa1..81c7bb4 100644
>>> --- a/makedumpfile.c
>>> +++ b/makedumpfile.c
>>> @@ -977,6 +977,7 @@ next_page:
>>> read_size = MIN(info->page_size - PAGEOFFSET(paddr), size);
>>>
>>> pgaddr = PAGEBASE(paddr);
>>> +   pgaddr = pgaddr & ~(NUMBER(sme_mask));
>>
>> Since NUMBER(sme_mask) is initialized with -1 (NOT_FOUND_NUMBER),
>> if the sme_mask is not in vmcoreinfo, ~(NUMBER(sme_mask)) will be 0.
>> So the four lines added above need
>>
>>   if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>> ...
> 
> Considering hugepage and the code, it might be better to add
> a local variable for the mask value to __vtop4_x86_64() function
> and mask it without condition, for example
> 
>   unsigned long sme_mask = ~0UL;
> 
>   if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
>   sme_mask = ~(NUMBER(sme_mask));
>   ...
>   pud_paddr = pgd & ENTRY_MASK & sme_mask;
> 
> to avoid adding lots of 'if' statements.
> 

Good idea. Thank you, Kazu.

> Thanks,
> Kazu
> 
>>
>> and, what I'm wondering is whether it doesn't need to take hugepages
>> into account such as this
>>
>> 392 if (pmd_pte & _PAGE_PSE)/* 2MB pages */
>> 393 return (pmd_pte & ENTRY_MASK & PMD_MASK) +
>> 394 (vaddr & ~PMD_MASK);
>> "arch/x86_64.c"
>>
>> Thanks,
>> Kazu
>>
>>
>>> pgbuf = cache_search(pgaddr, read_size);
>>> if (!pgbuf) {
>>> ++cache_miss;
>>> --
>>> 2.17.1
>>>
>>
>>
>>
>> ___
>> kexec mailing list
>> kexec@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
> 
> 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 2/2] Remove the memory encryption mask to obtain the true physical address

2019-01-24 Thread lijiang

On 2019-01-24 06:16, Kazuhito Hagio wrote:
> On 1/22/2019 3:03 AM, Lianbo Jiang wrote:
>> For AMD machine with SME feature, if SME is enabled in the first
>> kernel, the crashed kernel's page table(pgd/pud/pmd/pte) contains
>> the memory encryption mask, so makedumpfile needs to remove the
>> memory encryption mask to obtain the true physical address.
>>
>> Signed-off-by: Lianbo Jiang 
>> ---
>>  arch/x86_64.c  | 3 +++
>>  makedumpfile.c | 1 +
>>  2 files changed, 4 insertions(+)
>>
>> diff --git a/arch/x86_64.c b/arch/x86_64.c
>> index 537fb78..7651d36 100644
>> --- a/arch/x86_64.c
>> +++ b/arch/x86_64.c
>> @@ -346,6 +346,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>  return NOT_PADDR;
>>  }
>>  pud_paddr  = pgd & ENTRY_MASK;
>> +pud_paddr = pud_paddr & ~(NUMBER(sme_mask));
>>  }
>>
>>  /*
>> @@ -371,6 +372,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>   * Get PMD.
>>   */
>>  pmd_paddr  = pud_pte & ENTRY_MASK;
>> +pmd_paddr = pmd_paddr & ~(NUMBER(sme_mask));
>>  pmd_paddr += pmd_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, pmd_paddr, &pmd_pte, sizeof pmd_pte)) {
>>  ERRMSG("Can't get pmd_pte (pmd_paddr:%lx).\n", pmd_paddr);
>> @@ -391,6 +393,7 @@ __vtop4_x86_64(unsigned long vaddr, unsigned long 
>> pagetable)
>>   * Get PTE.
>>   */
>>  pte_paddr  = pmd_pte & ENTRY_MASK;
>> +pte_paddr = pte_paddr & ~(NUMBER(sme_mask));
>>  pte_paddr += pte_index(vaddr) * sizeof(unsigned long);
>>  if (!readmem(PADDR, pte_paddr, &pte, sizeof pte)) {
>>  ERRMSG("Can't get pte (pte_paddr:%lx).\n", pte_paddr);
>> diff --git a/makedumpfile.c b/makedumpfile.c
>> index a03aaa1..81c7bb4 100644
>> --- a/makedumpfile.c
>> +++ b/makedumpfile.c
>> @@ -977,6 +977,7 @@ next_page:
>>  read_size = MIN(info->page_size - PAGEOFFSET(paddr), size);
>>
>>  pgaddr = PAGEBASE(paddr);
>> +pgaddr = pgaddr & ~(NUMBER(sme_mask));
> 
> Since NUMBER(sme_mask) is initialized with -1 (NOT_FOUND_NUMBER),
> if the sme_mask is not in vmcoreinfo, ~(NUMBER(sme_mask)) will be 0.
> So the four lines added above need
> 
>   if (NUMBER(sme_mask) != NOT_FOUND_NUMBER)
> ...
> 

Thank you very much for pointing out my mistake.

I will improve it and post again.
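
The failure mode described above can be reproduced in isolation. The sketch below is a standalone illustration rather than makedumpfile code: NOT_FOUND_NUMBER_X is a local stand-in using the value -1 stated in the review, the vmcoreinfo value is passed as a plain parameter in place of NUMBER(sme_mask), and strip_unchecked() and strip_checked() are hypothetical names. A 64-bit unsigned long is assumed.

```c
/* Local stand-in for the sentinel; the review states its value is -1. */
#define NOT_FOUND_NUMBER_X (-1L)

/* Hypothetical helper: masks unconditionally, as the v1 patch does. */
unsigned long strip_unchecked(unsigned long paddr, long sme_mask)
{
	/* With sme_mask == -1, ~sme_mask is 0 and the whole address is lost. */
	return paddr & ~(unsigned long)sme_mask;
}

/* Hypothetical helper: masks only when vmcoreinfo supplied a value. */
unsigned long strip_checked(unsigned long paddr, long sme_mask)
{
	if (sme_mask != NOT_FOUND_NUMBER_X)
		paddr &= ~(unsigned long)sme_mask;

	return paddr;
}
```

When the value is absent, strip_unchecked() turns any address into 0, while strip_checked() returns it unchanged. When a mask is present (bit 47 is used in the checks purely as an example position), both remove the bit.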

> and, what I'm wondering is whether it doesn't need to take hugepages
> into account such as this
> 
> 392 if (pmd_pte & _PAGE_PSE)/* 2MB pages */
> 393 return (pmd_pte & ENTRY_MASK & PMD_MASK) +
> 394 (vaddr & ~PMD_MASK);
> "arch/x86_64.c"
> 

This is a good question. In theory, the huge page case should be modified
accordingly.

But makedumpfile still works well without this change, and I'm sure that
huge pages are enabled in the crashed kernel. This is very strange.

Thanks.
Lianbo
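
As a reference for the huge page case, the 2MB translation quoted from arch/x86_64.c could take the encryption bit into account roughly as follows. This is a sketch under assumptions only: ENTRY_MASK_X, PMD_SHIFT_X and PMD_MASK_X are simplified local stand-ins for the makedumpfile macros, bit 47 is just an example position for the encryption bit, and hugepage_paddr() is a hypothetical helper rather than a function in the tree. A 64-bit unsigned long is assumed.

```c
/* Simplified local stand-ins (assumptions); see makedumpfile.h for the real ones. */
#define ENTRY_MASK_X     0x000ffffffffff000UL       /* bits 12..51 of an entry */
#define PMD_SHIFT_X      21                         /* 2MB pages */
#define PMD_MASK_X       (~((1UL << PMD_SHIFT_X) - 1))
#define EXAMPLE_SME_MASK (1UL << 47)                /* example bit position only */

/*
 * Hypothetical helper: translate through a 2MB PMD entry.
 * The page base comes from the entry and the offset within the
 * 2MB page comes from the virtual address.
 */
unsigned long hugepage_paddr(unsigned long pmd_pte, unsigned long vaddr,
			     unsigned long entry_mask)
{
	return (pmd_pte & entry_mask & PMD_MASK_X) + (vaddr & ~PMD_MASK_X);
}
```

Passing ENTRY_MASK_X & ~EXAMPLE_SME_MASK as entry_mask yields the plain physical address. Passing ENTRY_MASK_X alone leaves bit 47 set in the result.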

> Thanks,
> Kazu
> 
> 
>>  pgbuf = cache_search(pgaddr, read_size);
>>  if (!pgbuf) {
>>  ++cache_miss;
>> --
>> 2.17.1
>>
> 
> 
