RE: [PATCH 1/2 v5] kdump: add the vmcoreinfo documentation
Hi, > -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Lianbo Jiang > Sent: Monday, January 7, 2019 10:48 AM > To: linux-kernel@vger.kernel.org > Cc: ke...@lists.infradead.org; t...@linutronix.de; mi...@redhat.com; > b...@alien8.de; x...@kernel.org; a...@linux-foundation.org; b...@redhat.com; > dyo...@redhat.com; linux-...@vger.kernel.org; k-ha...@ab.jp.nec.com; > ander...@redhat.com > Subject: [PATCH 1/2 v5] kdump: add the vmcoreinfo documentation > > This document lists some variables that export to vmcoreinfo, and briefly > describles what these variables indicate. It should be instructive for > many people who do not know the vmcoreinfo, and it also normalizes the I agree to this part, but > exported variables as a convention between kernel and use-space. I don't agree to this part. The meaning of each symbol is decided by each feature in the kernel, not by vmcoreinfo. I suspect anyone mistakenly understand this document is ABI enforcing each symbol works as described. We can change symbols and their meaning regardless of this document. Oh, I found this topic has already been discussed at v3, and you removed "ABI" in the patch description at v4. But it seems still confusing to me. I think the explicit description saying that this is for user-land tools, they treats each symbol as described, and the document never affect implementation of each kernel components, is necessary in e.g. "Purpose of this document" section? > > Suggested-by: Borislav Petkov > Signed-off-by: Lianbo Jiang > --- > Documentation/kdump/vmcoreinfo.txt | 500 + > 1 file changed, 500 insertions(+) > create mode 100644 Documentation/kdump/vmcoreinfo.txt > > diff --git a/Documentation/kdump/vmcoreinfo.txt > b/Documentation/kdump/vmcoreinfo.txt > new file mode 100644 > index ..8e444586b87b > --- /dev/null > +++ b/Documentation/kdump/vmcoreinfo.txt > @@ -0,0 +1,500 @@ > + > + VMCOREINFO > + > + > +=== > +What is the VMCOREINFO? > +=== > + > +VMCOREINFO is a special ELF note section. It contains various > +information from the kernel like structure size, page size, symbol > +values, field offsets, etc. These data are packed into an ELF note > +section and used by user-space tools like crash and makedumpfile to > +analyze a kernel's memory layout. > + > + > +Common variables > + > + > +init_uts_ns.name.release > + > + > +The version of the Linux kernel. Used to find the corresponding source > +code from which the kernel has been built. > + > +PAGE_SIZE > +- > + > +The size of a page. It is the smallest unit of data for memory > +management in kernel. It is usually 4096 bytes and a page is aligned > +on 4096 bytes. Used for computing page addresses. > + > +init_uts_ns > +--- > + > +This is the UTS namespace, which is used to isolate two specific > +elements of the system that relate to the uname(2) system call. The UTS > +namespace is named after the data structure used to store information > +returned by the uname(2) system call. > + > +User-space tools can get the kernel name, host name, kernel release > +number, kernel version, architecture name and OS type from it. > + > +node_online_map > +--- > + > +An array node_states[N_ONLINE] which represents the set of online node > +in a system, one bit position per node number. Used to keep track of > +which nodes are in the system and online. > + > +swapper_pg_dir > +- > + > +The global page directory pointer of the kernel. Used to translate > +virtual to physical addresses. > + > +_stext > +-- > + > +Defines the beginning of the text section. In general, _stext indicates > +the kernel start address. Used to convert a virtual address from the > +direct kernel map to a physical address. > + > +vmap_area_list > +-- > + > +Stores the virtual area list. makedumpfile can get the vmalloc start > +value from this variable. This value is necessary for vmalloc translation. > + > +mem_map > +--- > + > +Physical addresses are translated to struct pages by treating them as > +an index into the mem_map array. Right-shifting a physical address > +PAGE_SHIFT bits converts it into a page frame number which is an index > +into that mem_map array. > + > +Used to map an address to the corresponding struct page. > + > +contig_page_data > + > + > +Makedumpfile can get the pglist_data structure from this symbol, which > +is used to describe the memory layout. > + > +User-space tools use this to exclude free pages when dumping memory. > + > +mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map) >
RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> -Original Message- > From: Eric W. Biederman [mailto:ebied...@xmission.com] > Sent: Tuesday, June 5, 2018 11:03 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; > 't...@kernel.org' ; Okajima, Toshiyuki > ; linux-kernel@vger.kernel.org; > 'ebied...@aristanetworks.com' > Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > > ebied...@xmission.com (Eric W. Biederman) writes: > > > "Hatayama, Daisuke" writes: > > > >>> Can you test this and please verify it fixes your issue? > >> > >> I tried this patch on top of v4.17 but the system fails to boot > >> without detecting root disks by dracut like this: > [snip] > > >> OTOH, there's no issue on the pure v4.17 kernel. > >> > >> As above, ls /sys/module looks apparently good. But I guess any part of > >> behavior of getdentries() on sysfs must have changed, affecting the disk > >> detection... > > > > I haven't been able to reproduce this yet. My test system boots fine. > > Which fedora are you testing on? > > I reproduced something similar and fedora 28. So I think I have found > and fixed the issue. I believe I simply reversed the test at the end of > kernfs_dir_pos. AKA "<" instead of ">". Though too late, I used fedora 28 and RHEL7.5. I applied this fix to your patch and the system boot was successfully done. I'll start testing your patch from now on. > > I am going to see if I can test my changes more throughly on this side > and then repost. > > > > >>> fs/kernfs/dir.c | 109 > >>> ++-- > >>> 1 file changed, 67 insertions(+), 42 deletions(-) > >>> > >>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > >>> index 89d1dc19340b..8148b5fec48d 100644 > >>> --- a/fs/kernfs/dir.c > >>> +++ b/fs/kernfs/dir.c > >>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode > *inode, > >>> struct file *filp) > >>> return 0; > >>> } > >>> > >>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos) > >>> +{ > >>> + struct rb_node *node = rb_next(>rb); > >>> + return node ? rb_to_kn(node) : NULL; > >>> +} > >>> + > >>> static struct kernfs_node *kernfs_dir_pos(const void *ns, > >>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos) > >>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved) > >>> { > >>> - if (pos) { > >>> - int valid = kernfs_active(pos) && > >>> - pos->parent == parent && hash == pos->hash; > >>> - kernfs_put(pos); > >>> - if (!valid) > >>> - pos = NULL; > >>> - } > >>> - if (!pos && (hash > 1) && (hash < INT_MAX)) { > >>> - struct rb_node *node = parent->dir.children.rb_node; > >>> - while (node) { > >>> - pos = rb_to_kn(node); > >>> - > >>> - if (hash < pos->hash) > >>> - node = node->rb_left; > >>> - else if (hash > pos->hash) > >>> - node = node->rb_right; > >>> - else > >>> - break; > >>> + struct kernfs_node *pos; > >>> + struct rb_node *node; > >>> + unsigned int hash; > >>> + const char *name = ""; > >>> + > >>> + /* Is off a valid name hash? */ > >>> + if ((off < 2) || (off >= INT_MAX)) > >>> + return NULL; > >>> + hash = off; > >>> + > >>> + /* Is the saved position usable? */ > >>> + if (saved) { > >>> + /* Proper parent and hash? */ > >>> + if ((parent != saved->parent) || (saved->hash != hash)) { > >>> + saved = NULL; > >> > >> name is uninitialized in this path. > > > > It is. name is initialized to "" see above. > > > >>> + } else { > >>> + if (kernfs_active(saved)) > >>> + return saved; > >>> + name = saved->name; > >>> } > >>> } > >>&
RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> -Original Message- > From: Eric W. Biederman [mailto:ebied...@xmission.com] > Sent: Tuesday, June 5, 2018 11:03 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; > 't...@kernel.org' ; Okajima, Toshiyuki > ; linux-kernel@vger.kernel.org; > 'ebied...@aristanetworks.com' > Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > > ebied...@xmission.com (Eric W. Biederman) writes: > > > "Hatayama, Daisuke" writes: > > > >>> Can you test this and please verify it fixes your issue? > >> > >> I tried this patch on top of v4.17 but the system fails to boot > >> without detecting root disks by dracut like this: > [snip] > > >> OTOH, there's no issue on the pure v4.17 kernel. > >> > >> As above, ls /sys/module looks apparently good. But I guess any part of > >> behavior of getdentries() on sysfs must have changed, affecting the disk > >> detection... > > > > I haven't been able to reproduce this yet. My test system boots fine. > > Which fedora are you testing on? > > I reproduced something similar and fedora 28. So I think I have found > and fixed the issue. I believe I simply reversed the test at the end of > kernfs_dir_pos. AKA "<" instead of ">". Though too late, I used fedora 28 and RHEL7.5. I applied this fix to your patch and the system boot was successfully done. I'll start testing your patch from now on. > > I am going to see if I can test my changes more throughly on this side > and then repost. > > > > >>> fs/kernfs/dir.c | 109 > >>> ++-- > >>> 1 file changed, 67 insertions(+), 42 deletions(-) > >>> > >>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > >>> index 89d1dc19340b..8148b5fec48d 100644 > >>> --- a/fs/kernfs/dir.c > >>> +++ b/fs/kernfs/dir.c > >>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode > *inode, > >>> struct file *filp) > >>> return 0; > >>> } > >>> > >>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos) > >>> +{ > >>> + struct rb_node *node = rb_next(>rb); > >>> + return node ? rb_to_kn(node) : NULL; > >>> +} > >>> + > >>> static struct kernfs_node *kernfs_dir_pos(const void *ns, > >>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos) > >>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved) > >>> { > >>> - if (pos) { > >>> - int valid = kernfs_active(pos) && > >>> - pos->parent == parent && hash == pos->hash; > >>> - kernfs_put(pos); > >>> - if (!valid) > >>> - pos = NULL; > >>> - } > >>> - if (!pos && (hash > 1) && (hash < INT_MAX)) { > >>> - struct rb_node *node = parent->dir.children.rb_node; > >>> - while (node) { > >>> - pos = rb_to_kn(node); > >>> - > >>> - if (hash < pos->hash) > >>> - node = node->rb_left; > >>> - else if (hash > pos->hash) > >>> - node = node->rb_right; > >>> - else > >>> - break; > >>> + struct kernfs_node *pos; > >>> + struct rb_node *node; > >>> + unsigned int hash; > >>> + const char *name = ""; > >>> + > >>> + /* Is off a valid name hash? */ > >>> + if ((off < 2) || (off >= INT_MAX)) > >>> + return NULL; > >>> + hash = off; > >>> + > >>> + /* Is the saved position usable? */ > >>> + if (saved) { > >>> + /* Proper parent and hash? */ > >>> + if ((parent != saved->parent) || (saved->hash != hash)) { > >>> + saved = NULL; > >> > >> name is uninitialized in this path. > > > > It is. name is initialized to "" see above. > > > >>> + } else { > >>> + if (kernfs_active(saved)) > >>> + return saved; > >>> + name = saved->name; > >>> } > >>> } > >>&
RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> -Original Message- > From: Eric W. Biederman [mailto:ebied...@xmission.com] > Sent: Monday, June 4, 2018 11:45 PM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; > 't...@kernel.org' ; Okajima, Toshiyuki > ; linux-kernel@vger.kernel.org; > 'ebied...@aristanetworks.com' > Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > > "Hatayama, Daisuke" writes: > > > Hello, > > > > Thanks for this patch, Eric. > > > >> -Original Message- > >> From: Eric W. Biederman [mailto:ebied...@xmission.com] > >> Sent: Monday, June 4, 2018 3:51 AM > >> To: Hatayama, Daisuke > >> Cc: 'gre...@linuxfoundation.org' ; > >> 't...@kernel.org' ; Okajima, Toshiyuki > >> ; linux-kernel@vger.kernel.org; > >> 'ebied...@aristanetworks.com' > >> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > >> > >> > >> Over the years two bugs have crept into the code that handled the last > >> returned kernfs directory being deleted, and handled general seeking > >> in kernfs directories. The result is that reading the /sys/module > >> directory while moduled are loading and unloading will sometimes > >> skip over a module that was present for the entire duration of > >> the directory read. > >> > >> The first bug is that kernfs_dir_next_pos was advancing the position > >> in the directory even when kernfs_dir_pos had found the starting > >> kernfs directory entry was not usable. In this case kernfs_dir_pos is > >> guaranteed to return the directory entry past the starting directory > >> entry, making it so that advancing the directory position is wrong. > >> > >> The second bug is that kernfs_dir_pos when the saved kernfs_node > >> did not validate, was not always returning a directory after > >> the the target position, but was sometimes returning the directory > >> entry just before the target position. > >> > >> To correct these issues and to make the code easily readable and > >> comprehensible several cleanups are made. > >> > >> - A function kernfs_dir_next is factored out to make it straight-forward > >> to find the next node in a kernfs directory. > >> > >> - A function kernfs_dir_skip_pos is factored out to skip over directory > >> entries that should not be shown to userspace. This function is called > >> from kernfs_fop_readdir directly removing the complication of calling > >> it from the other directory advancement functions. > >> > >> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos > >> to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved > >> and ensuring the directory position advancment functions can reference > >> the saved node without complications. > >> > >> - The function kernfs_dir_next_pos is modified to only advance the > >> directory > >> position in the common case when kernfs_dir_pos validates and returns > >> the starting kernfs node. For all other cases kernfs_dir_pos is > guaranteed > >> to return a directory position in advance of the starting directory > position. > >> > >> - kernfs_dir_pos is given a significant make over. It's essense remains > the > >> same but some siginficant changes were made. > >> > >> + The offset is validated to be a valid hash value at the very beginning. > >> The validation is updated to handle the fact that neither 0 nor 1 are > >> valid hash values. > >> > >> + An extra test is added at the end of the rbtree lookup to ensure > >> the returned position is at or beyond the target position. > >> > >> + kernfs_name_compare is used during the rbtree lookup instead of just > >> comparing > >> the hash. This allows the the passed namespace parameter to be used, > and > >> if it is available from the saved entry the old saved name as well. > >> > >> + The validation of the saved kernfs node now checks for two cases. > >> If the saved node is returnable. > >> If the name of the saved node is usable during lookup. > >> > >> In net this should result in a easier to understand, and more correct > >> version of directory traversal for kernfs directories. > >> > >> Reported-by: "Hatayama, Daisuke" > >> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number
RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> -Original Message- > From: Eric W. Biederman [mailto:ebied...@xmission.com] > Sent: Monday, June 4, 2018 11:45 PM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; > 't...@kernel.org' ; Okajima, Toshiyuki > ; linux-kernel@vger.kernel.org; > 'ebied...@aristanetworks.com' > Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > > "Hatayama, Daisuke" writes: > > > Hello, > > > > Thanks for this patch, Eric. > > > >> -Original Message- > >> From: Eric W. Biederman [mailto:ebied...@xmission.com] > >> Sent: Monday, June 4, 2018 3:51 AM > >> To: Hatayama, Daisuke > >> Cc: 'gre...@linuxfoundation.org' ; > >> 't...@kernel.org' ; Okajima, Toshiyuki > >> ; linux-kernel@vger.kernel.org; > >> 'ebied...@aristanetworks.com' > >> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > >> > >> > >> Over the years two bugs have crept into the code that handled the last > >> returned kernfs directory being deleted, and handled general seeking > >> in kernfs directories. The result is that reading the /sys/module > >> directory while moduled are loading and unloading will sometimes > >> skip over a module that was present for the entire duration of > >> the directory read. > >> > >> The first bug is that kernfs_dir_next_pos was advancing the position > >> in the directory even when kernfs_dir_pos had found the starting > >> kernfs directory entry was not usable. In this case kernfs_dir_pos is > >> guaranteed to return the directory entry past the starting directory > >> entry, making it so that advancing the directory position is wrong. > >> > >> The second bug is that kernfs_dir_pos when the saved kernfs_node > >> did not validate, was not always returning a directory after > >> the the target position, but was sometimes returning the directory > >> entry just before the target position. > >> > >> To correct these issues and to make the code easily readable and > >> comprehensible several cleanups are made. > >> > >> - A function kernfs_dir_next is factored out to make it straight-forward > >> to find the next node in a kernfs directory. > >> > >> - A function kernfs_dir_skip_pos is factored out to skip over directory > >> entries that should not be shown to userspace. This function is called > >> from kernfs_fop_readdir directly removing the complication of calling > >> it from the other directory advancement functions. > >> > >> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos > >> to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved > >> and ensuring the directory position advancment functions can reference > >> the saved node without complications. > >> > >> - The function kernfs_dir_next_pos is modified to only advance the > >> directory > >> position in the common case when kernfs_dir_pos validates and returns > >> the starting kernfs node. For all other cases kernfs_dir_pos is > guaranteed > >> to return a directory position in advance of the starting directory > position. > >> > >> - kernfs_dir_pos is given a significant make over. It's essense remains > the > >> same but some siginficant changes were made. > >> > >> + The offset is validated to be a valid hash value at the very beginning. > >> The validation is updated to handle the fact that neither 0 nor 1 are > >> valid hash values. > >> > >> + An extra test is added at the end of the rbtree lookup to ensure > >> the returned position is at or beyond the target position. > >> > >> + kernfs_name_compare is used during the rbtree lookup instead of just > >> comparing > >> the hash. This allows the the passed namespace parameter to be used, > and > >> if it is available from the saved entry the old saved name as well. > >> > >> + The validation of the saved kernfs node now checks for two cases. > >> If the saved node is returnable. > >> If the name of the saved node is usable during lookup. > >> > >> In net this should result in a easier to understand, and more correct > >> version of directory traversal for kernfs directories. > >> > >> Reported-by: "Hatayama, Daisuke" > >> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number
RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> -Original Message- > From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org' > Sent: Saturday, June 2, 2018 2:07 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; Okajima, > Toshiyuki ; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > > Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip > > Hello, > > On Fri, Jun 01, 2018 at 09:25:32AM +, Hatayama, Daisuke wrote: > > kernfs_dir_pos() checks if a kernfs_node object given as one of its > > arguments is still active and if so returns it, or returns a > > kernfs_node object that is most equal (possibly smaller and larger) to > > the given object. > > Sometimes they're duplicate operations tho, which is exactly the bug > the posted patch is trying to fix. What I'm suggesting is instead of > leaving both instances and skipping one conditionally, put them in one > place and trigger only when necessary. The sequence of operations > would be exactly the same. The only difference is how the code is > organized. > I see and I think Eric's patch is written as you suggest very well. > > kernfs_dir_next_pos() returns a kernfs_node object that is next to the > > object given by kernfs_dir_pos(). > > > > Two functions does different things and both need to skip inactive > > nodes. I don't think it natural to remove the skip only from > > kernfs_dir_pos(). > > > > OTOH, throughout getdents(), there is no case that the kernfs_node > > object given to kernfs_dir_pos() is used afterwards in the > > processing. So, is it enough to provide kernfs_dir_next_pos() only? > > Then, the skip code is now not duplicated. > > > > The patch below is my thought. How do you think? > > > > But note that this patch has some bug so that system boot get hang > > without detecting root filesystem disk :) I'm debugging this now. > > I haven't looked into the code that closely but given that we had > cases where both skippings were fine and not, the condition is likely > gonna be a bit tricker? I agree to this version looks a bit tricker. But I think once the skipping code is separated as Eric's patch, it would be resolved naturally. > > Thanks. > > -- > tejun >
RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> -Original Message- > From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org' > Sent: Saturday, June 2, 2018 2:07 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; Okajima, > Toshiyuki ; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > > Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip > > Hello, > > On Fri, Jun 01, 2018 at 09:25:32AM +, Hatayama, Daisuke wrote: > > kernfs_dir_pos() checks if a kernfs_node object given as one of its > > arguments is still active and if so returns it, or returns a > > kernfs_node object that is most equal (possibly smaller and larger) to > > the given object. > > Sometimes they're duplicate operations tho, which is exactly the bug > the posted patch is trying to fix. What I'm suggesting is instead of > leaving both instances and skipping one conditionally, put them in one > place and trigger only when necessary. The sequence of operations > would be exactly the same. The only difference is how the code is > organized. > I see and I think Eric's patch is written as you suggest very well. > > kernfs_dir_next_pos() returns a kernfs_node object that is next to the > > object given by kernfs_dir_pos(). > > > > Two functions does different things and both need to skip inactive > > nodes. I don't think it natural to remove the skip only from > > kernfs_dir_pos(). > > > > OTOH, throughout getdents(), there is no case that the kernfs_node > > object given to kernfs_dir_pos() is used afterwards in the > > processing. So, is it enough to provide kernfs_dir_next_pos() only? > > Then, the skip code is now not duplicated. > > > > The patch below is my thought. How do you think? > > > > But note that this patch has some bug so that system boot get hang > > without detecting root filesystem disk :) I'm debugging this now. > > I haven't looked into the code that closely but given that we had > cases where both skippings were fine and not, the condition is likely > gonna be a bit tricker? I agree to this version looks a bit tricker. But I think once the skipping code is separated as Eric's patch, it would be resolved naturally. > > Thanks. > > -- > tejun >
RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
Hello, Thanks for this patch, Eric. > -Original Message- > From: Eric W. Biederman [mailto:ebied...@xmission.com] > Sent: Monday, June 4, 2018 3:51 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; > 't...@kernel.org' ; Okajima, Toshiyuki > ; linux-kernel@vger.kernel.org; > 'ebied...@aristanetworks.com' > Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > > > Over the years two bugs have crept into the code that handled the last > returned kernfs directory being deleted, and handled general seeking > in kernfs directories. The result is that reading the /sys/module > directory while moduled are loading and unloading will sometimes > skip over a module that was present for the entire duration of > the directory read. > > The first bug is that kernfs_dir_next_pos was advancing the position > in the directory even when kernfs_dir_pos had found the starting > kernfs directory entry was not usable. In this case kernfs_dir_pos is > guaranteed to return the directory entry past the starting directory > entry, making it so that advancing the directory position is wrong. > > The second bug is that kernfs_dir_pos when the saved kernfs_node > did not validate, was not always returning a directory after > the the target position, but was sometimes returning the directory > entry just before the target position. > > To correct these issues and to make the code easily readable and > comprehensible several cleanups are made. > > - A function kernfs_dir_next is factored out to make it straight-forward > to find the next node in a kernfs directory. > > - A function kernfs_dir_skip_pos is factored out to skip over directory > entries that should not be shown to userspace. This function is called > from kernfs_fop_readdir directly removing the complication of calling > it from the other directory advancement functions. > > - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos > to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved > and ensuring the directory position advancment functions can reference > the saved node without complications. > > - The function kernfs_dir_next_pos is modified to only advance the directory > position in the common case when kernfs_dir_pos validates and returns > the starting kernfs node. For all other cases kernfs_dir_pos is guaranteed > to return a directory position in advance of the starting directory > position. > > - kernfs_dir_pos is given a significant make over. It's essense remains the > same but some siginficant changes were made. > > + The offset is validated to be a valid hash value at the very beginning. > The validation is updated to handle the fact that neither 0 nor 1 are > valid hash values. > > + An extra test is added at the end of the rbtree lookup to ensure > the returned position is at or beyond the target position. > > + kernfs_name_compare is used during the rbtree lookup instead of just > comparing > the hash. This allows the the passed namespace parameter to be used, and > if it is available from the saved entry the old saved name as well. > > + The validation of the saved kernfs node now checks for two cases. > If the saved node is returnable. > If the name of the saved node is usable during lookup. > > In net this should result in a easier to understand, and more correct > version of directory traversal for kernfs directories. > > Reported-by: "Hatayama, Daisuke" > Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup") > Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir > scalability v2") > Signed-off-by: "Eric W. Biederman" > --- > > Can you test this and please verify it fixes your issue? > I tried this patch on top of v4.17 but the system fails to boot without detecting root disks by dracut like this: [ 196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 199.945999] dracut-initqueue[324]: Warning: dracut-initque
RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
Hello, Thanks for this patch, Eric. > -Original Message- > From: Eric W. Biederman [mailto:ebied...@xmission.com] > Sent: Monday, June 4, 2018 3:51 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; > 't...@kernel.org' ; Okajima, Toshiyuki > ; linux-kernel@vger.kernel.org; > 'ebied...@aristanetworks.com' > Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks. > > > Over the years two bugs have crept into the code that handled the last > returned kernfs directory being deleted, and handled general seeking > in kernfs directories. The result is that reading the /sys/module > directory while moduled are loading and unloading will sometimes > skip over a module that was present for the entire duration of > the directory read. > > The first bug is that kernfs_dir_next_pos was advancing the position > in the directory even when kernfs_dir_pos had found the starting > kernfs directory entry was not usable. In this case kernfs_dir_pos is > guaranteed to return the directory entry past the starting directory > entry, making it so that advancing the directory position is wrong. > > The second bug is that kernfs_dir_pos when the saved kernfs_node > did not validate, was not always returning a directory after > the the target position, but was sometimes returning the directory > entry just before the target position. > > To correct these issues and to make the code easily readable and > comprehensible several cleanups are made. > > - A function kernfs_dir_next is factored out to make it straight-forward > to find the next node in a kernfs directory. > > - A function kernfs_dir_skip_pos is factored out to skip over directory > entries that should not be shown to userspace. This function is called > from kernfs_fop_readdir directly removing the complication of calling > it from the other directory advancement functions. > > - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos > to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved > and ensuring the directory position advancment functions can reference > the saved node without complications. > > - The function kernfs_dir_next_pos is modified to only advance the directory > position in the common case when kernfs_dir_pos validates and returns > the starting kernfs node. For all other cases kernfs_dir_pos is guaranteed > to return a directory position in advance of the starting directory > position. > > - kernfs_dir_pos is given a significant make over. It's essense remains the > same but some siginficant changes were made. > > + The offset is validated to be a valid hash value at the very beginning. > The validation is updated to handle the fact that neither 0 nor 1 are > valid hash values. > > + An extra test is added at the end of the rbtree lookup to ensure > the returned position is at or beyond the target position. > > + kernfs_name_compare is used during the rbtree lookup instead of just > comparing > the hash. This allows the the passed namespace parameter to be used, and > if it is available from the saved entry the old saved name as well. > > + The validation of the saved kernfs node now checks for two cases. > If the saved node is returnable. > If the name of the saved node is usable during lookup. > > In net this should result in a easier to understand, and more correct > version of directory traversal for kernfs directories. > > Reported-by: "Hatayama, Daisuke" > Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup") > Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir > scalability v2") > Signed-off-by: "Eric W. Biederman" > --- > > Can you test this and please verify it fixes your issue? > I tried this patch on top of v4.17 but the system fails to boot without detecting root disks by dracut like this: [ 196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts [ 199.945999] dracut-initqueue[324]: Warning: dracut-initque
RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> -Original Message- > From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org' > Sent: Wednesday, May 30, 2018 1:26 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; Okajima, > Toshiyuki ; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > > Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip > > Hello, > > On Mon, May 28, 2018 at 12:54:03PM +, Hatayama, Daisuke wrote: > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > > index 89d1dc1..3aeeb7a 100644 > > --- a/fs/kernfs/dir.c > > +++ b/fs/kernfs/dir.c > > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode > > *inode, > struct file *filp) > > static struct kernfs_node *kernfs_dir_next_pos(const void *ns, > > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) > > { > > + struct kernfs_node *orig = pos; > > + > > pos = kernfs_dir_pos(ns, parent, ino, pos); > > - if (pos) { > > + if (pos && kernfs_sd_compare(pos, orig) <= 0) { > > Hmm... the code seems a bit unintuitive to me and I wonder whether > it's because there are two identical skipping loops in > kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to > selectively disable one of them. Wouldn't it make more sense to get > rid of it from kernfs_dir_pos() and skip explicitly only when > necessary? > kernfs_dir_pos() checks if a kernfs_node object given as one of its arguments is still active and if so returns it, or returns a kernfs_node object that is most equal (possibly smaller and larger) to the given object. kernfs_dir_next_pos() returns a kernfs_node object that is next to the object given by kernfs_dir_pos(). Two functions does different things and both need to skip inactive nodes. I don't think it natural to remove the skip only from kernfs_dir_pos(). OTOH, throughout getdents(), there is no case that the kernfs_node object given to kernfs_dir_pos() is used afterwards in the processing. So, is it enough to provide kernfs_dir_next_pos() only? Then, the skip code is now not duplicated. The patch below is my thought. How do you think? But note that this patch has some bug so that system boot get hang without detecting root filesystem disk :) I'm debugging this now. diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..140706f 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1584,9 +1584,11 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) return 0; } -static struct kernfs_node *kernfs_dir_pos(const void *ns, +static struct kernfs_node *kernfs_dir_next_pos(const void *ns, struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos) { + struct kernfs_node *orig = pos; + if (pos) { int valid = kernfs_active(pos) && pos->parent == parent && hash == pos->hash; @@ -1608,7 +1610,9 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns, } } /* Skip over entries which are dying/dead or in the wrong namespace */ - while (pos && (!kernfs_active(pos) || pos->ns != ns)) { + while (pos && (!kernfs_active(pos) || + (!orig || kernfs_sd_compare(pos, orig) <= 0) || + pos->ns != ns)) { struct rb_node *node = rb_next(>rb); if (!node) pos = NULL; @@ -1618,22 +1622,6 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns, return pos; } -static struct kernfs_node *kernfs_dir_next_pos(const void *ns, - struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) -{ - pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { - do { - struct rb_node *node = rb_next(>rb); - if (!node) - pos = NULL; - else - pos = rb_to_kn(node); - } while (pos && (!kernfs_active(pos) || pos->ns != ns)); - } - return pos; -} - static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) { struct dentry *dentry = file->f_path.dentry; @@ -1648,7 +1636,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) if (kernfs_ns_enabled(parent)) ns = kernfs_info(dentry->d_sb)->ns; - for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos); + for (pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos); pos; pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) { const char *name = pos->name; Thanks. HATAYAMA, Daisuke
RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> -Original Message- > From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org' > Sent: Wednesday, May 30, 2018 1:26 AM > To: Hatayama, Daisuke > Cc: 'gre...@linuxfoundation.org' ; Okajima, > Toshiyuki ; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > > Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip > > Hello, > > On Mon, May 28, 2018 at 12:54:03PM +, Hatayama, Daisuke wrote: > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > > index 89d1dc1..3aeeb7a 100644 > > --- a/fs/kernfs/dir.c > > +++ b/fs/kernfs/dir.c > > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode > > *inode, > struct file *filp) > > static struct kernfs_node *kernfs_dir_next_pos(const void *ns, > > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) > > { > > + struct kernfs_node *orig = pos; > > + > > pos = kernfs_dir_pos(ns, parent, ino, pos); > > - if (pos) { > > + if (pos && kernfs_sd_compare(pos, orig) <= 0) { > > Hmm... the code seems a bit unintuitive to me and I wonder whether > it's because there are two identical skipping loops in > kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to > selectively disable one of them. Wouldn't it make more sense to get > rid of it from kernfs_dir_pos() and skip explicitly only when > necessary? > kernfs_dir_pos() checks if a kernfs_node object given as one of its arguments is still active and if so returns it, or returns a kernfs_node object that is most equal (possibly smaller and larger) to the given object. kernfs_dir_next_pos() returns a kernfs_node object that is next to the object given by kernfs_dir_pos(). Two functions does different things and both need to skip inactive nodes. I don't think it natural to remove the skip only from kernfs_dir_pos(). OTOH, throughout getdents(), there is no case that the kernfs_node object given to kernfs_dir_pos() is used afterwards in the processing. So, is it enough to provide kernfs_dir_next_pos() only? Then, the skip code is now not duplicated. The patch below is my thought. How do you think? But note that this patch has some bug so that system boot get hang without detecting root filesystem disk :) I'm debugging this now. diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..140706f 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1584,9 +1584,11 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) return 0; } -static struct kernfs_node *kernfs_dir_pos(const void *ns, +static struct kernfs_node *kernfs_dir_next_pos(const void *ns, struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos) { + struct kernfs_node *orig = pos; + if (pos) { int valid = kernfs_active(pos) && pos->parent == parent && hash == pos->hash; @@ -1608,7 +1610,9 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns, } } /* Skip over entries which are dying/dead or in the wrong namespace */ - while (pos && (!kernfs_active(pos) || pos->ns != ns)) { + while (pos && (!kernfs_active(pos) || + (!orig || kernfs_sd_compare(pos, orig) <= 0) || + pos->ns != ns)) { struct rb_node *node = rb_next(>rb); if (!node) pos = NULL; @@ -1618,22 +1622,6 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns, return pos; } -static struct kernfs_node *kernfs_dir_next_pos(const void *ns, - struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) -{ - pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { - do { - struct rb_node *node = rb_next(>rb); - if (!node) - pos = NULL; - else - pos = rb_to_kn(node); - } while (pos && (!kernfs_active(pos) || pos->ns != ns)); - } - return pos; -} - static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) { struct dentry *dentry = file->f_path.dentry; @@ -1648,7 +1636,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) if (kernfs_ns_enabled(parent)) ns = kernfs_info(dentry->d_sb)->ns; - for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos); + for (pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos); pos; pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) { const char *name = pos->name; Thanks. HATAYAMA, Daisuke
RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
Hi, I attach a reproducer for this issue. Expand repro.tar.gz and run two commands: # make # ./repro.sh Then, repro.sh will terminate when the issue is reproduced. In this reproducer, we prepare AtvbAC0jwH.ko and U1cN9ZbARQ.ko having the same hash value. Then, install U1cN9ZbARQ.ko and then repeating installing and removing AtvbAC0jwH.ko over and over. Note that AtvbAC0jwH is smaller than U1cN9ZbARQ in the string comparison. The issue is that output of ls -1 /sys/module contains NO U1cN9ZbARQ.ko. I found a pair of AtvbAC0jwH and U1cN9ZbARQ using kernhash.c, retrieved from fs/kernfs/dirs.c, contained in repro.tar.gz like: # ./kernfshash AtvbAC0jwH U1cN9ZbARQ 6b2609c5 > -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke > Sent: Monday, May 28, 2018 9:54 PM > To: 'gre...@linuxfoundation.org' <gre...@linuxfoundation.org>; > 't...@kernel.org' <t...@kernel.org> > Cc: Okajima, Toshiyuki <toshi.okaj...@jp.fujitsu.com>; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > <ebied...@aristanetworks.com> > Subject: [RESEND PATCH v2] kernfs: fix dentry unexpected skip > > kernfs_dir_next_pos() overlooks the situation that the dentry > corresponding to a given pos object has already been inactive. Hence, > when kernfs_dir_pos() returns the dentry with a hash value larger than > the original one, kernfs_dir_next_pos() returns the dentry next to the > one returned by kernfs_dir_pos(). As a result, the dentry returned by > kernfs_dir_pos() is skipped. > > To fix this issue, try to find a next node only when the returned > object is less than or equal to the original one. > > Note that this implementation focuses on getting guarantee that the > existing nodes are never skipped, not interested in the other nodes > that are added or removed during the process. > > We found this issue during a stress test that repeatedly reads > /sys/module directory to get a list of the currently loaded kernel > modules while repeatedly loading and unloading kernel modules > simultaneously. > > v2: Fix the case where nodes with the same hash but with the name > larger than the original node could still be skipped. Use > kernfs_sd_compare() to compare kernfs_node objects. Imporove patch > description. > > Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> > Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com> > Cc: Eric W. Biederman <ebied...@aristanetworks.com> > --- > fs/kernfs/dir.c |4 +++- > 1 files changed, 3 insertions(+), 1 deletions(-) > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > index 89d1dc1..3aeeb7a 100644 > --- a/fs/kernfs/dir.c > +++ b/fs/kernfs/dir.c > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, > struct file *filp) > static struct kernfs_node *kernfs_dir_next_pos(const void *ns, > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) > { > + struct kernfs_node *orig = pos; > + > pos = kernfs_dir_pos(ns, parent, ino, pos); > - if (pos) { > + if (pos && kernfs_sd_compare(pos, orig) <= 0) { > do { > struct rb_node *node = rb_next(>rb); > if (!node) > -- > 1.7.1 > > > > > > repro.tar.gz Description: repro.tar.gz
RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
Hi, I attach a reproducer for this issue. Expand repro.tar.gz and run two commands: # make # ./repro.sh Then, repro.sh will terminate when the issue is reproduced. In this reproducer, we prepare AtvbAC0jwH.ko and U1cN9ZbARQ.ko having the same hash value. Then, install U1cN9ZbARQ.ko and then repeating installing and removing AtvbAC0jwH.ko over and over. Note that AtvbAC0jwH is smaller than U1cN9ZbARQ in the string comparison. The issue is that output of ls -1 /sys/module contains NO U1cN9ZbARQ.ko. I found a pair of AtvbAC0jwH and U1cN9ZbARQ using kernhash.c, retrieved from fs/kernfs/dirs.c, contained in repro.tar.gz like: # ./kernfshash AtvbAC0jwH U1cN9ZbARQ 6b2609c5 > -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke > Sent: Monday, May 28, 2018 9:54 PM > To: 'gre...@linuxfoundation.org' ; > 't...@kernel.org' > Cc: Okajima, Toshiyuki ; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > > Subject: [RESEND PATCH v2] kernfs: fix dentry unexpected skip > > kernfs_dir_next_pos() overlooks the situation that the dentry > corresponding to a given pos object has already been inactive. Hence, > when kernfs_dir_pos() returns the dentry with a hash value larger than > the original one, kernfs_dir_next_pos() returns the dentry next to the > one returned by kernfs_dir_pos(). As a result, the dentry returned by > kernfs_dir_pos() is skipped. > > To fix this issue, try to find a next node only when the returned > object is less than or equal to the original one. > > Note that this implementation focuses on getting guarantee that the > existing nodes are never skipped, not interested in the other nodes > that are added or removed during the process. > > We found this issue during a stress test that repeatedly reads > /sys/module directory to get a list of the currently loaded kernel > modules while repeatedly loading and unloading kernel modules > simultaneously. > > v2: Fix the case where nodes with the same hash but with the name > larger than the original node could still be skipped. Use > kernfs_sd_compare() to compare kernfs_node objects. Imporove patch > description. > > Signed-off-by: HATAYAMA Daisuke > Suggested-by: Toshiyuki Okajima > Cc: Eric W. Biederman > --- > fs/kernfs/dir.c |4 +++- > 1 files changed, 3 insertions(+), 1 deletions(-) > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > index 89d1dc1..3aeeb7a 100644 > --- a/fs/kernfs/dir.c > +++ b/fs/kernfs/dir.c > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, > struct file *filp) > static struct kernfs_node *kernfs_dir_next_pos(const void *ns, > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) > { > + struct kernfs_node *orig = pos; > + > pos = kernfs_dir_pos(ns, parent, ino, pos); > - if (pos) { > + if (pos && kernfs_sd_compare(pos, orig) <= 0) { > do { > struct rb_node *node = rb_next(>rb); > if (!node) > -- > 1.7.1 > > > > > > repro.tar.gz Description: repro.tar.gz
[RESEND PATCH v2] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object is less than or equal to the original one. Note that this implementation focuses on getting guarantee that the existing nodes are never skipped, not interested in the other nodes that are added or removed during the process. We found this issue during a stress test that repeatedly reads /sys/module directory to get a list of the currently loaded kernel modules while repeatedly loading and unloading kernel modules simultaneously. v2: Fix the case where nodes with the same hash but with the name larger than the original node could still be skipped. Use kernfs_sd_compare() to compare kernfs_node objects. Imporove patch description. Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com> Cc: Eric W. Biederman <ebied...@aristanetworks.com> --- fs/kernfs/dir.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..3aeeb7a 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) static struct kernfs_node *kernfs_dir_next_pos(const void *ns, struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { + struct kernfs_node *orig = pos; + pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && kernfs_sd_compare(pos, orig) <= 0) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
[RESEND PATCH v2] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object is less than or equal to the original one. Note that this implementation focuses on getting guarantee that the existing nodes are never skipped, not interested in the other nodes that are added or removed during the process. We found this issue during a stress test that repeatedly reads /sys/module directory to get a list of the currently loaded kernel modules while repeatedly loading and unloading kernel modules simultaneously. v2: Fix the case where nodes with the same hash but with the name larger than the original node could still be skipped. Use kernfs_sd_compare() to compare kernfs_node objects. Imporove patch description. Signed-off-by: HATAYAMA Daisuke Suggested-by: Toshiyuki Okajima Cc: Eric W. Biederman --- fs/kernfs/dir.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..3aeeb7a 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) static struct kernfs_node *kernfs_dir_next_pos(const void *ns, struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { + struct kernfs_node *orig = pos; + pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && kernfs_sd_compare(pos, orig) <= 0) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
[PATCH v2] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object is less than or equal to the original one. Note that this implementation focuses on getting guarantee that the existing nodes are never skipped, not interested in the other nodes that are added or removed during the process. We found this issue during a stress test that repeatedly reads /sys/module directory to get a list of the currently loaded kernel modules while repeatedly loading and unloading kernel modules simultaneously. v2: Fix the case where nodes with the same hash but with the name larger than the original node could still be skipped. Use kernfs_sd_compare() to compare kernfs_node objects. Imporove patch description. Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com> Cc: Eric W. Biederman <ebied...@aristanetworks.com> --- fs/kernfs/dir.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..3aeeb7a 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) static struct kernfs_node *kernfs_dir_next_pos(const void *ns, struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { + struct kernfs_node *orig = pos; + pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && kernfs_sd_compare(pos, orig) <= 0) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
[PATCH v2] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object is less than or equal to the original one. Note that this implementation focuses on getting guarantee that the existing nodes are never skipped, not interested in the other nodes that are added or removed during the process. We found this issue during a stress test that repeatedly reads /sys/module directory to get a list of the currently loaded kernel modules while repeatedly loading and unloading kernel modules simultaneously. v2: Fix the case where nodes with the same hash but with the name larger than the original node could still be skipped. Use kernfs_sd_compare() to compare kernfs_node objects. Imporove patch description. Signed-off-by: HATAYAMA Daisuke Suggested-by: Toshiyuki Okajima Cc: Eric W. Biederman --- fs/kernfs/dir.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..3aeeb7a 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) static struct kernfs_node *kernfs_dir_next_pos(const void *ns, struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { + struct kernfs_node *orig = pos; + pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && kernfs_sd_compare(pos, orig) <= 0) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
RE: [PATCH] kernfs: fix dentry unexpected skip
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke > Sent: Saturday, May 19, 2018 12:43 AM > To: 'gre...@linuxfoundation.org' <gre...@linuxfoundation.org> > Cc: Okajima, Toshiyuki/岡嶋 寿行 <toshi.okaj...@jp.fujitsu.com>; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > <ebied...@aristanetworks.com> > Subject: [PATCH] kernfs: fix dentry unexpected skip > > kernfs_dir_next_pos() overlooks the situation that the dentry > corresponding to a given pos object has already been inactive. Hence, > when kernfs_dir_pos() returns the dentry with a hash value larger than > the original one, kernfs_dir_next_pos() returns the dentry next to the > one returned by kernfs_dir_pos(). As a result, the dentry returned by > kernfs_dir_pos() is skipped. > > To fix this issue, try to find a next node only when the returned > object has a hash value equal to or smaller than the original one. > > Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> > Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com> > Cc: Eric W. Biederman <ebied...@aristanetworks.com> > --- > fs/kernfs/dir.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > index 89d1dc1..8a2f49c 100644 > --- a/fs/kernfs/dir.c > +++ b/fs/kernfs/dir.c > @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, > struct file *filp) > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) > { > pos = kernfs_dir_pos(ns, parent, ino, pos); > - if (pos) { > + if (pos && pos->hash <= ino) { I found this condition still misses the case that the returend pos with the same hash value but with different name is skipped. I'll post v2 patch. > do { > struct rb_node *node = rb_next(>rb); > if (!node) > -- > 1.7.1 > > >
RE: [PATCH] kernfs: fix dentry unexpected skip
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke > Sent: Saturday, May 19, 2018 12:43 AM > To: 'gre...@linuxfoundation.org' > Cc: Okajima, Toshiyuki/岡嶋 寿行 ; > linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com' > > Subject: [PATCH] kernfs: fix dentry unexpected skip > > kernfs_dir_next_pos() overlooks the situation that the dentry > corresponding to a given pos object has already been inactive. Hence, > when kernfs_dir_pos() returns the dentry with a hash value larger than > the original one, kernfs_dir_next_pos() returns the dentry next to the > one returned by kernfs_dir_pos(). As a result, the dentry returned by > kernfs_dir_pos() is skipped. > > To fix this issue, try to find a next node only when the returned > object has a hash value equal to or smaller than the original one. > > Signed-off-by: HATAYAMA Daisuke > Suggested-by: Toshiyuki Okajima > Cc: Eric W. Biederman > --- > fs/kernfs/dir.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > index 89d1dc1..8a2f49c 100644 > --- a/fs/kernfs/dir.c > +++ b/fs/kernfs/dir.c > @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, > struct file *filp) > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) > { > pos = kernfs_dir_pos(ns, parent, ino, pos); > - if (pos) { > + if (pos && pos->hash <= ino) { I found this condition still misses the case that the returend pos with the same hash value but with different name is skipped. I'll post v2 patch. > do { > struct rb_node *node = rb_next(>rb); > if (!node) > -- > 1.7.1 > > >
[PATCH] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object has a hash value equal to or smaller than the original one. Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com> Cc: Eric W. Biederman <ebied...@aristanetworks.com> --- fs/kernfs/dir.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..8a2f49c 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && pos->hash <= ino) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
[PATCH] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object has a hash value equal to or smaller than the original one. Signed-off-by: HATAYAMA Daisuke Suggested-by: Toshiyuki Okajima Cc: Eric W. Biederman --- fs/kernfs/dir.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..8a2f49c 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && pos->hash <= ino) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
[PATCH] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object has a hash value equal to or smaller than the original one. Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com> Cc: Eric W. Biederman <ebied...@aristanetworks.com> --- fs/kernfs/dir.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..8a2f49c 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && pos->hash <= ino) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
[PATCH] kernfs: fix dentry unexpected skip
kernfs_dir_next_pos() overlooks the situation that the dentry corresponding to a given pos object has already been inactive. Hence, when kernfs_dir_pos() returns the dentry with a hash value larger than the original one, kernfs_dir_next_pos() returns the dentry next to the one returned by kernfs_dir_pos(). As a result, the dentry returned by kernfs_dir_pos() is skipped. To fix this issue, try to find a next node only when the returned object has a hash value equal to or smaller than the original one. Signed-off-by: HATAYAMA Daisuke Suggested-by: Toshiyuki Okajima Cc: Eric W. Biederman --- fs/kernfs/dir.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 89d1dc1..8a2f49c 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp) struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos) { pos = kernfs_dir_pos(ns, parent, ino, pos); - if (pos) { + if (pos && pos->hash <= ino) { do { struct rb_node *node = rb_next(>rb); if (!node) -- 1.7.1
RE: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support
Marc-Andre, It looks to me that the 4th and 5th patches somehow has not been sent. Could you send them, too? I'd like them to actually build and run the kernel for testing. > -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Marc-Andre Lureau > Sent: Wednesday, November 8, 2017 1:24 AM > To: linux-kernel@vger.kernel.org > Cc: so...@cmu.edu; qemu-de...@nongnu.org; m...@redhat.com; Marc-André Lureau >> Subject: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support > > Hi, > > This series adds DMA operations support to the qemu fw_cfg kernel > module and populates "etc/vmcoreinfo" with vmcoreinfo location > details. > > Note: the support for this entry handling has been merged for next > qemu release (2.11) > > v5: > - resent to CC kdump people on the paddr_vmcoreinfo_note() export patch > > v4: > - export paddr_vmcoreinfo_note() to fix fw_cfg.ko build > - fix build with !CONFIG_CRASH_CORE > - replace the unbounded yield() loop with a usleep_range() loop and a > 200ms timeout > - do not write vmcoreinfo entry when running the kdump kernel (D. Hatayama) > - drop the experimental sysfs write support patch from this series > > v3: (thanks kbuild) > - add "fw_cfg: fix the command line module name" patch > - fix build of "fw_cfg: add DMA register" with CONFIG_FW_CFG_SYSFS_CMDLINE=y > - fix 'Wshift-count-overflow' > > v2: > - use platform device for dma mapping > - add etc/vmcoreinfo patch > - some code cleanups > > Marc-André Lureau (5): > fw_cfg: fix the command line module name > fw_cfg: add DMA register > fw_cfg: do DMA read operation > crash: export paddr_vmcoreinfo_note() > fw_cfg: write vmcoreinfo details > > drivers/firmware/qemu_fw_cfg.c | 292 > + > kernel/crash_core.c| 1 + > 2 files changed, 264 insertions(+), 29 deletions(-) > > -- > 2.15.0.125.g8f49766d64 > >
RE: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support
Marc-Andre, It looks to me that the 4th and 5th patches somehow has not been sent. Could you send them, too? I'd like them to actually build and run the kernel for testing. > -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Marc-Andre Lureau > Sent: Wednesday, November 8, 2017 1:24 AM > To: linux-kernel@vger.kernel.org > Cc: so...@cmu.edu; qemu-de...@nongnu.org; m...@redhat.com; Marc-André Lureau > > Subject: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support > > Hi, > > This series adds DMA operations support to the qemu fw_cfg kernel > module and populates "etc/vmcoreinfo" with vmcoreinfo location > details. > > Note: the support for this entry handling has been merged for next > qemu release (2.11) > > v5: > - resent to CC kdump people on the paddr_vmcoreinfo_note() export patch > > v4: > - export paddr_vmcoreinfo_note() to fix fw_cfg.ko build > - fix build with !CONFIG_CRASH_CORE > - replace the unbounded yield() loop with a usleep_range() loop and a > 200ms timeout > - do not write vmcoreinfo entry when running the kdump kernel (D. Hatayama) > - drop the experimental sysfs write support patch from this series > > v3: (thanks kbuild) > - add "fw_cfg: fix the command line module name" patch > - fix build of "fw_cfg: add DMA register" with CONFIG_FW_CFG_SYSFS_CMDLINE=y > - fix 'Wshift-count-overflow' > > v2: > - use platform device for dma mapping > - add etc/vmcoreinfo patch > - some code cleanups > > Marc-André Lureau (5): > fw_cfg: fix the command line module name > fw_cfg: add DMA register > fw_cfg: do DMA read operation > crash: export paddr_vmcoreinfo_note() > fw_cfg: write vmcoreinfo details > > drivers/firmware/qemu_fw_cfg.c | 292 > + > kernel/crash_core.c| 1 + > 2 files changed, 264 insertions(+), 29 deletions(-) > > -- > 2.15.0.125.g8f49766d64 > >
RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin > Sent: Tuesday, October 31, 2017 1:19 AM > To: Hatayama, Daisuke <d.hatay...@jp.fujitsu.com> > Cc: 'marcandre.lur...@redhat.com' <marcandre.lur...@redhat.com>; > 'so...@cmu.edu' <so...@cmu.edu>; linux-kernel@vger.kernel.org; > 'qemu-de...@nongnu.org' <qemu-de...@nongnu.org> > Subject: Re: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs > > On Mon, Oct 30, 2017 at 11:07:20AM +, Hatayama, Daisuke wrote: > > > From: Marc-André Lureau <marcandre.lur...@redhat.com> > > > > > > Since qemu 2.9, DMA write operations are allowed. However, usage of this > > > interface from kernel or user-space is strongly discouraged by the > > > maintainers. This patch is meant for experimentations for now. > > > > > > > Could you (or maintainers?) tell me how experimental the DMA write > > operations from kernel are? > > > > From some technical reason? > > Or simply there has not been enough test yet so far? > > The concern is security, talking from userspace to hypervisor > might become a trivial DOS vector. > > If there's need for a specific entry to be accessible from userspace, > I'd rather white-list it. > Thanks for the explanation. If so, I guess the white-list is done in userspace qemu side, and no additional change would be needed for this kernel-side patch set for the white-list feature. Is this understanding correct? > > > Signed-off-by: Marc-André Lureau <marcandre.lur...@redhat.com> > > > --- > > > drivers/firmware/qemu_fw_cfg.c | 21 - > > > 1 file changed, 20 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/firmware/qemu_fw_cfg.c > b/drivers/firmware/qemu_fw_cfg.c > > > index 54b569da3257..e2f2ad1c9c0c 100644 > > > --- a/drivers/firmware/qemu_fw_cfg.c > > > +++ b/drivers/firmware/qemu_fw_cfg.c > > > @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file > > > *filp, > struct kobject *kobj, > > > return fw_cfg_read_blob(entry->f.select, buf, pos, count, true); > > > } > > > > > > +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject > *kobj, > > > + struct bin_attribute *bin_attr, > > > + char *buf, loff_t pos, size_t count) > > > +{ > > > + struct fw_cfg_sysfs_entry *entry = to_entry(kobj); > > > + > > > + if (!fw_cfg_dma_enabled()) > > > + return -ENOTSUPP; > > > + > > > + if (pos > entry->f.size) > > > + return -EINVAL; > > > + > > > + if (count > entry->f.size - pos) > > > + count = entry->f.size - pos; > > > + > > > + return fw_cfg_write_blob(entry->f.select, buf, pos, count); > > > +} > > > + > > > static struct bin_attribute fw_cfg_sysfs_attr_raw = { > > > - .attr = { .name = "raw", .mode = S_IRUSR }, > > > + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR }, > > > .read = fw_cfg_sysfs_read_raw, > > > + .write = fw_cfg_sysfs_write_raw, > > > }; > > > > > > /* > > > -- > > > 2.14.1.146.gd35faa819 > > > > Thanks. > > HATAYAMA, Daisuke > > >
RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin > Sent: Tuesday, October 31, 2017 1:19 AM > To: Hatayama, Daisuke > Cc: 'marcandre.lur...@redhat.com' ; > 'so...@cmu.edu' ; linux-kernel@vger.kernel.org; > 'qemu-de...@nongnu.org' > Subject: Re: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs > > On Mon, Oct 30, 2017 at 11:07:20AM +, Hatayama, Daisuke wrote: > > > From: Marc-André Lureau > > > > > > Since qemu 2.9, DMA write operations are allowed. However, usage of this > > > interface from kernel or user-space is strongly discouraged by the > > > maintainers. This patch is meant for experimentations for now. > > > > > > > Could you (or maintainers?) tell me how experimental the DMA write > > operations from kernel are? > > > > From some technical reason? > > Or simply there has not been enough test yet so far? > > The concern is security, talking from userspace to hypervisor > might become a trivial DOS vector. > > If there's need for a specific entry to be accessible from userspace, > I'd rather white-list it. > Thanks for the explanation. If so, I guess the white-list is done in userspace qemu side, and no additional change would be needed for this kernel-side patch set for the white-list feature. Is this understanding correct? > > > Signed-off-by: Marc-André Lureau > > > --- > > > drivers/firmware/qemu_fw_cfg.c | 21 - > > > 1 file changed, 20 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/firmware/qemu_fw_cfg.c > b/drivers/firmware/qemu_fw_cfg.c > > > index 54b569da3257..e2f2ad1c9c0c 100644 > > > --- a/drivers/firmware/qemu_fw_cfg.c > > > +++ b/drivers/firmware/qemu_fw_cfg.c > > > @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file > > > *filp, > struct kobject *kobj, > > > return fw_cfg_read_blob(entry->f.select, buf, pos, count, true); > > > } > > > > > > +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject > *kobj, > > > + struct bin_attribute *bin_attr, > > > + char *buf, loff_t pos, size_t count) > > > +{ > > > + struct fw_cfg_sysfs_entry *entry = to_entry(kobj); > > > + > > > + if (!fw_cfg_dma_enabled()) > > > + return -ENOTSUPP; > > > + > > > + if (pos > entry->f.size) > > > + return -EINVAL; > > > + > > > + if (count > entry->f.size - pos) > > > + count = entry->f.size - pos; > > > + > > > + return fw_cfg_write_blob(entry->f.select, buf, pos, count); > > > +} > > > + > > > static struct bin_attribute fw_cfg_sysfs_attr_raw = { > > > - .attr = { .name = "raw", .mode = S_IRUSR }, > > > + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR }, > > > .read = fw_cfg_sysfs_read_raw, > > > + .write = fw_cfg_sysfs_write_raw, > > > }; > > > > > > /* > > > -- > > > 2.14.1.146.gd35faa819 > > > > Thanks. > > HATAYAMA, Daisuke > > >
RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs
> From: Marc-André Lureau <marcandre.lur...@redhat.com> > > Since qemu 2.9, DMA write operations are allowed. However, usage of this > interface from kernel or user-space is strongly discouraged by the > maintainers. This patch is meant for experimentations for now. > Could you (or maintainers?) tell me how experimental the DMA write operations from kernel are? From some technical reason? Or simply there has not been enough test yet so far? > Signed-off-by: Marc-André Lureau <marcandre.lur...@redhat.com> > --- > drivers/firmware/qemu_fw_cfg.c | 21 - > 1 file changed, 20 insertions(+), 1 deletion(-) > > diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c > index 54b569da3257..e2f2ad1c9c0c 100644 > --- a/drivers/firmware/qemu_fw_cfg.c > +++ b/drivers/firmware/qemu_fw_cfg.c > @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file *filp, > struct kobject *kobj, > return fw_cfg_read_blob(entry->f.select, buf, pos, count, true); > } > > +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject > *kobj, > + struct bin_attribute *bin_attr, > + char *buf, loff_t pos, size_t count) > +{ > + struct fw_cfg_sysfs_entry *entry = to_entry(kobj); > + > + if (!fw_cfg_dma_enabled()) > + return -ENOTSUPP; > + > + if (pos > entry->f.size) > + return -EINVAL; > + > + if (count > entry->f.size - pos) > + count = entry->f.size - pos; > + > + return fw_cfg_write_blob(entry->f.select, buf, pos, count); > +} > + > static struct bin_attribute fw_cfg_sysfs_attr_raw = { > - .attr = { .name = "raw", .mode = S_IRUSR }, > + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR }, > .read = fw_cfg_sysfs_read_raw, > + .write = fw_cfg_sysfs_write_raw, > }; > > /* > -- > 2.14.1.146.gd35faa819 Thanks. HATAYAMA, Daisuke
RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs
> From: Marc-André Lureau > > Since qemu 2.9, DMA write operations are allowed. However, usage of this > interface from kernel or user-space is strongly discouraged by the > maintainers. This patch is meant for experimentations for now. > Could you (or maintainers?) tell me how experimental the DMA write operations from kernel are? From some technical reason? Or simply there has not been enough test yet so far? > Signed-off-by: Marc-André Lureau > --- > drivers/firmware/qemu_fw_cfg.c | 21 - > 1 file changed, 20 insertions(+), 1 deletion(-) > > diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c > index 54b569da3257..e2f2ad1c9c0c 100644 > --- a/drivers/firmware/qemu_fw_cfg.c > +++ b/drivers/firmware/qemu_fw_cfg.c > @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file *filp, > struct kobject *kobj, > return fw_cfg_read_blob(entry->f.select, buf, pos, count, true); > } > > +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject > *kobj, > + struct bin_attribute *bin_attr, > + char *buf, loff_t pos, size_t count) > +{ > + struct fw_cfg_sysfs_entry *entry = to_entry(kobj); > + > + if (!fw_cfg_dma_enabled()) > + return -ENOTSUPP; > + > + if (pos > entry->f.size) > + return -EINVAL; > + > + if (count > entry->f.size - pos) > + count = entry->f.size - pos; > + > + return fw_cfg_write_blob(entry->f.select, buf, pos, count); > +} > + > static struct bin_attribute fw_cfg_sysfs_attr_raw = { > - .attr = { .name = "raw", .mode = S_IRUSR }, > + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR }, > .read = fw_cfg_sysfs_read_raw, > + .write = fw_cfg_sysfs_write_raw, > }; > > /* > -- > 2.14.1.146.gd35faa819 Thanks. HATAYAMA, Daisuke
Re: [PATCH V2] proc-vmcore: wrong data type casting fix
From: Baoquan He <b...@redhat.com> Subject: Re: [PATCH V2] proc-vmcore: wrong data type casting fix Date: Mon, 14 Mar 2016 11:50:53 +0800 > On 03/14/16 at 11:47am, Baoquan He wrote: >> On 03/14/16 at 12:25pm, HATAYAMA Daisuke wrote: >> > From: Dave Young <dyo...@redhat.com> >> > Subject: [PATCH V2] proc-vmcore: wrong data type casting fix >> > Date: Fri, 11 Mar 2016 16:42:48 +0800 >> > >> > > On i686 PAE enabled machine the contiguous physical area could be large >> > > and it can cause trimming down variables in below calculation in >> > > read_vmcore() and mmap_vmcore(): >> > > >> > > tsz = min_t(size_t, m->offset + m->size - *fpos, buflen); >> > > >> > > Then the real size passed down is not correct any more. >> > > Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then >> > >> > That is, size_t and loff_t are defined as follows on i686: >> > >> > (gdb) ptype size_t >> > type = unsigned int >> > (gdb) ptype loff_t >> > type = long long int >> > >> > So casting by size_t means truncating a given value by 4GB. >> > >> > Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB, >> > and is aligned with 4GB, the resulted value is 0, and >> >> Truncating doesn't mean align. If (m->offset + m->size - *fpos) is >> larger than 4G, E.g 0x1000f, the truncating result will be 0xf. >> Thanks for pointing out. I was confused with truncation and alignment. I wanted to say here... >> > >> > > we will get tsz = 0. It is of course not an expected result. >> >> We won't always get "tsz=0", just get the lower 32 bit value. truncating the upper 32 bits and just getting the lower 32 bits as you say. > > OK, didn't get you still have saying in next paragraph. Ignore this > please. > >> >> > > >> > > During our tests there are two problems caused by it: >> > > 1) read_vmcore will refuse to continue so makedumpfile fails. >> > > 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range(). >> > > >> > >> > we reach these errors. >> > >> > If (m->offset + m->size - *fpos) is not aligned with 4GB, >> > read_vmcore() or mmap_vmcore() is performed with the truncated >> > non-zero value as size (of course, this is also not expected value but >> > the execution doesn't result in error). Then, fpos proceeds so that >> > (m->offset + m->size - *fpos) is aligned with 4GB in the next loop, >> > and we after all reach the errors. >> > >> > I think your patch description needs a bit more detail. >> > >> > It seems good to me that the patch itself. >> > >> > > Use unsigned long long in min_t instead so that the variables are not >> > > truncated. >> > > >> > > Signed-off-by: Baoquan He <b...@redhat.com> >> > > Signed-off-by: Dave Young <dyo...@redhat.com> >> > > --- >> > > v1->v2: spelling fix in patch log >> > > fs/proc/vmcore.c |7 +-- >> > > 1 file changed, 5 insertions(+), 2 deletions(-) >> > > >> > > --- linux-x86.orig/fs/proc/vmcore.c >> > > +++ linux-x86/fs/proc/vmcore.c >> > > @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe >> > > >> > > list_for_each_entry(m, _list, list) { >> > > if (*fpos < m->offset + m->size) { >> > > -tsz = min_t(size_t, m->offset + m->size - >> > > *fpos, buflen); >> > > +tsz = (size_t)min_t(unsigned long long, >> > > +m->offset + m->size - *fpos, >> > > +buflen); >> > > start = m->paddr + *fpos - m->offset; >> > > tmp = read_from_oldmem(buffer, tsz, , >> > > userbuf); >> > > if (tmp < 0) >> > > @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file >> > > if (start < m->offset + m->size) { >> > > u64 paddr = 0; >> > > >> > > -tsz = min_t(size_t, m->offset + m->size - >> > > start, size); >> > > +tsz = (size_t)min_t(unsigned long long, >> > > +m->offset + m->size - >> > > start, size); >> > > paddr = m->paddr + start - m->offset; >> > > if (vmcore_remap_oldmem_pfn(vma, vma->vm_start >> > > + len, >> > > paddr >> >> > > PAGE_SHIFT, tsz, >> > > >> > -- >> > Thanks. >> > HATAYAMA, Daisuke > -- Thanks. HATAYAMA, Daisuke
Re: [PATCH V2] proc-vmcore: wrong data type casting fix
From: Baoquan He Subject: Re: [PATCH V2] proc-vmcore: wrong data type casting fix Date: Mon, 14 Mar 2016 11:50:53 +0800 > On 03/14/16 at 11:47am, Baoquan He wrote: >> On 03/14/16 at 12:25pm, HATAYAMA Daisuke wrote: >> > From: Dave Young >> > Subject: [PATCH V2] proc-vmcore: wrong data type casting fix >> > Date: Fri, 11 Mar 2016 16:42:48 +0800 >> > >> > > On i686 PAE enabled machine the contiguous physical area could be large >> > > and it can cause trimming down variables in below calculation in >> > > read_vmcore() and mmap_vmcore(): >> > > >> > > tsz = min_t(size_t, m->offset + m->size - *fpos, buflen); >> > > >> > > Then the real size passed down is not correct any more. >> > > Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then >> > >> > That is, size_t and loff_t are defined as follows on i686: >> > >> > (gdb) ptype size_t >> > type = unsigned int >> > (gdb) ptype loff_t >> > type = long long int >> > >> > So casting by size_t means truncating a given value by 4GB. >> > >> > Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB, >> > and is aligned with 4GB, the resulted value is 0, and >> >> Truncating doesn't mean align. If (m->offset + m->size - *fpos) is >> larger than 4G, E.g 0x1000f, the truncating result will be 0xf. >> Thanks for pointing out. I was confused with truncation and alignment. I wanted to say here... >> > >> > > we will get tsz = 0. It is of course not an expected result. >> >> We won't always get "tsz=0", just get the lower 32 bit value. truncating the upper 32 bits and just getting the lower 32 bits as you say. > > OK, didn't get you still have saying in next paragraph. Ignore this > please. > >> >> > > >> > > During our tests there are two problems caused by it: >> > > 1) read_vmcore will refuse to continue so makedumpfile fails. >> > > 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range(). >> > > >> > >> > we reach these errors. >> > >> > If (m->offset + m->size - *fpos) is not aligned with 4GB, >> > read_vmcore() or mmap_vmcore() is performed with the truncated >> > non-zero value as size (of course, this is also not expected value but >> > the execution doesn't result in error). Then, fpos proceeds so that >> > (m->offset + m->size - *fpos) is aligned with 4GB in the next loop, >> > and we after all reach the errors. >> > >> > I think your patch description needs a bit more detail. >> > >> > It seems good to me that the patch itself. >> > >> > > Use unsigned long long in min_t instead so that the variables are not >> > > truncated. >> > > >> > > Signed-off-by: Baoquan He >> > > Signed-off-by: Dave Young >> > > --- >> > > v1->v2: spelling fix in patch log >> > > fs/proc/vmcore.c |7 +-- >> > > 1 file changed, 5 insertions(+), 2 deletions(-) >> > > >> > > --- linux-x86.orig/fs/proc/vmcore.c >> > > +++ linux-x86/fs/proc/vmcore.c >> > > @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe >> > > >> > > list_for_each_entry(m, _list, list) { >> > > if (*fpos < m->offset + m->size) { >> > > -tsz = min_t(size_t, m->offset + m->size - >> > > *fpos, buflen); >> > > +tsz = (size_t)min_t(unsigned long long, >> > > +m->offset + m->size - *fpos, >> > > +buflen); >> > > start = m->paddr + *fpos - m->offset; >> > > tmp = read_from_oldmem(buffer, tsz, , >> > > userbuf); >> > > if (tmp < 0) >> > > @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file >> > > if (start < m->offset + m->size) { >> > > u64 paddr = 0; >> > > >> > > -tsz = min_t(size_t, m->offset + m->size - >> > > start, size); >> > > +tsz = (size_t)min_t(unsigned long long, >> > > +m->offset + m->size - >> > > start, size); >> > > paddr = m->paddr + start - m->offset; >> > > if (vmcore_remap_oldmem_pfn(vma, vma->vm_start >> > > + len, >> > > paddr >> >> > > PAGE_SHIFT, tsz, >> > > >> > -- >> > Thanks. >> > HATAYAMA, Daisuke > -- Thanks. HATAYAMA, Daisuke
Re: [PATCH V2] proc-vmcore: wrong data type casting fix
From: Dave Young <dyo...@redhat.com> Subject: [PATCH V2] proc-vmcore: wrong data type casting fix Date: Fri, 11 Mar 2016 16:42:48 +0800 > On i686 PAE enabled machine the contiguous physical area could be large > and it can cause trimming down variables in below calculation in > read_vmcore() and mmap_vmcore(): > > tsz = min_t(size_t, m->offset + m->size - *fpos, buflen); > > Then the real size passed down is not correct any more. > Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then That is, size_t and loff_t are defined as follows on i686: (gdb) ptype size_t type = unsigned int (gdb) ptype loff_t type = long long int So casting by size_t means truncating a given value by 4GB. Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB, and is aligned with 4GB, the resulted value is 0, and > we will get tsz = 0. It is of course not an expected result. > > During our tests there are two problems caused by it: > 1) read_vmcore will refuse to continue so makedumpfile fails. > 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range(). > we reach these errors. If (m->offset + m->size - *fpos) is not aligned with 4GB, read_vmcore() or mmap_vmcore() is performed with the truncated non-zero value as size (of course, this is also not expected value but the execution doesn't result in error). Then, fpos proceeds so that (m->offset + m->size - *fpos) is aligned with 4GB in the next loop, and we after all reach the errors. I think your patch description needs a bit more detail. It seems good to me that the patch itself. > Use unsigned long long in min_t instead so that the variables are not > truncated. > > Signed-off-by: Baoquan He <b...@redhat.com> > Signed-off-by: Dave Young <dyo...@redhat.com> > --- > v1->v2: spelling fix in patch log > fs/proc/vmcore.c |7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > --- linux-x86.orig/fs/proc/vmcore.c > +++ linux-x86/fs/proc/vmcore.c > @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe > > list_for_each_entry(m, _list, list) { > if (*fpos < m->offset + m->size) { > - tsz = min_t(size_t, m->offset + m->size - *fpos, > buflen); > + tsz = (size_t)min_t(unsigned long long, > + m->offset + m->size - *fpos, > + buflen); > start = m->paddr + *fpos - m->offset; > tmp = read_from_oldmem(buffer, tsz, , userbuf); > if (tmp < 0) > @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file > if (start < m->offset + m->size) { > u64 paddr = 0; > > - tsz = min_t(size_t, m->offset + m->size - start, size); > + tsz = (size_t)min_t(unsigned long long, > + m->offset + m->size - start, size); > paddr = m->paddr + start - m->offset; > if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len, > paddr >> PAGE_SHIFT, tsz, > -- Thanks. HATAYAMA, Daisuke
Re: [PATCH V2] proc-vmcore: wrong data type casting fix
From: Dave Young Subject: [PATCH V2] proc-vmcore: wrong data type casting fix Date: Fri, 11 Mar 2016 16:42:48 +0800 > On i686 PAE enabled machine the contiguous physical area could be large > and it can cause trimming down variables in below calculation in > read_vmcore() and mmap_vmcore(): > > tsz = min_t(size_t, m->offset + m->size - *fpos, buflen); > > Then the real size passed down is not correct any more. > Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then That is, size_t and loff_t are defined as follows on i686: (gdb) ptype size_t type = unsigned int (gdb) ptype loff_t type = long long int So casting by size_t means truncating a given value by 4GB. Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB, and is aligned with 4GB, the resulted value is 0, and > we will get tsz = 0. It is of course not an expected result. > > During our tests there are two problems caused by it: > 1) read_vmcore will refuse to continue so makedumpfile fails. > 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range(). > we reach these errors. If (m->offset + m->size - *fpos) is not aligned with 4GB, read_vmcore() or mmap_vmcore() is performed with the truncated non-zero value as size (of course, this is also not expected value but the execution doesn't result in error). Then, fpos proceeds so that (m->offset + m->size - *fpos) is aligned with 4GB in the next loop, and we after all reach the errors. I think your patch description needs a bit more detail. It seems good to me that the patch itself. > Use unsigned long long in min_t instead so that the variables are not > truncated. > > Signed-off-by: Baoquan He > Signed-off-by: Dave Young > --- > v1->v2: spelling fix in patch log > fs/proc/vmcore.c |7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > --- linux-x86.orig/fs/proc/vmcore.c > +++ linux-x86/fs/proc/vmcore.c > @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe > > list_for_each_entry(m, _list, list) { > if (*fpos < m->offset + m->size) { > - tsz = min_t(size_t, m->offset + m->size - *fpos, > buflen); > + tsz = (size_t)min_t(unsigned long long, > + m->offset + m->size - *fpos, > + buflen); > start = m->paddr + *fpos - m->offset; > tmp = read_from_oldmem(buffer, tsz, , userbuf); > if (tmp < 0) > @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file > if (start < m->offset + m->size) { > u64 paddr = 0; > > - tsz = min_t(size_t, m->offset + m->size - start, size); > + tsz = (size_t)min_t(unsigned long long, > + m->offset + m->size - start, size); > paddr = m->paddr + start - m->offset; > if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len, > paddr >> PAGE_SHIFT, tsz, > -- Thanks. HATAYAMA, Daisuke
[PATCH v3 0/2] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
This patch set fixes a bug introduced in the commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45. For the v2 discussion, please follow the following thread: http://thread.gmane.org/gmane.linux.kernel/1902488 The v3 patch newly adds a patch suggested by Ingo Molnar. --- HATAYAMA Daisuke (2): kernel/panic: call the 2nd crash_kexec() only if crash_kexec_post_notifiers is enabled kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path include/linux/kernel.h |3 +++ kernel/kexec.c | 11 +++ kernel/panic.c |5 +++-- 3 files changed, 17 insertions(+), 2 deletions(-) -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 2/2] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced "crash_kexec_post_notifiers" kernel boot option, which toggles wheather panic() calls crash_kexec() before panic_notifiers and dump kmsg or after. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to "crash_kexec_post_notifiers" in the condition of kexec_should_crash(). Also, put a comment in kexec_should_crash() to explain not obvious things on this patch. Signed-off-by: HATAYAMA Daisuke Acked-by: Baoquan He Tested-by: Hidehiro Kawai Reviewed-by: Masami Hiramatsu --- include/linux/kernel.h |3 +++ kernel/kexec.c | 11 +++ kernel/panic.c |2 +- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 3a5b48e..bd5331c 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -438,6 +438,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 7a36fdc..a785c10 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,17 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + /* +* If crash_kexec_post_notifiers is enabled, don't run +* crash_kexec() here yet, which must be run after panic +* notifiers in panic(). +*/ + if (crash_kexec_post_notifiers) + return 0; + /* +* There are 4 panic() calls in do_exit() path, each of which +* corresponds to each of these 4 conditions. +*/ if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 774614f..04e91ff 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/2] kernel/panic: call the 2nd crash_kexec() only if crash_kexec_post_notifiers is enabled
For compatibility with the behaviour before the commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45, the 2nd crash_kexec() should be called only if crash_kexec_post_notifiers is enabled. Note that crash_kexec() returns immediately if kdump crash kernel is not loaded, so in this case, this patch makes no functionality change, but the point is to make it explicit, from the caller panic() side, that the 2nd crash_kexec() does nothing. Signed-off-by: HATAYAMA Daisuke Suggested-by: Ingo Molnar --- kernel/panic.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/panic.c b/kernel/panic.c index 8136ad7..774614f 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -142,7 +142,8 @@ void panic(const char *fmt, ...) * Note: since some panic_notifiers can make crashed kernel * more unstable, it can increase risks of the kdump failure too. */ - crash_kexec(NULL); + if (crash_kexec_post_notifiers) + crash_kexec(NULL); bust_spinlocks(0); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/2] kernel/panic: call the 2nd crash_kexec() only if crash_kexec_post_notifiers is enabled
For compatibility with the behaviour before the commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45, the 2nd crash_kexec() should be called only if crash_kexec_post_notifiers is enabled. Note that crash_kexec() returns immediately if kdump crash kernel is not loaded, so in this case, this patch makes no functionality change, but the point is to make it explicit, from the caller panic() side, that the 2nd crash_kexec() does nothing. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Suggested-by: Ingo Molnar mi...@kernel.org --- kernel/panic.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/panic.c b/kernel/panic.c index 8136ad7..774614f 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -142,7 +142,8 @@ void panic(const char *fmt, ...) * Note: since some panic_notifiers can make crashed kernel * more unstable, it can increase risks of the kdump failure too. */ - crash_kexec(NULL); + if (crash_kexec_post_notifiers) + crash_kexec(NULL); bust_spinlocks(0); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 2/2] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced crash_kexec_post_notifiers kernel boot option, which toggles wheather panic() calls crash_kexec() before panic_notifiers and dump kmsg or after. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to crash_kexec_post_notifiers in the condition of kexec_should_crash(). Also, put a comment in kexec_should_crash() to explain not obvious things on this patch. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Acked-by: Baoquan He b...@redhat.com Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com Reviewed-by: Masami Hiramatsu masami.hiramatsu...@hitachi.com --- include/linux/kernel.h |3 +++ kernel/kexec.c | 11 +++ kernel/panic.c |2 +- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 3a5b48e..bd5331c 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -438,6 +438,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 7a36fdc..a785c10 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,17 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + /* +* If crash_kexec_post_notifiers is enabled, don't run +* crash_kexec() here yet, which must be run after panic +* notifiers in panic(). +*/ + if (crash_kexec_post_notifiers) + return 0; + /* +* There are 4 panic() calls in do_exit() path, each of which +* corresponds to each of these 4 conditions. +*/ if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 774614f..04e91ff 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/2] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
This patch set fixes a bug introduced in the commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45. For the v2 discussion, please follow the following thread: http://thread.gmane.org/gmane.linux.kernel/1902488 The v3 patch newly adds a patch suggested by Ingo Molnar. --- HATAYAMA Daisuke (2): kernel/panic: call the 2nd crash_kexec() only if crash_kexec_post_notifiers is enabled kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path include/linux/kernel.h |3 +++ kernel/kexec.c | 11 +++ kernel/panic.c |5 +++-- 3 files changed, 17 insertions(+), 2 deletions(-) -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced "crash_kexec_post_notifiers" kernel boot option, which toggles wheather panic() calls crash_kexec() before panic_notifiers and dump kmsg or after. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to "crash_kexec_post_notifiers" in the condition of kexec_should_crash(). Also, put a comment in kexec_should_crash() to explain not obvious things on this patch. Signed-off-by: HATAYAMA Daisuke Acked-by: Baoquan He Tested-by: Hidehiro Kawai Reviewed-by: Masami Hiramatsu --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 11 +++ kernel/panic.c | 2 +- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index d6d630d..07483c7 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 38c25b1..5bf6077 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,17 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + /* +* If crash_kexec_post_notifiers is enabled, don't run +* crash_kexec() here yet, which must be run after panic +* notifiers in panic(). +*/ + if (crash_kexec_post_notifiers) + return 0; + /* +* There are 4 panic() calls in do_exit() path, each of which +* calls corresponds to each of these 4 conditions. +*/ if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 8136ad7..79ca912 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced crash_kexec_post_notifiers kernel boot option, which toggles wheather panic() calls crash_kexec() before panic_notifiers and dump kmsg or after. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to crash_kexec_post_notifiers in the condition of kexec_should_crash(). Also, put a comment in kexec_should_crash() to explain not obvious things on this patch. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Acked-by: Baoquan He b...@redhat.com Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com Reviewed-by: Masami Hiramatsu masami.hiramatsu...@hitachi.com --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 11 +++ kernel/panic.c | 2 +- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index d6d630d..07483c7 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 38c25b1..5bf6077 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,17 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + /* +* If crash_kexec_post_notifiers is enabled, don't run +* crash_kexec() here yet, which must be run after panic +* notifiers in panic(). +*/ + if (crash_kexec_post_notifiers) + return 0; + /* +* There are 4 panic() calls in do_exit() path, each of which +* calls corresponds to each of these 4 conditions. +*/ if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 8136ad7..79ca912 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RESEND PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
From: Vivek Goyal Subject: Re: [RESEND PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path Date: Thu, 5 Mar 2015 17:22:04 -0500 > On Thu, Mar 05, 2015 at 05:19:30PM -0500, Vivek Goyal wrote: >> On Wed, Mar 04, 2015 at 05:56:48PM +0900, HATAYAMA Daisuke wrote: >> > The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced >> > "crash_kexec_post_notifiers" kernel boot option, which toggles >> > wheather panic() calls crash_kexec() before or after panic_notifiers >> > and dump kmsg. >> > >> > The problem is that the commit overlooks panic_on_oops kernel boot >> > option. If it is enabled, crash_kexec() is called directly without >> > going through panic() in oops path. >> > >> > To fix this issue, this patch adds a check to >> > "crash_kexec_post_notifiers" in the condition of kexec_should_crash(). >> > >> > Signed-off-by: HATAYAMA Daisuke >> > Acked-by: Baoquan He >> > Tested-by: Hidehiro Kawai >> > --- >> > include/linux/kernel.h | 3 +++ >> > kernel/kexec.c | 2 ++ >> > kernel/panic.c | 2 +- >> > 3 files changed, 6 insertions(+), 1 deletion(-) >> > >> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h >> > index 64ce58b..f47379f 100644 >> > --- a/include/linux/kernel.h >> > +++ b/include/linux/kernel.h >> > @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; >> > extern int panic_on_io_nmi; >> > extern int panic_on_warn; >> > extern int sysctl_panic_on_stackoverflow; >> > + >> > +extern bool crash_kexec_post_notifiers; >> > + >> > /* >> > * Only to be used by arch init code. If the user over-wrote the default >> > * CONFIG_PANIC_TIMEOUT, honor it. >> > diff --git a/kernel/kexec.c b/kernel/kexec.c >> > index 9a8a01a..0ecf252 100644 >> > --- a/kernel/kexec.c >> > +++ b/kernel/kexec.c >> > @@ -84,6 +84,8 @@ struct resource crashk_low_res = { >> > >> > int kexec_should_crash(struct task_struct *p) >> > { >> > + if (crash_kexec_post_notifiers) >> > + return 0; >> >> This is little confusing. So if crash_kexec_post_notifiers is set but >> panic_on_oops is not set, still we will return? >> >> Should we do this only if panic_on_oops is set? IOW, how about following >> >> if (panic_on_oops && crash_kexec_post_notifiers) >> return 0; >> >> And then also put a comment explaining the rationale. > > Ok, I went through the previous version of patch and discussion there > which says that all the 4 conditions lead to panic. So putting above > code should be fine. > > Can you please atleast put a comment here to explain it as it was not > obvious. Just mention that all the checks below lead to panic hence > if user wants to run panic notifiers then don't run crash_kexec() yet. > It will be run after panic notifiers. > Thanks for your reviewing. Yes, I'll put such new comment in the patch of next version. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RESEND PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
From: Vivek Goyal vgo...@redhat.com Subject: Re: [RESEND PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path Date: Thu, 5 Mar 2015 17:22:04 -0500 On Thu, Mar 05, 2015 at 05:19:30PM -0500, Vivek Goyal wrote: On Wed, Mar 04, 2015 at 05:56:48PM +0900, HATAYAMA Daisuke wrote: The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced crash_kexec_post_notifiers kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to crash_kexec_post_notifiers in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Acked-by: Baoquan He b...@redhat.com Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + if (crash_kexec_post_notifiers) + return 0; This is little confusing. So if crash_kexec_post_notifiers is set but panic_on_oops is not set, still we will return? Should we do this only if panic_on_oops is set? IOW, how about following if (panic_on_oops crash_kexec_post_notifiers) return 0; And then also put a comment explaining the rationale. Ok, I went through the previous version of patch and discussion there which says that all the 4 conditions lead to panic. So putting above code should be fine. Can you please atleast put a comment here to explain it as it was not obvious. Just mention that all the checks below lead to panic hence if user wants to run panic notifiers then don't run crash_kexec() yet. It will be run after panic notifiers. Thanks for your reviewing. Yes, I'll put such new comment in the patch of next version. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced "crash_kexec_post_notifiers" kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to "crash_kexec_post_notifiers" in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke Acked-by: Baoquan He Tested-by: Hidehiro Kawai --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + if (crash_kexec_post_notifiers) + return 0; if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 4d8d6f9..6582546 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced crash_kexec_post_notifiers kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to crash_kexec_post_notifiers in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Acked-by: Baoquan He b...@redhat.com Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + if (crash_kexec_post_notifiers) + return 0; if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 4d8d6f9..6582546 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
Hello Eric, Vivek, Do you have any comment to this patch? On 2015/02/05 17:59, HATAYAMA Daisuke wrote: The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced "crash_kexec_post_notifiers" kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to "crash_kexec_post_notifiers" in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + if (crash_kexec_post_notifiers) + return 0; if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 4d8d6f9..6582546 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
Hello Eric, Vivek, Do you have any comment to this patch? On 2015/02/05 17:59, HATAYAMA Daisuke wrote: The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced crash_kexec_post_notifiers kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to crash_kexec_post_notifiers in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + if (crash_kexec_post_notifiers) + return 0; if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 4d8d6f9..6582546 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
Hello, From: Baoquan He Subject: Re: [PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path Date: Mon, 9 Feb 2015 10:40:30 +0800 > On 02/05/15 at 05:59pm, HATAYAMA Daisuke wrote: >> The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced >> "crash_kexec_post_notifiers" kernel boot option, which toggles >> wheather panic() calls crash_kexec() before or after panic_notifiers >> and dump kmsg. >> >> The problem is that the commit overlooks panic_on_oops kernel boot >> option. If it is enabled, crash_kexec() is called directly without >> going through panic() in oops path. >> >> To fix this issue, this patch adds a check to >> "crash_kexec_post_notifiers" in the condition of kexec_should_crash(). >> >> Signed-off-by: HATAYAMA Daisuke >> --- >> include/linux/kernel.h | 3 +++ >> kernel/kexec.c | 2 ++ >> kernel/panic.c | 2 +- >> 3 files changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/include/linux/kernel.h b/include/linux/kernel.h >> index 64ce58b..f47379f 100644 >> --- a/include/linux/kernel.h >> +++ b/include/linux/kernel.h >> @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; >> extern int panic_on_io_nmi; >> extern int panic_on_warn; >> extern int sysctl_panic_on_stackoverflow; >> + >> +extern bool crash_kexec_post_notifiers; >> + >> /* >> * Only to be used by arch init code. If the user over-wrote the default >> * CONFIG_PANIC_TIMEOUT, honor it. >> diff --git a/kernel/kexec.c b/kernel/kexec.c >> index 9a8a01a..0ecf252 100644 >> --- a/kernel/kexec.c >> +++ b/kernel/kexec.c >> @@ -84,6 +84,8 @@ struct resource crashk_low_res = { >> >> int kexec_should_crash(struct task_struct *p) >> { >> +if (crash_kexec_post_notifiers) >> +return 0; >> if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops) >> return 1; > > What if these two conditions !p->pid || is_global_init(p) are satisfied? > Seems the behavious is changed. > Please further follow do_exit() path. For each condition, there are the corresponding panic() calls. In summary: oops_end 1) panic() for in_interrupt() 2) panic() for panic_on_oops do_exit 3) panic() for !p->pid (idle task) exit_notify forget_original_parent find_child_reaper 4) panic() for p->pid == 1 (init task) -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
Hello, From: Baoquan He b...@redhat.com Subject: Re: [PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path Date: Mon, 9 Feb 2015 10:40:30 +0800 On 02/05/15 at 05:59pm, HATAYAMA Daisuke wrote: The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced crash_kexec_post_notifiers kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to crash_kexec_post_notifiers in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { +if (crash_kexec_post_notifiers) +return 0; if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops) return 1; What if these two conditions !p-pid || is_global_init(p) are satisfied? Seems the behavious is changed. Please further follow do_exit() path. For each condition, there are the corresponding panic() calls. In summary: oops_end 1) panic() for in_interrupt() 2) panic() for panic_on_oops do_exit 3) panic() for !p-pid (idle task) exit_notify forget_original_parent find_child_reaper 4) panic() for p-pid == 1 (init task) -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced "crash_kexec_post_notifiers" kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to "crash_kexec_post_notifiers" in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + if (crash_kexec_post_notifiers) + return 0; if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 4d8d6f9..6582546 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced crash_kexec_post_notifiers kernel boot option, which toggles wheather panic() calls crash_kexec() before or after panic_notifiers and dump kmsg. The problem is that the commit overlooks panic_on_oops kernel boot option. If it is enabled, crash_kexec() is called directly without going through panic() in oops path. To fix this issue, this patch adds a check to crash_kexec_post_notifiers in the condition of kexec_should_crash(). Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- include/linux/kernel.h | 3 +++ kernel/kexec.c | 2 ++ kernel/panic.c | 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 64ce58b..f47379f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + /* * Only to be used by arch init code. If the user over-wrote the default * CONFIG_PANIC_TIMEOUT, honor it. diff --git a/kernel/kexec.c b/kernel/kexec.c index 9a8a01a..0ecf252 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -84,6 +84,8 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { + if (crash_kexec_post_notifiers) + return 0; if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops) return 1; return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 4d8d6f9..6582546 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -32,7 +32,7 @@ static unsigned long tainted_mask; static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -static bool crash_kexec_post_notifiers; +bool crash_kexec_post_notifiers; int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Fri, 14 Nov 2014 13:36:10 +0100 > On Fri, 14 Nov 2014 18:54:23 +0900 (JST) > HATAYAMA Daisuke wrote: > >> From: Petr Tesarik >> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in >> VMCOREINFO >> Date: Fri, 14 Nov 2014 09:31:45 +0100 >> >> > On Fri, 14 Nov 2014 10:42:35 +0900 (JST) >> > HATAYAMA Daisuke wrote: >> > >> >> From: Petr Tesarik >> >> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in >> >> VMCOREINFO >> >> Date: Thu, 13 Nov 2014 15:48:10 +0100 >> >> >> >> > On Thu, 13 Nov 2014 09:25:48 -0500 >> >> > Vivek Goyal wrote: >> >> > >> >> >> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: >> >> >> > >> >> >> > (2014/11/13 17:06), Petr Tesarik wrote: >> >> >> > >On Thu, 13 Nov 2014 09:17:09 +0900 (JST) >> >> >> > >HATAYAMA Daisuke wrote: >> >> >> > > >> >> >> > >>From: Vivek Goyal >> >> >> > >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base >> >> >> > >>in VMCOREINFO >> >> >> > >>Date: Wed, 12 Nov 2014 17:12:05 -0500 >> >> >> > >> >> >> >> > >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: >> >> >> > >>>>Currently, VMCOREINFO note information reports the virtual >> >> >> > >>>>address of >> >> >> > >>>>phys_base that is assigned to symbol phys_base. But this doesn't >> >> >> > >>>>make >> >> >> > >>>>sense because to refer to value of the phys_base, it's necessary >> >> >> > >>>>to >> >> >> > >>>>get the value of phys_base itself we are now about to refer to. >> >> >> > >>>> >> >> >> > >>> >> >> >> > >>>Hi Hatayama, >> >> >> > >>> >> >> >> > >>>/proc/vmcore ELF headers have virtual address information and >> >> >> > >>>using >> >> >> > >>>that you should be able to read actual value of phys_base. gdb >> >> >> > >>>deals >> >> >> > >>>with virtual addresses all the time and can read value of any >> >> >> > >>>symbol >> >> >> > >>>using those headers. >> >> >> > >>> >> >> >> > >>>So I am not sure what's the need for exporting actual value of >> >> >> > >>>phys_base. >> >> >> > >>> >> >> >> > >> >> >> >> > >>Sorry, my logic in the patch description was wrong. For >> >> >> > >>/proc/vmcore, >> >> >> > >>there's enough information for makedumpdile to get phys_base. It's >> >> >> > >>correct. The problem here is that other crash dump mechanisms that >> >> >> > >>run >> >> >> > >>outside Linux kernel independently don't have information to get >> >> >> > >>phys_base. >> >> >> > > >> >> >> > >Yes, but these mechanisms won't be able to read VMCOREINFO either, >> >> >> > >will >> >> >> > >they? >> >> >> > > >> >> >> > >> >> >> > I don't intend such sophisticated function only by VMCOREINFO. >> >> >> > Search vmcore for VMCOREINFO using strings + grep before opening it >> >> >> > by crash. >> >> >> > I intend that only here. >> >> >> >> >> >> I think this is very crude and not proper way to get to vmcoreinfo. >> >> > >> >> > Same here. If VMCOREINFO must be locatable without communicating any >> >> > information to the hypervisor, then I would rather go for something >> >> > similar to what s390(x) folks do - a well-known location in physical >> >> > memory that contains a pointer to a checksummed OS info struct
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik ptesa...@suse.cz Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Fri, 14 Nov 2014 13:36:10 +0100 On Fri, 14 Nov 2014 18:54:23 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Petr Tesarik ptesa...@suse.cz Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Fri, 14 Nov 2014 09:31:45 +0100 On Fri, 14 Nov 2014 10:42:35 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Petr Tesarik ptesa...@suse.cz Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Thu, 13 Nov 2014 15:48:10 +0100 On Thu, 13 Nov 2014 09:25:48 -0500 Vivek Goyal vgo...@redhat.com wrote: On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: (2014/11/13 17:06), Petr Tesarik wrote: On Thu, 13 Nov 2014 09:17:09 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Hi Hatayama, /proc/vmcore ELF headers have virtual address information and using that you should be able to read actual value of phys_base. gdb deals with virtual addresses all the time and can read value of any symbol using those headers. So I am not sure what's the need for exporting actual value of phys_base. Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Yes, but these mechanisms won't be able to read VMCOREINFO either, will they? I don't intend such sophisticated function only by VMCOREINFO. Search vmcore for VMCOREINFO using strings + grep before opening it by crash. I intend that only here. I think this is very crude and not proper way to get to vmcoreinfo. Same here. If VMCOREINFO must be locatable without communicating any information to the hypervisor, then I would rather go for something similar to what s390(x) folks do - a well-known location in physical memory that contains a pointer to a checksummed OS info structure, which in turn contains the VMCOREINFO pointers. I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP. Or is that part of the current plan, Daisuke? It's useful if there is. I don't plan now. For now, the idea of this patch is enough for me. BTW, for the above idea, I suspect that if the location in the physical memory is unique, it cannot deal with the kdump 2nd kernel case. No, not at all. The low 640K are copied away to a pre-allocated area by kexec purgatory code on x86_64, so it's safe to overwrite any location in there. The copy is needed, because BIOS already uses some hardcoded addresses in that range. I think the Linux kernel may safely use part of PFN 0 starting at physical address 0x0500. This area was originally used by MS-DOS, so chances are high that no broken BIOS out there corrupts this part of RAM... In fact, I didn't consider in such deep way... I had forgot back up region at all. But it's hard to use the low 640K area. Then, it's hard to get phys_base of the kdump 1st kernel that is assumed to be saved in thw low 640K now. Because externally running mechanism can run after kdump 2nd kernel has booted up, crash utility needs to convert a read request to the low 640K area into the corresponding part of the pre-allocated area. See kdump_backup_region_init() in crash utility, which tries to find the pre-allocated area via ELF header, where symbol kexec_crash_image is read to find ELF header. This means we need phys_base to find the pre-allocated area. Wrong again, I'm afraid. So, first of all, an admin should make up your mind if you want to use kexec-based dumping, or stand-alone dumping. OK, you seem to address a corner case when s/he configures both. But in that case, the It's a never corner case. We usually use both. There's difference in data reliability between kdump and others in that kdump can do cleanup in kernel logic level at the end of the kdump 1st kernel prior to kdump 2nd kernel, and difference in dumping feature that there's makedumpfile that can filter memory
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Fri, 14 Nov 2014 09:31:45 +0100 > On Fri, 14 Nov 2014 10:42:35 +0900 (JST) > HATAYAMA Daisuke wrote: > >> From: Petr Tesarik >> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in >> VMCOREINFO >> Date: Thu, 13 Nov 2014 15:48:10 +0100 >> >> > On Thu, 13 Nov 2014 09:25:48 -0500 >> > Vivek Goyal wrote: >> > >> >> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: >> >> > >> >> > (2014/11/13 17:06), Petr Tesarik wrote: >> >> > >On Thu, 13 Nov 2014 09:17:09 +0900 (JST) >> >> > >HATAYAMA Daisuke wrote: >> >> > > >> >> > >>From: Vivek Goyal >> >> > >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in >> >> > >>VMCOREINFO >> >> > >>Date: Wed, 12 Nov 2014 17:12:05 -0500 >> >> > >> >> >> > >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: >> >> > >>>>Currently, VMCOREINFO note information reports the virtual address >> >> > >>>>of >> >> > >>>>phys_base that is assigned to symbol phys_base. But this doesn't >> >> > >>>>make >> >> > >>>>sense because to refer to value of the phys_base, it's necessary to >> >> > >>>>get the value of phys_base itself we are now about to refer to. >> >> > >>>> >> >> > >>> >> >> > >>>Hi Hatayama, >> >> > >>> >> >> > >>>/proc/vmcore ELF headers have virtual address information and using >> >> > >>>that you should be able to read actual value of phys_base. gdb deals >> >> > >>>with virtual addresses all the time and can read value of any symbol >> >> > >>>using those headers. >> >> > >>> >> >> > >>>So I am not sure what's the need for exporting actual value of >> >> > >>>phys_base. >> >> > >>> >> >> > >> >> >> > >>Sorry, my logic in the patch description was wrong. For /proc/vmcore, >> >> > >>there's enough information for makedumpdile to get phys_base. It's >> >> > >>correct. The problem here is that other crash dump mechanisms that run >> >> > >>outside Linux kernel independently don't have information to get >> >> > >>phys_base. >> >> > > >> >> > >Yes, but these mechanisms won't be able to read VMCOREINFO either, will >> >> > >they? >> >> > > >> >> > >> >> > I don't intend such sophisticated function only by VMCOREINFO. >> >> > Search vmcore for VMCOREINFO using strings + grep before opening it by >> >> > crash. >> >> > I intend that only here. >> >> >> >> I think this is very crude and not proper way to get to vmcoreinfo. >> > >> > Same here. If VMCOREINFO must be locatable without communicating any >> > information to the hypervisor, then I would rather go for something >> > similar to what s390(x) folks do - a well-known location in physical >> > memory that contains a pointer to a checksummed OS info structure, >> > which in turn contains the VMCOREINFO pointers. >> > >> > I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP. >> > Or is that part of the current plan, Daisuke? >> > >> >> It's useful if there is. I don't plan now. For now, the idea of this >> patch is enough for me. >> >> BTW, for the above idea, I suspect that if the location in the >> physical memory is unique, it cannot deal with the kdump 2nd kernel >> case. > > No, not at all. The low 640K are copied away to a pre-allocated area by > kexec purgatory code on x86_64, so it's safe to overwrite any location > in there. The copy is needed, because BIOS already uses some hardcoded > addresses in that range. I think the Linux kernel may safely use part of > PFN 0 starting at physical address 0x0500. This area was originally > used by MS-DOS, so chances are high that no broken BIOS out there > corrupts this part of RAM... > In fact, I didn't consider in such deep way... I had forgot back up region at all. But it's hard to use the low 640K area. Then, it's hard to get phys_base of the kdump 1st kernel that is assumed to be saved in thw low 640K now. Because externally running mechanism can run after kdump 2nd kernel has booted up, crash utility needs to convert a read request to the low 640K area into the corresponding part of the pre-allocated area. See kdump_backup_region_init() in crash utility, which tries to find the pre-allocated area via ELF header, where symbol kexec_crash_image is read to find ELF header. This means we need phys_base to find the pre-allocated area. > Anyway, I'm not going to implement it right now for lack of time. I'm > adding it to my TODO list, but if anybody wants to post a patch, I > won't be offended. > > Petr T -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik ptesa...@suse.cz Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Fri, 14 Nov 2014 09:31:45 +0100 On Fri, 14 Nov 2014 10:42:35 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Petr Tesarik ptesa...@suse.cz Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Thu, 13 Nov 2014 15:48:10 +0100 On Thu, 13 Nov 2014 09:25:48 -0500 Vivek Goyal vgo...@redhat.com wrote: On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: (2014/11/13 17:06), Petr Tesarik wrote: On Thu, 13 Nov 2014 09:17:09 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Hi Hatayama, /proc/vmcore ELF headers have virtual address information and using that you should be able to read actual value of phys_base. gdb deals with virtual addresses all the time and can read value of any symbol using those headers. So I am not sure what's the need for exporting actual value of phys_base. Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Yes, but these mechanisms won't be able to read VMCOREINFO either, will they? I don't intend such sophisticated function only by VMCOREINFO. Search vmcore for VMCOREINFO using strings + grep before opening it by crash. I intend that only here. I think this is very crude and not proper way to get to vmcoreinfo. Same here. If VMCOREINFO must be locatable without communicating any information to the hypervisor, then I would rather go for something similar to what s390(x) folks do - a well-known location in physical memory that contains a pointer to a checksummed OS info structure, which in turn contains the VMCOREINFO pointers. I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP. Or is that part of the current plan, Daisuke? It's useful if there is. I don't plan now. For now, the idea of this patch is enough for me. BTW, for the above idea, I suspect that if the location in the physical memory is unique, it cannot deal with the kdump 2nd kernel case. No, not at all. The low 640K are copied away to a pre-allocated area by kexec purgatory code on x86_64, so it's safe to overwrite any location in there. The copy is needed, because BIOS already uses some hardcoded addresses in that range. I think the Linux kernel may safely use part of PFN 0 starting at physical address 0x0500. This area was originally used by MS-DOS, so chances are high that no broken BIOS out there corrupts this part of RAM... In fact, I didn't consider in such deep way... I had forgot back up region at all. But it's hard to use the low 640K area. Then, it's hard to get phys_base of the kdump 1st kernel that is assumed to be saved in thw low 640K now. Because externally running mechanism can run after kdump 2nd kernel has booted up, crash utility needs to convert a read request to the low 640K area into the corresponding part of the pre-allocated area. See kdump_backup_region_init() in crash utility, which tries to find the pre-allocated area via ELF header, where symbol kexec_crash_image is read to find ELF header. This means we need phys_base to find the pre-allocated area. Anyway, I'm not going to implement it right now for lack of time. I'm adding it to my TODO list, but if anybody wants to post a patch, I won't be offended. Petr T -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Thu, 13 Nov 2014 15:48:10 +0100 > On Thu, 13 Nov 2014 09:25:48 -0500 > Vivek Goyal wrote: > >> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: >> > >> > (2014/11/13 17:06), Petr Tesarik wrote: >> > >On Thu, 13 Nov 2014 09:17:09 +0900 (JST) >> > >HATAYAMA Daisuke wrote: >> > > >> > >>From: Vivek Goyal >> > >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in >> > >>VMCOREINFO >> > >>Date: Wed, 12 Nov 2014 17:12:05 -0500 >> > >> >> > >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: >> > >>>>Currently, VMCOREINFO note information reports the virtual address of >> > >>>>phys_base that is assigned to symbol phys_base. But this doesn't make >> > >>>>sense because to refer to value of the phys_base, it's necessary to >> > >>>>get the value of phys_base itself we are now about to refer to. >> > >>>> >> > >>> >> > >>>Hi Hatayama, >> > >>> >> > >>>/proc/vmcore ELF headers have virtual address information and using >> > >>>that you should be able to read actual value of phys_base. gdb deals >> > >>>with virtual addresses all the time and can read value of any symbol >> > >>>using those headers. >> > >>> >> > >>>So I am not sure what's the need for exporting actual value of >> > >>>phys_base. >> > >>> >> > >> >> > >>Sorry, my logic in the patch description was wrong. For /proc/vmcore, >> > >>there's enough information for makedumpdile to get phys_base. It's >> > >>correct. The problem here is that other crash dump mechanisms that run >> > >>outside Linux kernel independently don't have information to get >> > >>phys_base. >> > > >> > >Yes, but these mechanisms won't be able to read VMCOREINFO either, will >> > >they? >> > > >> > >> > I don't intend such sophisticated function only by VMCOREINFO. >> > Search vmcore for VMCOREINFO using strings + grep before opening it by >> > crash. >> > I intend that only here. >> >> I think this is very crude and not proper way to get to vmcoreinfo. > > Same here. If VMCOREINFO must be locatable without communicating any > information to the hypervisor, then I would rather go for something > similar to what s390(x) folks do - a well-known location in physical > memory that contains a pointer to a checksummed OS info structure, > which in turn contains the VMCOREINFO pointers. > > I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP. > Or is that part of the current plan, Daisuke? > It's useful if there is. I don't plan now. For now, the idea of this patch is enough for me. BTW, for the above idea, I suspect that if the location in the physical memory is unique, it cannot deal with the kdump 2nd kernel case. I think it better for the idea to be able to represent multiple kernel information. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Vivek Goyal Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Thu, 13 Nov 2014 09:25:48 -0500 > On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: >> >> >> (2014/11/13 17:06), Petr Tesarik wrote: >> >On Thu, 13 Nov 2014 09:17:09 +0900 (JST) >> >HATAYAMA Daisuke wrote: >> > >> >>From: Vivek Goyal >> >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in >> >>VMCOREINFO >> >>Date: Wed, 12 Nov 2014 17:12:05 -0500 >> >> >> >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: >> >>>>Currently, VMCOREINFO note information reports the virtual address of >> >>>>phys_base that is assigned to symbol phys_base. But this doesn't make >> >>>>sense because to refer to value of the phys_base, it's necessary to >> >>>>get the value of phys_base itself we are now about to refer to. >> >>>> >> >>> >> >>>Hi Hatayama, >> >>> >> >>>/proc/vmcore ELF headers have virtual address information and using >> >>>that you should be able to read actual value of phys_base. gdb deals >> >>>with virtual addresses all the time and can read value of any symbol >> >>>using those headers. >> >>> >> >>>So I am not sure what's the need for exporting actual value of >> >>>phys_base. >> >>> >> >> >> >>Sorry, my logic in the patch description was wrong. For /proc/vmcore, >> >>there's enough information for makedumpdile to get phys_base. It's >> >>correct. The problem here is that other crash dump mechanisms that run >> >>outside Linux kernel independently don't have information to get >> >>phys_base. >> > >> >Yes, but these mechanisms won't be able to read VMCOREINFO either, will >> >they? >> > >> >> I don't intend such sophisticated function only by VMCOREINFO. >> Search vmcore for VMCOREINFO using strings + grep before opening it by crash. >> I intend that only here. > > I think this is very crude and not proper way to get to vmcoreinfo. Can I agree it's crude, but it's useful enough for my usecase. > you give more context. What are those mechanisms and what are you trying > to do. > I after all write the same thing in the patch description... I mean qemu dump, xendump (and other hypervisor dumps), firmware dumps implemented on each vendor system for the crash dump mechanism. But there's no prepared information now. So, crash utility has made effort to calculate phys_base from the existing information. One example is to check if how far linux_banner string is apart from where it actually is in vmcore, and another is to walk page tables via CR4 register if it is included in note information of the crash dump mechanism's environment. There's no gurantee that these always work well. For example, because these crash dump mechanisms run independently of Linux kernel, it could be when CPU moves to some firmware code. Then, CR4 could point at differnet page table. Looking at VMCOREINFO, there's already phys_base information but it's symbol value only and it doesn't make sense. So it should be corrected. I think this is natural. Also, I think it important to keep externally running crash dump mechansms as independent as possible. If they depend on linux kernel, it could reduce robustoness of their functionality. In case of phys_base, if we added a mechanism to get phys_base, it would mean that the crash dump mechisms doesn't work well until they got phys_base. For robustness, I think it best to make Linux kernel put necessary information anywhere and to make users to use them independently. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
(2014/11/13 17:06), Petr Tesarik wrote: On Thu, 13 Nov 2014 09:17:09 +0900 (JST) HATAYAMA Daisuke wrote: From: Vivek Goyal Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Hi Hatayama, /proc/vmcore ELF headers have virtual address information and using that you should be able to read actual value of phys_base. gdb deals with virtual addresses all the time and can read value of any symbol using those headers. So I am not sure what's the need for exporting actual value of phys_base. Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Yes, but these mechanisms won't be able to read VMCOREINFO either, will they? I don't intend such sophisticated function only by VMCOREINFO. Search vmcore for VMCOREINFO using strings + grep before opening it by crash. I intend that only here. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
(2014/11/13 17:06), Petr Tesarik wrote: On Thu, 13 Nov 2014 09:17:09 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Hi Hatayama, /proc/vmcore ELF headers have virtual address information and using that you should be able to read actual value of phys_base. gdb deals with virtual addresses all the time and can read value of any symbol using those headers. So I am not sure what's the need for exporting actual value of phys_base. Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Yes, but these mechanisms won't be able to read VMCOREINFO either, will they? I don't intend such sophisticated function only by VMCOREINFO. Search vmcore for VMCOREINFO using strings + grep before opening it by crash. I intend that only here. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Thu, 13 Nov 2014 09:25:48 -0500 On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: (2014/11/13 17:06), Petr Tesarik wrote: On Thu, 13 Nov 2014 09:17:09 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Hi Hatayama, /proc/vmcore ELF headers have virtual address information and using that you should be able to read actual value of phys_base. gdb deals with virtual addresses all the time and can read value of any symbol using those headers. So I am not sure what's the need for exporting actual value of phys_base. Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Yes, but these mechanisms won't be able to read VMCOREINFO either, will they? I don't intend such sophisticated function only by VMCOREINFO. Search vmcore for VMCOREINFO using strings + grep before opening it by crash. I intend that only here. I think this is very crude and not proper way to get to vmcoreinfo. Can I agree it's crude, but it's useful enough for my usecase. you give more context. What are those mechanisms and what are you trying to do. I after all write the same thing in the patch description... I mean qemu dump, xendump (and other hypervisor dumps), firmware dumps implemented on each vendor system for the crash dump mechanism. But there's no prepared information now. So, crash utility has made effort to calculate phys_base from the existing information. One example is to check if how far linux_banner string is apart from where it actually is in vmcore, and another is to walk page tables via CR4 register if it is included in note information of the crash dump mechanism's environment. There's no gurantee that these always work well. For example, because these crash dump mechanisms run independently of Linux kernel, it could be when CPU moves to some firmware code. Then, CR4 could point at differnet page table. Looking at VMCOREINFO, there's already phys_base information but it's symbol value only and it doesn't make sense. So it should be corrected. I think this is natural. Also, I think it important to keep externally running crash dump mechansms as independent as possible. If they depend on linux kernel, it could reduce robustoness of their functionality. In case of phys_base, if we added a mechanism to get phys_base, it would mean that the crash dump mechisms doesn't work well until they got phys_base. For robustness, I think it best to make Linux kernel put necessary information anywhere and to make users to use them independently. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik ptesa...@suse.cz Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Thu, 13 Nov 2014 15:48:10 +0100 On Thu, 13 Nov 2014 09:25:48 -0500 Vivek Goyal vgo...@redhat.com wrote: On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote: (2014/11/13 17:06), Petr Tesarik wrote: On Thu, 13 Nov 2014 09:17:09 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Hi Hatayama, /proc/vmcore ELF headers have virtual address information and using that you should be able to read actual value of phys_base. gdb deals with virtual addresses all the time and can read value of any symbol using those headers. So I am not sure what's the need for exporting actual value of phys_base. Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Yes, but these mechanisms won't be able to read VMCOREINFO either, will they? I don't intend such sophisticated function only by VMCOREINFO. Search vmcore for VMCOREINFO using strings + grep before opening it by crash. I intend that only here. I think this is very crude and not proper way to get to vmcoreinfo. Same here. If VMCOREINFO must be locatable without communicating any information to the hypervisor, then I would rather go for something similar to what s390(x) folks do - a well-known location in physical memory that contains a pointer to a checksummed OS info structure, which in turn contains the VMCOREINFO pointers. I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP. Or is that part of the current plan, Daisuke? It's useful if there is. I don't plan now. For now, the idea of this patch is enough for me. BTW, for the above idea, I suspect that if the location in the physical memory is unique, it cannot deal with the kdump 2nd kernel case. I think it better for the idea to be able to represent multiple kernel information. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] kdump, vmcoreinfo: report actual value of phys_base
Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Userland tools related to kdump such as makedumpfile and crash utility so far have made some efforts to calculate phys_base on crash dump formats generated by mechanisms running outside Linux kernel, such as virtual machine hypervisor such as qemu dump, which ordinary users use via virsh dump, or ones implemented on vendor specific firmware. That is, find a kernel data whose virtual and physical addresses are available via its note information and calculate phys_base from it. However, such data structure is not the one prepared for phys_base purpose. There's no guarantee that other crash dump mechanisms include such information that can be used to calculate phys_base similarly. To get VMCOREINFO in vmcore, it's easy to use strings and grep commands like this; VMCOREINFO consists of simple string: $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100 VMCOREINFO OSRELEASE=3.10.0-121.el7.x86_64 PAGESIZE=4096 ... This is also useful to get value of phys_base in kdump 2nd kernel contained in vmcore using the above-mentioned external crash dump mechanism; kdump 2nd kernel is an inherently relocated kernel. This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because makedumpfile refers to it and if removing it, old versions makedumpfile doesn't work well. ChangeLog: v2: - Introduce VMCOREINFO_PHYS_BASE(). - Correct patch description: this work is primarily for mechanisms running outside (kdump 1st) Linux kernel. Signed-off-by: HATAYAMA Daisuke --- arch/x86/kernel/machine_kexec_64.c | 1 + include/linux/kexec.h | 2 ++ 2 files changed, 3 insertions(+) diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 4859810..47cc835 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void) #endif vmcoreinfo_append_str("KERNELOFFSET=%lx\n", (unsigned long)&_text - __START_KERNEL); + VMCOREINFO_PHYS_BASE(phys_base); } /* arch-dependent functionality related to kexec file-based syscall */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 9d957b7..bee3c5b 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -258,6 +258,8 @@ unsigned long paddr_vmcoreinfo_note(void); vmcoreinfo_append_str("NUMBER(%s)=%ld\n", #name, (long)name) #define VMCOREINFO_CONFIG(name) \ vmcoreinfo_append_str("CONFIG_%s=y\n", #name) +#define VMCOREINFO_PHYS_BASE(value) \ + vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value) extern struct kimage *kexec_image; extern struct kimage *kexec_crash_image; -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 09:14:34 +0100 > On Wed, 12 Nov 2014 15:40:42 +0900 (JST) > HATAYAMA Daisuke wrote: > >> Currently, VMCOREINFO note information reports the virtual address of >> phys_base that is assigned to symbol phys_base. But this doesn't make >> sense because to refer to value of the phys_base, it's necessary to >> get the value of phys_base itself we are now about to refer to. >> >> Userland tools related to kdump such as makedumpfile and crash utility >> so far have made some efforts to calculate phys_base from memory >> mapping information on a variety of crash dump formats. But there's no >> guarantee to keep maintaining it in the future. >> >> This is also useful for crash dump mechanism running outside Linux >> kernel such as virtual machine hypervisor such as qemu dump, which >> ordinary users use via virsh dump, or ones implemented on vendor >> specific firmware. They cannot get phys_base without special mechanism >> because phys_base is kernel information. >> >> To get VMCOREINFO in vmcore, it's easy to use strings and grep >> commands like this; VMCOREINFO consists of simple string: >> >> $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100 >> VMCOREINFO >> OSRELEASE=3.10.0-121.el7.x86_64 >> PAGESIZE=4096 >> ... >> >> Similarly, this is also useful to get value of phys_base in kdump 2nd >> kernel contained in vmcore using the above-mentioned external crash >> dump mechanism; kdump 2nd kernel is an inherently relocated kernel. >> >> This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because >> makedumpfile refers to it and if removing it, old versions >> makedumpfile doesn't work well. >> >> Signed-off-by: HATAYAMA Daisuke >> --- >> arch/x86/kernel/machine_kexec_64.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/arch/x86/kernel/machine_kexec_64.c >> b/arch/x86/kernel/machine_kexec_64.c >> index 4859810..e6d00a4 100644 >> --- a/arch/x86/kernel/machine_kexec_64.c >> +++ b/arch/x86/kernel/machine_kexec_64.c >> @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void) >> #endif >> vmcoreinfo_append_str("KERNELOFFSET=%lx\n", >>(unsigned long)&_text - __START_KERNEL); >> +VMCOREINFO_LENGTH(phys_base, phys_base); > > While I fully agree with the concept, I don't like the use of > VMCOREINFO_LENGTH. LENGTH(symbol) has been used to store array length > in VMCOREINFO. > > OTOH there is currently no good syntax for storing a value (short of > VMCOREINFO_NUMBER, but that one is signed). I think it would be best to > extend the VMCOREINFO syntax for storing variables, preferably in > hexadecimal, e.g.: > > #define VMCOREINFO_VALUE(name, value) \ > vmcoreinfo_append_str("VALUE(%s)=%lx\n", #name, (unsigned long) value) > > This interface is somewhat suboptimal, because it can only store a > single long value. So, maybe we should dump the complete variable in > hex, or something similar... > > Anyone has a better idea? > > Petr T Just as you say, it's natural to write value of phys_base in hexadecimal format. For VMCOREINFO_VALUE, looking at the current helper macros, it seems to me more natural to make it have more specific name: #define VMCOREINFO_PHYS_BASE(value) \ vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long) value) -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Vivek Goyal Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 > On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: >> Currently, VMCOREINFO note information reports the virtual address of >> phys_base that is assigned to symbol phys_base. But this doesn't make >> sense because to refer to value of the phys_base, it's necessary to >> get the value of phys_base itself we are now about to refer to. >> > > Hi Hatayama, > > /proc/vmcore ELF headers have virtual address information and using > that you should be able to read actual value of phys_base. gdb deals > with virtual addresses all the time and can read value of any symbol > using those headers. > > So I am not sure what's the need for exporting actual value of > phys_base. > Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 17:12:05 -0500 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Hi Hatayama, /proc/vmcore ELF headers have virtual address information and using that you should be able to read actual value of phys_base. gdb deals with virtual addresses all the time and can read value of any symbol using those headers. So I am not sure what's the need for exporting actual value of phys_base. Sorry, my logic in the patch description was wrong. For /proc/vmcore, there's enough information for makedumpdile to get phys_base. It's correct. The problem here is that other crash dump mechanisms that run outside Linux kernel independently don't have information to get phys_base. Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
From: Petr Tesarik ptesa...@suse.cz Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO Date: Wed, 12 Nov 2014 09:14:34 +0100 On Wed, 12 Nov 2014 15:40:42 +0900 (JST) HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote: Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Userland tools related to kdump such as makedumpfile and crash utility so far have made some efforts to calculate phys_base from memory mapping information on a variety of crash dump formats. But there's no guarantee to keep maintaining it in the future. This is also useful for crash dump mechanism running outside Linux kernel such as virtual machine hypervisor such as qemu dump, which ordinary users use via virsh dump, or ones implemented on vendor specific firmware. They cannot get phys_base without special mechanism because phys_base is kernel information. To get VMCOREINFO in vmcore, it's easy to use strings and grep commands like this; VMCOREINFO consists of simple string: $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E .*VMCOREINFO.* -A 100 VMCOREINFO OSRELEASE=3.10.0-121.el7.x86_64 PAGESIZE=4096 ... Similarly, this is also useful to get value of phys_base in kdump 2nd kernel contained in vmcore using the above-mentioned external crash dump mechanism; kdump 2nd kernel is an inherently relocated kernel. This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because makedumpfile refers to it and if removing it, old versions makedumpfile doesn't work well. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- arch/x86/kernel/machine_kexec_64.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 4859810..e6d00a4 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void) #endif vmcoreinfo_append_str(KERNELOFFSET=%lx\n, (unsigned long)_text - __START_KERNEL); +VMCOREINFO_LENGTH(phys_base, phys_base); While I fully agree with the concept, I don't like the use of VMCOREINFO_LENGTH. LENGTH(symbol) has been used to store array length in VMCOREINFO. OTOH there is currently no good syntax for storing a value (short of VMCOREINFO_NUMBER, but that one is signed). I think it would be best to extend the VMCOREINFO syntax for storing variables, preferably in hexadecimal, e.g.: #define VMCOREINFO_VALUE(name, value) \ vmcoreinfo_append_str(VALUE(%s)=%lx\n, #name, (unsigned long) value) This interface is somewhat suboptimal, because it can only store a single long value. So, maybe we should dump the complete variable in hex, or something similar... Anyone has a better idea? Petr T Just as you say, it's natural to write value of phys_base in hexadecimal format. For VMCOREINFO_VALUE, looking at the current helper macros, it seems to me more natural to make it have more specific name: #define VMCOREINFO_PHYS_BASE(value) \ vmcoreinfo_append_str(PHYS_BASE=%lx\n, (unsigned long) value) -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] kdump, vmcoreinfo: report actual value of phys_base
Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Userland tools related to kdump such as makedumpfile and crash utility so far have made some efforts to calculate phys_base on crash dump formats generated by mechanisms running outside Linux kernel, such as virtual machine hypervisor such as qemu dump, which ordinary users use via virsh dump, or ones implemented on vendor specific firmware. That is, find a kernel data whose virtual and physical addresses are available via its note information and calculate phys_base from it. However, such data structure is not the one prepared for phys_base purpose. There's no guarantee that other crash dump mechanisms include such information that can be used to calculate phys_base similarly. To get VMCOREINFO in vmcore, it's easy to use strings and grep commands like this; VMCOREINFO consists of simple string: $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E .*VMCOREINFO.* -A 100 VMCOREINFO OSRELEASE=3.10.0-121.el7.x86_64 PAGESIZE=4096 ... This is also useful to get value of phys_base in kdump 2nd kernel contained in vmcore using the above-mentioned external crash dump mechanism; kdump 2nd kernel is an inherently relocated kernel. This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because makedumpfile refers to it and if removing it, old versions makedumpfile doesn't work well. ChangeLog: v2: - Introduce VMCOREINFO_PHYS_BASE(). - Correct patch description: this work is primarily for mechanisms running outside (kdump 1st) Linux kernel. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- arch/x86/kernel/machine_kexec_64.c | 1 + include/linux/kexec.h | 2 ++ 2 files changed, 3 insertions(+) diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 4859810..47cc835 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void) #endif vmcoreinfo_append_str(KERNELOFFSET=%lx\n, (unsigned long)_text - __START_KERNEL); + VMCOREINFO_PHYS_BASE(phys_base); } /* arch-dependent functionality related to kexec file-based syscall */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 9d957b7..bee3c5b 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -258,6 +258,8 @@ unsigned long paddr_vmcoreinfo_note(void); vmcoreinfo_append_str(NUMBER(%s)=%ld\n, #name, (long)name) #define VMCOREINFO_CONFIG(name) \ vmcoreinfo_append_str(CONFIG_%s=y\n, #name) +#define VMCOREINFO_PHYS_BASE(value) \ + vmcoreinfo_append_str(PHYS_BASE=%lx\n, (unsigned long)value) extern struct kimage *kexec_image; extern struct kimage *kexec_crash_image; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Userland tools related to kdump such as makedumpfile and crash utility so far have made some efforts to calculate phys_base from memory mapping information on a variety of crash dump formats. But there's no guarantee to keep maintaining it in the future. This is also useful for crash dump mechanism running outside Linux kernel such as virtual machine hypervisor such as qemu dump, which ordinary users use via virsh dump, or ones implemented on vendor specific firmware. They cannot get phys_base without special mechanism because phys_base is kernel information. To get VMCOREINFO in vmcore, it's easy to use strings and grep commands like this; VMCOREINFO consists of simple string: $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100 VMCOREINFO OSRELEASE=3.10.0-121.el7.x86_64 PAGESIZE=4096 ... Similarly, this is also useful to get value of phys_base in kdump 2nd kernel contained in vmcore using the above-mentioned external crash dump mechanism; kdump 2nd kernel is an inherently relocated kernel. This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because makedumpfile refers to it and if removing it, old versions makedumpfile doesn't work well. Signed-off-by: HATAYAMA Daisuke --- arch/x86/kernel/machine_kexec_64.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 4859810..e6d00a4 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void) #endif vmcoreinfo_append_str("KERNELOFFSET=%lx\n", (unsigned long)&_text - __START_KERNEL); + VMCOREINFO_LENGTH(phys_base, phys_base); } /* arch-dependent functionality related to kexec file-based syscall */ -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Currently, VMCOREINFO note information reports the virtual address of phys_base that is assigned to symbol phys_base. But this doesn't make sense because to refer to value of the phys_base, it's necessary to get the value of phys_base itself we are now about to refer to. Userland tools related to kdump such as makedumpfile and crash utility so far have made some efforts to calculate phys_base from memory mapping information on a variety of crash dump formats. But there's no guarantee to keep maintaining it in the future. This is also useful for crash dump mechanism running outside Linux kernel such as virtual machine hypervisor such as qemu dump, which ordinary users use via virsh dump, or ones implemented on vendor specific firmware. They cannot get phys_base without special mechanism because phys_base is kernel information. To get VMCOREINFO in vmcore, it's easy to use strings and grep commands like this; VMCOREINFO consists of simple string: $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E .*VMCOREINFO.* -A 100 VMCOREINFO OSRELEASE=3.10.0-121.el7.x86_64 PAGESIZE=4096 ... Similarly, this is also useful to get value of phys_base in kdump 2nd kernel contained in vmcore using the above-mentioned external crash dump mechanism; kdump 2nd kernel is an inherently relocated kernel. This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because makedumpfile refers to it and if removing it, old versions makedumpfile doesn't work well. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- arch/x86/kernel/machine_kexec_64.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 4859810..e6d00a4 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void) #endif vmcoreinfo_append_str(KERNELOFFSET=%lx\n, (unsigned long)_text - __START_KERNEL); + VMCOREINFO_LENGTH(phys_base, phys_base); } /* arch-dependent functionality related to kexec file-based syscall */ -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors
_start = pos + 1; > + } > + } > + if (pos > pos_start) { > + /* Remap the rest */ > + map_size = (pos - pos_start) << PAGE_SHIFT; > + if (remap_oldmem_pfn_range(vma, from + len, pos_start, > +map_size, prot)) > + goto fail; > + len += map_size; > + } > + return 0; > +fail: > + do_munmap(vma->vm_mm, from, len); > + return -EAGAIN; > +} > + > +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma, > + unsigned long from, unsigned long pfn, > + unsigned long size, pgprot_t prot) > +{ > + /* > + * Check if oldmem_pfn_is_ram was registered to avoid > + * looping over all pages without a reason. > + */ > + if (oldmem_pfn_is_ram) > + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot); > + else > + return remap_oldmem_pfn_range(vma, from, pfn, size, prot); > +} > + > static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) > { > size_t size = vma->vm_end - vma->vm_start; > @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct > vm_area_struct *vma) > > tsz = min_t(size_t, m->offset + m->size - start, size); > paddr = m->paddr + start - m->offset; > - if (remap_oldmem_pfn_range(vma, vma->vm_start + len, > - paddr >> PAGE_SHIFT, tsz, > -vma->vm_page_prot)) > + if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len, > + paddr >> PAGE_SHIFT, tsz, > + vma->vm_page_prot)) > goto fail; > size -= tsz; > start += tsz; > -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors
: + do_munmap(vma-vm_mm, from, len); + return -EAGAIN; +} + +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma, + unsigned long from, unsigned long pfn, + unsigned long size, pgprot_t prot) +{ + /* + * Check if oldmem_pfn_is_ram was registered to avoid + * looping over all pages without a reason. + */ + if (oldmem_pfn_is_ram) + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot); + else + return remap_oldmem_pfn_range(vma, from, pfn, size, prot); +} + static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) { size_t size = vma-vm_end - vma-vm_start; @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) tsz = min_t(size_t, m-offset + m-size - start, size); paddr = m-paddr + start - m-offset; - if (remap_oldmem_pfn_range(vma, vma-vm_start + len, -paddr PAGE_SHIFT, tsz, -vma-vm_page_prot)) + if (vmcore_remap_oldmem_pfn(vma, vma-vm_start + len, + paddr PAGE_SHIFT, tsz, + vma-vm_page_prot)) goto fail; size -= tsz; start += tsz; -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4] mmap_vmcore: skip non-ram pages reported by hypervisors
range(vma, from + len, pos_start, > +map_size, vma->vm_page_prot)) Is prot correct here? > + goto fail; > + len += map_size; > + } > + return 0; > +fail: > + do_munmap(vma->vm_mm, from, len); > + return -EAGAIN; > +} > + > +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma, > + unsigned long from, unsigned long pfn, > + unsigned long size, pgprot_t prot) > +{ > + /* > + * Check if oldmem_pfn_is_ram was registered to avoid > + * looping over all pages without a reason. > + */ > + if (oldmem_pfn_is_ram) > + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot); > + else > + return remap_oldmem_pfn_range(vma, from, pfn, size, prot); > +} > + > static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) > { > size_t size = vma->vm_end - vma->vm_start; > @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct > vm_area_struct *vma) > > tsz = min_t(size_t, m->offset + m->size - start, size); > paddr = m->paddr + start - m->offset; > - if (remap_oldmem_pfn_range(vma, vma->vm_start + len, > -paddr >> PAGE_SHIFT, tsz, > - vma->vm_page_prot)) > + if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len, > + paddr >> PAGE_SHIFT, tsz, > + vma->vm_page_prot)) > goto fail; > size -= tsz; > start += tsz; > It looks to me good other than the above two comments. Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4] mmap_vmcore: skip non-ram pages reported by hypervisors
, + unsigned long from, unsigned long pfn, + unsigned long size, pgprot_t prot) +{ + /* + * Check if oldmem_pfn_is_ram was registered to avoid + * looping over all pages without a reason. + */ + if (oldmem_pfn_is_ram) + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot); + else + return remap_oldmem_pfn_range(vma, from, pfn, size, prot); +} + static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) { size_t size = vma-vm_end - vma-vm_start; @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) tsz = min_t(size_t, m-offset + m-size - start, size); paddr = m-paddr + start - m-offset; - if (remap_oldmem_pfn_range(vma, vma-vm_start + len, -paddr PAGE_SHIFT, tsz, -vma-vm_page_prot)) + if (vmcore_remap_oldmem_pfn(vma, vma-vm_start + len, + paddr PAGE_SHIFT, tsz, + vma-vm_page_prot)) goto fail; size -= tsz; start += tsz; It looks to me good other than the above two comments. Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/urgent] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Commit-ID: b292d7a10487aee6e74b1c18b8d95b92f40d4a4f Gitweb: http://git.kernel.org/tip/b292d7a10487aee6e74b1c18b8d95b92f40d4a4f Author: HATAYAMA Daisuke AuthorDate: Wed, 25 Jun 2014 10:09:07 +0900 Committer: Ingo Molnar CommitDate: Wed, 2 Jul 2014 08:35:55 +0200 perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Currently, any NMI is falsely handled by a NMI handler of NMI watchdog if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. For example, we use external NMI to make system panic to get crash dump, but in this case, the external NMI is falsely handled do to the issue. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: "The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to indicate changes to the state of performancmonitoring hardware" In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke Acked-by: Don Zickus Signed-off-by: Peter Zijlstra Cc: Cc: Arnaldo Carvalho de Melo Cc: Linus Torvalds Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatay...@jp.fujitsu.com Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *))) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *))) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/urgent] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Commit-ID: b292d7a10487aee6e74b1c18b8d95b92f40d4a4f Gitweb: http://git.kernel.org/tip/b292d7a10487aee6e74b1c18b8d95b92f40d4a4f Author: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com AuthorDate: Wed, 25 Jun 2014 10:09:07 +0900 Committer: Ingo Molnar mi...@kernel.org CommitDate: Wed, 2 Jul 2014 08:35:55 +0200 perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Currently, any NMI is falsely handled by a NMI handler of NMI watchdog if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. For example, we use external NMI to make system panic to get crash dump, but in this case, the external NMI is falsely handled do to the issue. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to indicate changes to the state of performancmonitoring hardware In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Acked-by: Don Zickus dzic...@redhat.com Signed-off-by: Peter Zijlstra pet...@infradead.org Cc: sta...@vger.kernel.org Cc: Arnaldo Carvalho de Melo a...@kernel.org Cc: Linus Torvalds torva...@linux-foundation.org Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatay...@jp.fujitsu.com Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/kernel/cpu/perf_event_intel.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *)status)) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *)status)) { -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
From: Andi Kleen Subject: Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Date: Mon, 30 Jun 2014 15:22:24 -0700 >> >> I'm also interested in the behaviour of CondChgd bit on Ivy Bridge >> processors. > > The intended meaning of CondChgd is that a hardware debugger has taken over > the > PMU. It shouldn't really be set in other circumstances. > Thanks for your explanation. The hardware debugger you mean is a kind of ITP? > I think right now for perf it would be best to just ignore it. > > In theory could stop using the PMU, but if some BIOS set it it would > completely disable perf there. So better to just ignore it. > I'll reflect this information in the patch description. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
From: Andi Kleen a...@linux.intel.com Subject: Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Date: Mon, 30 Jun 2014 15:22:24 -0700 I'm also interested in the behaviour of CondChgd bit on Ivy Bridge processors. The intended meaning of CondChgd is that a hardware debugger has taken over the PMU. It shouldn't really be set in other circumstances. Thanks for your explanation. The hardware debugger you mean is a kind of ITP? I think right now for perf it would be best to just ignore it. In theory could stop using the PMU, but if some BIOS set it it would completely disable perf there. So better to just ignore it. I'll reflect this information in the patch description. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Hello, (2014/06/17 0:30), Don Zickus wrote: On Fri, Jun 13, 2014 at 05:44:37PM +0900, HATAYAMA Daisuke wrote: Currently, a NMI handler for NMI watchdog may falsely handle any NMI signaled for different purpose if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: "The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performancmonitoring hardware" In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. As I said in a previous email, I ran into a similar problem and was going to solve it by zeroing out all the registers on init (which probably would have upset Peter :-) ). This is a smaller solution and seems ok. The only downside is it is called in the nmi handler. I am working with our customer to try and talk with Intel why this bit is set to begin with. Our customer says their BIOS doesn't use the PMU during boot so it wasn't clear why this is now set on IVBs (though I don't see them on Intel whitebox IVBs). I'm also interested in the behaviour of CondChgd bit on Ivy Bridge processors. Do you know something about this behaviour? -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Hello, (2014/06/17 0:30), Don Zickus wrote: On Fri, Jun 13, 2014 at 05:44:37PM +0900, HATAYAMA Daisuke wrote: Currently, a NMI handler for NMI watchdog may falsely handle any NMI signaled for different purpose if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performancmonitoring hardware In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. As I said in a previous email, I ran into a similar problem and was going to solve it by zeroing out all the registers on init (which probably would have upset Peter :-) ). This is a smaller solution and seems ok. The only downside is it is called in the nmi handler. I am working with our customer to try and talk with Intel why this bit is set to begin with. Our customer says their BIOS doesn't use the PMU during boot so it wasn't clear why this is now set on IVBs (though I don't see them on Intel whitebox IVBs). I'm also interested in the behaviour of CondChgd bit on Ivy Bridge processors. Do you know something about this behaviour? -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
This is a resend patch from the previous v2 patch since there's no response except for Don. ChangeLog v2 => v3 - Add Acked-by by Don Zickus - Rebase the patch based on v3.16-rc2 - Rewrite patch description a little v1 => v2) - Add reason and the fact I know on CondChgd in patch description From 63d6fa2a3dfa9ff7d1710c85d8d020cdc37f3b49 Mon Sep 17 00:00:00 2001 From: HATAYAMA Daisuke Date: Wed, 25 Jun 2014 10:09:07 +0900 Subject: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, any NMI is falsely handled by a NMI handler of NMI watchdog if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. For example, we use external NMI to make system panic to get crash dump, but in this case, the external NMI is falsely handled do to the issue. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: "The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performancmonitoring hardware" In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke Acked-by: Don Zickus --- arch/x86/kernel/cpu/perf_event_intel.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *))) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *))) { -- 1.9.3 N§²ζμrΈyϊθΨb²X¬ΆΗ§vΨ^)ήΊ{.nΗ+·₯{±κηzX§Ά‘ά¨}©²Ζ zΪ:+v¨Ύ«κηzZ+Κ+zf£’·h§~Ϋi�ϋΰzΉ�w₯’Έ?¨θΪ&’)ί’fω^jΗ«y§m α@A«aΆΪ� 0Άμh�εi
[PATCH v3] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
This is a resend patch from the previous v2 patch since there's no response except for Don. ChangeLog v2 = v3 - Add Acked-by by Don Zickus - Rebase the patch based on v3.16-rc2 - Rewrite patch description a little v1 = v2) - Add reason and the fact I know on CondChgd in patch description From 63d6fa2a3dfa9ff7d1710c85d8d020cdc37f3b49 Mon Sep 17 00:00:00 2001 From: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Date: Wed, 25 Jun 2014 10:09:07 +0900 Subject: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, any NMI is falsely handled by a NMI handler of NMI watchdog if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. For example, we use external NMI to make system panic to get crash dump, but in this case, the external NMI is falsely handled do to the issue. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performancmonitoring hardware In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Acked-by: Don Zickus dzic...@redhat.com --- arch/x86/kernel/cpu/perf_event_intel.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *)status)) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *)status)) { -- 1.9.3 N§²ζμrΈyϊθΨb²X¬ΆΗ§vΨ^)ήΊ{.nΗ+·₯{±κηzX§Ά‘ά¨}©²Ζ zΪj:+v¨Ύ«κηzZ+Κ+zf£’·h§~Ϋi�ϋΰzΉ�w₯’Έ?¨θΪ’)ί’fω^jΗ«y§m α@A«aΆΪ� 0Άμh�εi
[PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Currently, a NMI handler for NMI watchdog may falsely handle any NMI signaled for different purpose if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: "The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performancmonitoring hardware" In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke --- arch/x86/kernel/cpu/perf_event_intel.c |9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *))) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *))) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Currently, a NMI handler for NMI watchdog may falsely handle any NMI signaled for different purpose if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performancmonitoring hardware In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- arch/x86/kernel/cpu/perf_event_intel.c |9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *)status)) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *)status)) { -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
From: Peter Zijlstra Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Date: Wed, 11 Jun 2014 13:54:13 +0200 > On Wed, Jun 11, 2014 at 10:54:48AM +0200, Peter Zijlstra wrote: >> > I'm not sure about exact behavior of CondChgd bit, in particular when >> > this bit is set. Although I read Intel System Programmer's Manual to >> > figure out but I have yet completed that. At least, I think ignoring >> > CondChgd bit should be enough for NMI watchdog perspective. >> >> So yes, the SDM lists the bit as existing but never once mentions it >> outside of that, and its been doing that at least back to 2008. >> >> Ooh, I found it: >> >> "The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to >> indicate changes to the state of performance monitoring hardware (see >> Figure 18-29)." >> >> Which is of course completely useless, not to mention inconsistent with >> the later CondChgd name. >> >> HPA, can you explain wtf that bit does and why hatayama-san's ivb feels >> like having that set on boot? > > Matt found in the MSR listing for GLOBAL_STATUS: > > 63 CondChg: status bits of this register has changed. If CPUID.0AH: EAX[7:0] > > 0 > > Which brings us to a grand total of 3 different names for this bit. > > If it indeed does what it says on the tin, set every time the status > changes its like the most useless bit ever and I wonder why people > bothered to spend silicon on it. > Yes, I didn't mention in patch description this, but I reached the same conclusion. The description confuses me because the desciption and the behaviour of CondChg bit I see on the actual system is not coincide. Also, I checked cpuid on the system with Neharlem processor where I have never seen CondChg bit is set. [root@localhost ~]# ./cpuid -r CPU 0: 0x 0x00: eax=0x000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69 0x0001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff 0x000a 0x00: eax=0x07300403 ebx=0x0044 ecx=0x edx=0x0603 ^^ So, cpuid tells that CondChg bit is supported on this processor, too. > In any case, the proposed patch seems fine, just needs a better > changelog. > I see. I'll write that the problem is that any NMI could be robbed by NMI watchdog explicitly. Now only patch title says this explicitly. This is your first comment. About CondChgd bit, I cannot write more than I see on actual system. If it's necessary to describe more about CondChgd bit, it would be appreciated if someone tell me more information about it. Thanks. HATAYAMA, Daisuke N§²ζμrΈyϊθΨb²X¬ΆΗ§vΨ^)ήΊ{.nΗ+·₯{±κηzX§Ά‘ά¨}©²Ζ zΪ:+v¨Ύ«κηzZ+Κ+zf£’·h§~Ϋi�ϋΰzΉ�w₯’Έ?¨θΪ&’)ί’fω^jΗ«y§m α@A«aΆΪ� 0Άμh�εi
Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
From: Peter Zijlstra Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Date: Wed, 11 Jun 2014 10:54:48 +0200 > On Wed, Jun 11, 2014 at 04:30:28PM +0900, HATAYAMA Daisuke wrote: >> Currently, a NMI handler for NMI watchdog may falsely handle any NMI >> signaled for different purpose if CondChgd bit in >> MSR_CORE_PERF_GLOBAL_STATUS MSR is set. >> >> This commit deals with the issue simply by ignoring CondChgd bit. >> >> Here is explanation in detail. >> >> On x86 NMI watchdog uses performance monitoring feature to >> periodically signal NMI each time performance counter gets overflowed. >> >> intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI >> handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner >> of a given NMI by looking at overflow status bits in >> MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it >> handles the given NMI as its own NMI. >> >> The problem is that intel_pmu_handle_irq() doesn't distinguish >> CondChgd bit from other bits. Unlike the other status bits, CondChgd >> bit doesn't represent overflow status for performance counters. Thus, >> CondChgd bit cannot be thought of as a mark indicating a given NMI is >> NMI watchdog's. > > So what was the problem? It ate another NMI? > Yes. The handler for NMI watchdog is called in NMI_LOCAL. This is done in earlier timing than NMI_UNKNOWN, and if some of the handlers in NMI_LOCAL handles at least one NMI, then handlers in NMI_UNKNOWN are not called. Like this: handled = nmi_handle(NMI_LOCAL, regs, b2b); __this_cpu_add(nmi_stats.normal, handled); if (handled) { /* * There are cases when a NMI handler handles multiple * events in the current NMI. One of these events may * be queued for in the next NMI. Because the event is * already handled, the next NMI will result in an unknown * NMI. Instead lets flag this for a potential NMI to * swallow. */ if (handled > 1) __this_cpu_write(swallow_nmi, true); return; } For example, we use unknown NMI to make system panic by enabling kernel.unkown_nmi_panic. Without this fix, unknown NMI is robbed by NMI handler for NMI watchdog. >> I noticed this behavior on systems with Ivy Bridge processors: Intel >> Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, >> CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set >> in the beginning at boot. (then the CondChgd bit is cleared by next >> wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.) >> >> On the other hand, on older processors such as Nehalem, CondChgd bit >> is not set in the beginning at boot. >> >> I'm not sure about exact behavior of CondChgd bit, in particular when >> this bit is set. Although I read Intel System Programmer's Manual to >> figure out but I have yet completed that. At least, I think ignoring >> CondChgd bit should be enough for NMI watchdog perspective. > > So yes, the SDM lists the bit as existing but never once mentions it > outside of that, and its been doing that at least back to 2008. > > Ooh, I found it: > > "The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to > indicate changes to the state of performance monitoring hardware (see > Figure 18-29)." > > Which is of course completely useless, not to mention inconsistent with > the later CondChgd name. > > HPA, can you explain wtf that bit does and why hatayama-san's ivb feels > like having that set on boot? -- Thanks. HATAYAMA, Daisuke
Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
From: Peter Zijlstra pet...@infradead.org Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Date: Wed, 11 Jun 2014 10:54:48 +0200 On Wed, Jun 11, 2014 at 04:30:28PM +0900, HATAYAMA Daisuke wrote: Currently, a NMI handler for NMI watchdog may falsely handle any NMI signaled for different purpose if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. So what was the problem? It ate another NMI? Yes. The handler for NMI watchdog is called in NMI_LOCAL. This is done in earlier timing than NMI_UNKNOWN, and if some of the handlers in NMI_LOCAL handles at least one NMI, then handlers in NMI_UNKNOWN are not called. Like this: handled = nmi_handle(NMI_LOCAL, regs, b2b); __this_cpu_add(nmi_stats.normal, handled); if (handled) { /* * There are cases when a NMI handler handles multiple * events in the current NMI. One of these events may * be queued for in the next NMI. Because the event is * already handled, the next NMI will result in an unknown * NMI. Instead lets flag this for a potential NMI to * swallow. */ if (handled 1) __this_cpu_write(swallow_nmi, true); return; } For example, we use unknown NMI to make system panic by enabling kernel.unkown_nmi_panic. Without this fix, unknown NMI is robbed by NMI handler for NMI watchdog. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. (then the CondChgd bit is cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.) On the other hand, on older processors such as Nehalem, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out but I have yet completed that. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. So yes, the SDM lists the bit as existing but never once mentions it outside of that, and its been doing that at least back to 2008. Ooh, I found it: The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performance monitoring hardware (see Figure 18-29). Which is of course completely useless, not to mention inconsistent with the later CondChgd name. HPA, can you explain wtf that bit does and why hatayama-san's ivb feels like having that set on boot? -- Thanks. HATAYAMA, Daisuke
Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
From: Peter Zijlstra pet...@infradead.org Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling Date: Wed, 11 Jun 2014 13:54:13 +0200 On Wed, Jun 11, 2014 at 10:54:48AM +0200, Peter Zijlstra wrote: I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out but I have yet completed that. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. So yes, the SDM lists the bit as existing but never once mentions it outside of that, and its been doing that at least back to 2008. Ooh, I found it: The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to indicate changes to the state of performance monitoring hardware (see Figure 18-29). Which is of course completely useless, not to mention inconsistent with the later CondChgd name. HPA, can you explain wtf that bit does and why hatayama-san's ivb feels like having that set on boot? Matt found in the MSR listing for GLOBAL_STATUS: 63 CondChg: status bits of this register has changed. If CPUID.0AH: EAX[7:0] 0 Which brings us to a grand total of 3 different names for this bit. If it indeed does what it says on the tin, set every time the status changes its like the most useless bit ever and I wonder why people bothered to spend silicon on it. Yes, I didn't mention in patch description this, but I reached the same conclusion. The description confuses me because the desciption and the behaviour of CondChg bit I see on the actual system is not coincide. Also, I checked cpuid on the system with Neharlem processor where I have never seen CondChg bit is set. [root@localhost ~]# ./cpuid -r CPU 0: 0x 0x00: eax=0x000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69 0x0001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff snip 0x000a 0x00: eax=0x07300403 ebx=0x0044 ecx=0x edx=0x0603 ^^ So, cpuid tells that CondChg bit is supported on this processor, too. In any case, the proposed patch seems fine, just needs a better changelog. I see. I'll write that the problem is that any NMI could be robbed by NMI watchdog explicitly. Now only patch title says this explicitly. This is your first comment. About CondChgd bit, I cannot write more than I see on actual system. If it's necessary to describe more about CondChgd bit, it would be appreciated if someone tell me more information about it. Thanks. HATAYAMA, Daisuke N§²ζμrΈyϊθΨb²X¬ΆΗ§vΨ^)ήΊ{.nΗ+·₯{±κηzX§Ά‘ά¨}©²Ζ zΪj:+v¨Ύ«κηzZ+Κ+zf£’·h§~Ϋi�ϋΰzΉ�w₯’Έ?¨θΪ’)ί’fω^jΗ«y§m α@A«aΆΪ� 0Άμh�εi
[PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Currently, a NMI handler for NMI watchdog may falsely handle any NMI signaled for different purpose if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. (then the CondChgd bit is cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.) On the other hand, on older processors such as Nehalem, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out but I have yet completed that. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke --- arch/x86/kernel/cpu/perf_event_intel.c |9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *))) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *))) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Currently, a NMI handler for NMI watchdog may falsely handle any NMI signaled for different purpose if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. (then the CondChgd bit is cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.) On the other hand, on older processors such as Nehalem, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out but I have yet completed that. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- arch/x86/kernel/cpu/perf_event_intel.c |9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..07846d7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1382,6 +1382,15 @@ again: intel_pmu_lbr_read(); /* +* CondChgd bit 63 doesn't mean any overflow status. Ignore +* and clear the bit. +*/ + if (__test_and_clear_bit(63, (unsigned long *)status)) { + if (!status) + goto done; + } + + /* * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *)status)) { -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process
(2014/01/26 13:02), David Rientjes wrote: > On Thu, 16 Jan 2014, HATAYAMA Daisuke wrote: > >> Hello, >> >> This patch deals with the issue of handling boot_cpu_physical_apicid >> in boot process I avoided in disable_cpu_apicid patch because I >> cannot guess how long it needs to take for the review of this fix. >> >> This patch is made on top of today's x86/apic branch of tip tree. >> Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb >> > > This breaks with SGI 320/540: > > arch/x86/platform/visws/visws_quirks.c: In function ‘MP_processor_info’: > arch/x86/platform/visws/visws_quirks.c:169:3: error: > ‘bios_cpu_physical_apicid’ undeclared (first use in this function) > arch/x86/platform/visws/visws_quirks.c:169:3: note: each undeclared > identifier is reported only once for each function it appears in > Thanks. I'll retry with CONFIG_X86_VISWS... obj-$(CONFIG_X86_VISWS) += visws_quirks.o > It makes it pretty apparent that you want a different name for the new > variable, bios_bsp_physical_apicid is just too close to > boot_cpu_physical_apicid that even the author of the patch missed a > conversion. > ``bsp'' and ``boot cpu'' are different, but certainly they are close for most of people who don't need to focus the difference. One idea is not to add bios_bsp_physical_apicid, just removing assignment to boot_cpu_physical_apicid from MP_processor_info(). I guess currently no one uses apicid provided by MP table since boot_cpu_physical_apicid is finally initialized in init_apic_mappings(). > (It's also really sad that we can't make it __initdata since it's only > valid at boot because of the hotplug stuff.) > -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process
(2014/01/26 13:02), David Rientjes wrote: On Thu, 16 Jan 2014, HATAYAMA Daisuke wrote: Hello, This patch deals with the issue of handling boot_cpu_physical_apicid in boot process I avoided in disable_cpu_apicid patch because I cannot guess how long it needs to take for the review of this fix. This patch is made on top of today's x86/apic branch of tip tree. Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb This breaks with SGI 320/540: arch/x86/platform/visws/visws_quirks.c: In function ‘MP_processor_info’: arch/x86/platform/visws/visws_quirks.c:169:3: error: ‘bios_cpu_physical_apicid’ undeclared (first use in this function) arch/x86/platform/visws/visws_quirks.c:169:3: note: each undeclared identifier is reported only once for each function it appears in Thanks. I'll retry with CONFIG_X86_VISWS... obj-$(CONFIG_X86_VISWS) += visws_quirks.o It makes it pretty apparent that you want a different name for the new variable, bios_bsp_physical_apicid is just too close to boot_cpu_physical_apicid that even the author of the patch missed a conversion. ``bsp'' and ``boot cpu'' are different, but certainly they are close for most of people who don't need to focus the difference. One idea is not to add bios_bsp_physical_apicid, just removing assignment to boot_cpu_physical_apicid from MP_processor_info(). I guess currently no one uses apicid provided by MP table since boot_cpu_physical_apicid is finally initialized in init_apic_mappings(). (It's also really sad that we can't make it __initdata since it's only valid at boot because of the hotplug stuff.) -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process
Hello, This patch deals with the issue of handling boot_cpu_physical_apicid in boot process I avoided in disable_cpu_apicid patch because I cannot guess how long it needs to take for the review of this fix. This patch is made on top of today's x86/apic branch of tip tree. Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb I tested this patch in each combination of the following table: BIOS tableboot cpu -+-- ACPI MADT AP MP Table BSP >From 10946038252fc91f3ae2740953b2484ea0076ed4 Mon Sep 17 00:00:00 2001 From: HATAYAMA Daisuke Date: Thu, 16 Jan 2014 15:47:25 +0900 Subject: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process boot_cpu_physical_apicid variable is intended to have the apicid of the processor that is doing the boot up by read_apic_id() that picks up the value from APIC_ID register of local APIC. boot_cpu_physical_apicid variable could be initialized with the value from MP table if there's no acpi support on the system or acpi=off is manually specified. The first issue is that if kexec boots the 2nd kernel on some AP, the processor that is doing the boot up of the 2nd kernel is no longer BSP. This means that boot_cpu_physical_apicid is modified to unintended BSP's apicid. The second issue is that in the current code, once the boot_cpu_physical_apicid variable is modified by the value from MP table, it is again modified back to read_apic_id(). Concretely, they are processed as follows in setup_arch(): 1) boot_cpu_physical_apicid is first initialized by read_apic_id() in acpi_boot_init(), then 2) re-initialized by get_smp_config() and finally 3) modified back by read_apic_id() in init_apic_mappings(). In source code: /* * Read APIC and some other early information from ACPI tables. */ acpi_boot_init(); <== 1) sfi_init(); x86_dtb_init(); /* * get boot-time SMP configuration: */ if (smp_found_config) get_smp_config(); <== 2) prefill_possible_map(); init_cpu_to_node(); init_apic_mappings(); <== 3) It is confusing that meaning of the variable changes in the middle of boot up. To fix these issues, this patch adds bios_bsp_physical_apicid variable to store the apicid value reported by BIOS via ACPI MADT or MP table, by which boot_cpu_physical_apicid variable is cleanly handled in boot up. This patch also fixes the workaround done in generic_processor_info() that directly calls read_apic_id() to avoid the first issue. Signed-off-by: HATAYAMA Daisuke --- arch/x86/include/asm/mpspec.h | 1 + arch/x86/kernel/acpi/boot.c| 8 arch/x86/kernel/apic/apic.c| 31 ++- arch/x86/kernel/mpparse.c | 2 +- arch/x86/platform/visws/visws_quirks.c | 2 +- 5 files changed, 21 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h index 3142a94..724853f 100644 --- a/arch/x86/include/asm/mpspec.h +++ b/arch/x86/include/asm/mpspec.h @@ -47,6 +47,7 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES]; extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES); extern unsigned int boot_cpu_physical_apicid; +extern unsigned int bios_bsp_physical_apicid; extern unsigned int max_physical_apicid; extern int mpc_default_type; extern unsigned long mp_lapic_addr; diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 6c0b43b..4f460e2 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -213,6 +213,14 @@ static int acpi_register_lapic(int id, u8 enabled) if (boot_cpu_physical_apicid != -1U) ver = apic_version[boot_cpu_physical_apicid]; + /* +* ACPI specification describes that platform firmware should +* list the BSP processor as the first LAPIC entry in the +* MADT. +*/ + if (!num_processors && !disabled_cpus) + bios_bsp_physical_apicid = id; + return generic_processor_info(id, ver); } diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 7f26c9a..107614f 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -64,6 +64,15 @@ unsigned disabled_cpus; unsigned int boot_cpu_physical_apicid = -1U; EXPORT_SYMBOL_GPL(boot_cpu_physical_apicid); +/* Processor with BP flag on IP32_APIC_BASE MSR to be initialized with + * the value reported by BIOS tables such as ACPI MADT or MP table. + * + * If kexec boots the 2nd kernel on some AP, this differs from the + * processor that is doing the boot up: boot_cpu_physical_apicid != + * bios_bsp_physical_apicid. + */ +unsigned int bios_bsp_physical_apicid = BAD_APICID; + /* * The highest APIC ID seen during enumeration. */ @@ -2120,28 +2129,8 @@ int generic_processor_info(int
[PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process
Hello, This patch deals with the issue of handling boot_cpu_physical_apicid in boot process I avoided in disable_cpu_apicid patch because I cannot guess how long it needs to take for the review of this fix. This patch is made on top of today's x86/apic branch of tip tree. Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb I tested this patch in each combination of the following table: BIOS tableboot cpu -+-- ACPI MADT AP MP Table BSP From 10946038252fc91f3ae2740953b2484ea0076ed4 Mon Sep 17 00:00:00 2001 From: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Date: Thu, 16 Jan 2014 15:47:25 +0900 Subject: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process boot_cpu_physical_apicid variable is intended to have the apicid of the processor that is doing the boot up by read_apic_id() that picks up the value from APIC_ID register of local APIC. boot_cpu_physical_apicid variable could be initialized with the value from MP table if there's no acpi support on the system or acpi=off is manually specified. The first issue is that if kexec boots the 2nd kernel on some AP, the processor that is doing the boot up of the 2nd kernel is no longer BSP. This means that boot_cpu_physical_apicid is modified to unintended BSP's apicid. The second issue is that in the current code, once the boot_cpu_physical_apicid variable is modified by the value from MP table, it is again modified back to read_apic_id(). Concretely, they are processed as follows in setup_arch(): 1) boot_cpu_physical_apicid is first initialized by read_apic_id() in acpi_boot_init(), then 2) re-initialized by get_smp_config() and finally 3) modified back by read_apic_id() in init_apic_mappings(). In source code: /* * Read APIC and some other early information from ACPI tables. */ acpi_boot_init(); == 1) sfi_init(); x86_dtb_init(); /* * get boot-time SMP configuration: */ if (smp_found_config) get_smp_config(); == 2) prefill_possible_map(); init_cpu_to_node(); init_apic_mappings(); == 3) It is confusing that meaning of the variable changes in the middle of boot up. To fix these issues, this patch adds bios_bsp_physical_apicid variable to store the apicid value reported by BIOS via ACPI MADT or MP table, by which boot_cpu_physical_apicid variable is cleanly handled in boot up. This patch also fixes the workaround done in generic_processor_info() that directly calls read_apic_id() to avoid the first issue. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- arch/x86/include/asm/mpspec.h | 1 + arch/x86/kernel/acpi/boot.c| 8 arch/x86/kernel/apic/apic.c| 31 ++- arch/x86/kernel/mpparse.c | 2 +- arch/x86/platform/visws/visws_quirks.c | 2 +- 5 files changed, 21 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h index 3142a94..724853f 100644 --- a/arch/x86/include/asm/mpspec.h +++ b/arch/x86/include/asm/mpspec.h @@ -47,6 +47,7 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES]; extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES); extern unsigned int boot_cpu_physical_apicid; +extern unsigned int bios_bsp_physical_apicid; extern unsigned int max_physical_apicid; extern int mpc_default_type; extern unsigned long mp_lapic_addr; diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 6c0b43b..4f460e2 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -213,6 +213,14 @@ static int acpi_register_lapic(int id, u8 enabled) if (boot_cpu_physical_apicid != -1U) ver = apic_version[boot_cpu_physical_apicid]; + /* +* ACPI specification describes that platform firmware should +* list the BSP processor as the first LAPIC entry in the +* MADT. +*/ + if (!num_processors !disabled_cpus) + bios_bsp_physical_apicid = id; + return generic_processor_info(id, ver); } diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 7f26c9a..107614f 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -64,6 +64,15 @@ unsigned disabled_cpus; unsigned int boot_cpu_physical_apicid = -1U; EXPORT_SYMBOL_GPL(boot_cpu_physical_apicid); +/* Processor with BP flag on IP32_APIC_BASE MSR to be initialized with + * the value reported by BIOS tables such as ACPI MADT or MP table. + * + * If kexec boots the 2nd kernel on some AP, this differs from the + * processor that is doing the boot up: boot_cpu_physical_apicid != + * bios_bsp_physical_apicid. + */ +unsigned int bios_bsp_physical_apicid = BAD_APICID; + /* * The highest APIC ID seen during enumeration. */ @@ -2120,28 +2129,8 @@ int
Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos
(2014/01/16 14:53), H. Peter Anvin wrote: On 01/15/2014 08:44 PM, HATAYAMA Daisuke wrote: This is not typo in my intention. generic_processor_info() has two more cases where it ignores cpus. In either cases, printed messages are tagged with "ACPI" because this function is called when parsing ACPI MADT table in acpi_boot_init(); this function is also being used to parse other kind of tables but the "ACPI" tag would mean that the function was first for ACPI only. But it has nothing to do with ACPI -- it is an APIC ID from the command line -- so that would be actively misleading. -hpa I never disagree to the fix itself. I wanted to explain why I wrote so. I thought it was better to unify tags in the same function because they should bleong to the same component, here I mean ACPI, but it's better to avoid the confusion, which is bigger impact. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos
(2014/01/16 6:09), tip-bot for H. Peter Anvin wrote: Commit-ID: 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb Gitweb: http://git.kernel.org/tip/5b4d1dbc24bb6fd7179ada0f47be34e27e64decb Author: H. Peter Anvin AuthorDate: Wed, 15 Jan 2014 13:02:08 -0800 Committer: H. Peter Anvin CommitDate: Wed, 15 Jan 2014 13:02:08 -0800 x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos Make disabled_cpu_apicid static and read_mostly, and fix a couple of typos. Reported-by: Ingo Molnar Link: http://lkml.kernel.org/r/20140115182511.ga22...@gmail.com Signed-off-by: H. Peter Anvin Cc: HATAYAMA Daisuke --- arch/x86/kernel/apic/apic.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index e78ab8c..7f26c9a 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -79,7 +79,7 @@ physid_mask_t phys_cpu_present_map; * disable_cpu_apicid=, mostly used for the kdump 2nd kernel to * avoid undefined behaviour caused by sending INIT from AP to BSP. */ -unsigned int disabled_cpu_apicid = BAD_APICID; +static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID; /* * Map cpu index to physical APIC ID @@ -2124,7 +2124,7 @@ int generic_processor_info(int apicid, int version) * boot_cpu_physical_apicid is designed to have the apicid * returned by read_apic_id(), i.e, the apicid of the * currently booting-up processor. However, on some platforms, -* it is temporarilly modified by the apicid reported as BSP +* it is temporarily modified by the apicid reported as BSP * through MP table. Concretely: * * - arch/x86/kernel/mpparse.c: MP_processor_info() @@ -2145,7 +2145,7 @@ int generic_processor_info(int apicid, int version) disabled_cpu_apicid == apicid) { int thiscpu = num_processors + disabled_cpus; - pr_warning("ACPI: Disabling requested cpu." + pr_warning("APIC: Disabling requested cpu." " Processor %d/0x%x ignored.\n", thiscpu, apicid); This is not typo in my intention. generic_processor_info() has two more cases where it ignores cpus. In either cases, printed messages are tagged with "ACPI" because this function is called when parsing ACPI MADT table in acpi_boot_init(); this function is also being used to parse other kind of tables but the "ACPI" tag would mean that the function was first for ACPI only. int generic_processor_info(int apicid, int version) { int cpu, max = nr_cpu_ids; bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid, phys_cpu_present_map); /* * If boot cpu has not been detected yet, then only allow upto * nr_cpu_ids - 1 processors and keep one slot free for boot cpu */ if (!boot_cpu_detected && num_processors >= nr_cpu_ids - 1 && apicid != boot_cpu_physical_apicid) { int thiscpu = max + disabled_cpus - 1; pr_warning( "ACPI: NR_CPUS/possible_cpus limit of %i almost" " reached. Keeping one slot for boot cpu." " Processor %d/0x%x ignored.\n", max, thiscpu, apicid); disabled_cpus++; return -ENODEV; } if (num_processors >= nr_cpu_ids) { int thiscpu = max + disabled_cpus; pr_warning( "ACPI: NR_CPUS/possible_cpus limit of %i reached." " Processor %d/0x%x ignored.\n", max, thiscpu, apicid); disabled_cpus++; return -EINVAL; } -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/apic] x86, apic, kexec: Add disable_cpu_apicid kernel parameter
Commit-ID: 151e0c7de616310f95393d9306903900fcd8b277 Gitweb: http://git.kernel.org/tip/151e0c7de616310f95393d9306903900fcd8b277 Author: HATAYAMA Daisuke AuthorDate: Wed, 15 Jan 2014 15:44:58 +0900 Committer: H. Peter Anvin CommitDate: Wed, 15 Jan 2014 09:19:20 -0800 x86, apic, kexec: Add disable_cpu_apicid kernel parameter Add disable_cpu_apicid kernel parameter. To use this kernel parameter, specify an initial APIC ID of the corresponding CPU you want to disable. This is mostly used for the kdump 2nd kernel to disable BSP to wake up multiple CPUs without causing system reset or hang due to sending INIT from AP to BSP. Kdump users first figure out initial APIC ID of the BSP, CPU0 in the 1st kernel, for example from /proc/cpuinfo and then set up this kernel parameter for the 2nd kernel using the obtained APIC ID. However, doing this procedure at each boot time manually is awkward, which should be automatically done by user-land service scripts, for example, kexec-tools on fedora/RHEL distributions. This design is more flexible than disabling BSP in kernel boot time automatically in that in kernel boot time we have no choice but referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning that the method is not applicable to the systems without such BIOS tables. One assumption behind this design is that users get initial APIC ID of the BSP in still healthy state and so BSP is uniquely kept in CPU0. Thus, through the kernel parameter, only one initial APIC ID can be specified. In a comparison with disabled_cpu_apicid, we use read_apic_id(), not boot_cpu_physical_apicid, because on some platforms, the variable is modified to the apicid reported as BSP through MP table and this function is executed with the temporarily modified boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel parameter doesn't work well for apicids of APs. Fixing the wrong handling of boot_cpu_physical_apicid requires some reviews and tests beyond some platforms and it could take some time. The fix here is a kind of workaround to focus on the main topic of this patch. Signed-off-by: HATAYAMA Daisuke Link: http://lkml.kernel.org/r/20140115064458.1545.38775.stgit@localhost6.localdomain6 Signed-off-by: H. Peter Anvin --- Documentation/kernel-parameters.txt | 9 +++ arch/x86/kernel/apic/apic.c | 49 + 2 files changed, 58 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b9e9bd8..6fdbf8c 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -774,6 +774,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. disable=[IPV6] See Documentation/networking/ipv6.txt. + disable_cpu_apicid= [X86,APIC,SMP] + Format: + The number of initial APIC ID for the + corresponding CPU to be disabled at boot, + mostly used for the kdump 2nd kernel to + disable BSP to wake up multiple CPUs without + causing system reset or hang due to sending + INIT from AP to BSP. + disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this if to workaround buggy firmware. diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 4ec1dd6..e78ab8c 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -75,6 +75,13 @@ unsigned int max_physical_apicid; physid_mask_t phys_cpu_present_map; /* + * Processor to be disabled specified by kernel parameter + * disable_cpu_apicid=, mostly used for the kdump 2nd kernel to + * avoid undefined behaviour caused by sending INIT from AP to BSP. + */ +unsigned int disabled_cpu_apicid = BAD_APICID; + +/* * Map cpu index to physical APIC ID */ DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID); @@ -2114,6 +2121,39 @@ int generic_processor_info(int apicid, int version) phys_cpu_present_map); /* +* boot_cpu_physical_apicid is designed to have the apicid +* returned by read_apic_id(), i.e, the apicid of the +* currently booting-up processor. However, on some platforms, +* it is temporarilly modified by the apicid reported as BSP +* through MP table. Concretely: +* +* - arch/x86/kernel/mpparse.c: MP_processor_info() +* - arch/x86/mm/amdtopology.c: amd_numa_init() +* - arch/x86/platform/visws/visws_quirks.c: MP_processor_info() +* +* This function is executed with the modified +* boot_cpu_physical_apicid. So, disabled_cpu_apicid kernel +* parameter doesn't work to disable APs on kdump 2nd kernel. +* +* Since fixing handling
[tip:x86/apic] x86, apic, kexec: Add disable_cpu_apicid kernel parameter
Commit-ID: 151e0c7de616310f95393d9306903900fcd8b277 Gitweb: http://git.kernel.org/tip/151e0c7de616310f95393d9306903900fcd8b277 Author: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com AuthorDate: Wed, 15 Jan 2014 15:44:58 +0900 Committer: H. Peter Anvin h...@zytor.com CommitDate: Wed, 15 Jan 2014 09:19:20 -0800 x86, apic, kexec: Add disable_cpu_apicid kernel parameter Add disable_cpu_apicid kernel parameter. To use this kernel parameter, specify an initial APIC ID of the corresponding CPU you want to disable. This is mostly used for the kdump 2nd kernel to disable BSP to wake up multiple CPUs without causing system reset or hang due to sending INIT from AP to BSP. Kdump users first figure out initial APIC ID of the BSP, CPU0 in the 1st kernel, for example from /proc/cpuinfo and then set up this kernel parameter for the 2nd kernel using the obtained APIC ID. However, doing this procedure at each boot time manually is awkward, which should be automatically done by user-land service scripts, for example, kexec-tools on fedora/RHEL distributions. This design is more flexible than disabling BSP in kernel boot time automatically in that in kernel boot time we have no choice but referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning that the method is not applicable to the systems without such BIOS tables. One assumption behind this design is that users get initial APIC ID of the BSP in still healthy state and so BSP is uniquely kept in CPU0. Thus, through the kernel parameter, only one initial APIC ID can be specified. In a comparison with disabled_cpu_apicid, we use read_apic_id(), not boot_cpu_physical_apicid, because on some platforms, the variable is modified to the apicid reported as BSP through MP table and this function is executed with the temporarily modified boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel parameter doesn't work well for apicids of APs. Fixing the wrong handling of boot_cpu_physical_apicid requires some reviews and tests beyond some platforms and it could take some time. The fix here is a kind of workaround to focus on the main topic of this patch. Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com Link: http://lkml.kernel.org/r/20140115064458.1545.38775.stgit@localhost6.localdomain6 Signed-off-by: H. Peter Anvin h...@zytor.com --- Documentation/kernel-parameters.txt | 9 +++ arch/x86/kernel/apic/apic.c | 49 + 2 files changed, 58 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b9e9bd8..6fdbf8c 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -774,6 +774,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. disable=[IPV6] See Documentation/networking/ipv6.txt. + disable_cpu_apicid= [X86,APIC,SMP] + Format: int + The number of initial APIC ID for the + corresponding CPU to be disabled at boot, + mostly used for the kdump 2nd kernel to + disable BSP to wake up multiple CPUs without + causing system reset or hang due to sending + INIT from AP to BSP. + disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this if to workaround buggy firmware. diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 4ec1dd6..e78ab8c 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -75,6 +75,13 @@ unsigned int max_physical_apicid; physid_mask_t phys_cpu_present_map; /* + * Processor to be disabled specified by kernel parameter + * disable_cpu_apicid=int, mostly used for the kdump 2nd kernel to + * avoid undefined behaviour caused by sending INIT from AP to BSP. + */ +unsigned int disabled_cpu_apicid = BAD_APICID; + +/* * Map cpu index to physical APIC ID */ DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID); @@ -2114,6 +2121,39 @@ int generic_processor_info(int apicid, int version) phys_cpu_present_map); /* +* boot_cpu_physical_apicid is designed to have the apicid +* returned by read_apic_id(), i.e, the apicid of the +* currently booting-up processor. However, on some platforms, +* it is temporarilly modified by the apicid reported as BSP +* through MP table. Concretely: +* +* - arch/x86/kernel/mpparse.c: MP_processor_info() +* - arch/x86/mm/amdtopology.c: amd_numa_init() +* - arch/x86/platform/visws/visws_quirks.c: MP_processor_info() +* +* This function is executed with the modified +* boot_cpu_physical_apicid. So, disabled_cpu_apicid kernel +* parameter doesn't work to disable APs
Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos
(2014/01/16 6:09), tip-bot for H. Peter Anvin wrote: Commit-ID: 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb Gitweb: http://git.kernel.org/tip/5b4d1dbc24bb6fd7179ada0f47be34e27e64decb Author: H. Peter Anvin h...@zytor.com AuthorDate: Wed, 15 Jan 2014 13:02:08 -0800 Committer: H. Peter Anvin h...@zytor.com CommitDate: Wed, 15 Jan 2014 13:02:08 -0800 x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos Make disabled_cpu_apicid static and read_mostly, and fix a couple of typos. Reported-by: Ingo Molnar mi...@kernel.org Link: http://lkml.kernel.org/r/20140115182511.ga22...@gmail.com Signed-off-by: H. Peter Anvin h...@linux.intel.com Cc: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com --- arch/x86/kernel/apic/apic.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index e78ab8c..7f26c9a 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -79,7 +79,7 @@ physid_mask_t phys_cpu_present_map; * disable_cpu_apicid=int, mostly used for the kdump 2nd kernel to * avoid undefined behaviour caused by sending INIT from AP to BSP. */ -unsigned int disabled_cpu_apicid = BAD_APICID; +static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID; /* * Map cpu index to physical APIC ID @@ -2124,7 +2124,7 @@ int generic_processor_info(int apicid, int version) * boot_cpu_physical_apicid is designed to have the apicid * returned by read_apic_id(), i.e, the apicid of the * currently booting-up processor. However, on some platforms, -* it is temporarilly modified by the apicid reported as BSP +* it is temporarily modified by the apicid reported as BSP * through MP table. Concretely: * * - arch/x86/kernel/mpparse.c: MP_processor_info() @@ -2145,7 +2145,7 @@ int generic_processor_info(int apicid, int version) disabled_cpu_apicid == apicid) { int thiscpu = num_processors + disabled_cpus; - pr_warning(ACPI: Disabling requested cpu. + pr_warning(APIC: Disabling requested cpu. Processor %d/0x%x ignored.\n, thiscpu, apicid); This is not typo in my intention. generic_processor_info() has two more cases where it ignores cpus. In either cases, printed messages are tagged with ACPI because this function is called when parsing ACPI MADT table in acpi_boot_init(); this function is also being used to parse other kind of tables but the ACPI tag would mean that the function was first for ACPI only. int generic_processor_info(int apicid, int version) { int cpu, max = nr_cpu_ids; bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid, phys_cpu_present_map); /* * If boot cpu has not been detected yet, then only allow upto * nr_cpu_ids - 1 processors and keep one slot free for boot cpu */ if (!boot_cpu_detected num_processors = nr_cpu_ids - 1 apicid != boot_cpu_physical_apicid) { int thiscpu = max + disabled_cpus - 1; pr_warning( ACPI: NR_CPUS/possible_cpus limit of %i almost reached. Keeping one slot for boot cpu. Processor %d/0x%x ignored.\n, max, thiscpu, apicid); disabled_cpus++; return -ENODEV; } if (num_processors = nr_cpu_ids) { int thiscpu = max + disabled_cpus; pr_warning( ACPI: NR_CPUS/possible_cpus limit of %i reached. Processor %d/0x%x ignored.\n, max, thiscpu, apicid); disabled_cpus++; return -EINVAL; } -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos
(2014/01/16 14:53), H. Peter Anvin wrote: On 01/15/2014 08:44 PM, HATAYAMA Daisuke wrote: This is not typo in my intention. generic_processor_info() has two more cases where it ignores cpus. In either cases, printed messages are tagged with ACPI because this function is called when parsing ACPI MADT table in acpi_boot_init(); this function is also being used to parse other kind of tables but the ACPI tag would mean that the function was first for ACPI only. But it has nothing to do with ACPI -- it is an APIC ID from the command line -- so that would be actively misleading. -hpa I never disagree to the fix itself. I wanted to explain why I wrote so. I thought it was better to unify tags in the same function because they should bleong to the same component, here I mean ACPI, but it's better to avoid the confusion, which is bigger impact. -- Thanks. HATAYAMA, Daisuke -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH v10] x86, apic, kexec, Documentation: Add disable_cpu_apicid kernel parameter
Add disable_cpu_apicid kernel parameter. To use this kernel parameter, specify an initial APIC ID of the corresponding CPU you want to disable. This is mostly used for the kdump 2nd kernel to disable BSP to wake up multiple CPUs without causing system reset or hang due to sending INIT from AP to BSP. Kdump users first figure out initial APIC ID of the BSP, CPU0 in the 1st kernel, for example from /proc/cpuinfo and then set up this kernel parameter for the 2nd kernel using the obtained APIC ID. However, doing this procedure at each boot time manually is awkward, which should be automatically done by user-land service scripts, for example, kexec-tools on fedora/RHEL distributions. This design is more flexible than disabling BSP in kernel boot time automatically in that in kernel boot time we have no choice but referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning that the method is not applicable to the systems without such BIOS tables. One assumption behind this design is that users get initial APIC ID of the BSP in still healthy state and so BSP is uniquely kept in CPU0. Thus, through the kernel parameter, only one initial APIC ID can be specified. In a comparison with disabled_cpu_apicid, we use read_apic_id(), not boot_cpu_physical_apicid, because on some platforms, the variable is modified to the apicid reported as BSP through MP table and this function is executed with the temporarily modified boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel parameter doesn't work well for apicids of APs. Fixing the wrong handling of boot_cpu_physical_apicid requires some reviews and tests beyond some platforms and it could take some time. The fix here is a kind of workaround to focus on the main topic of this patch. Signed-off-by: HATAYAMA Daisuke --- Documentation/kernel-parameters.txt |9 ++ arch/x86/kernel/apic/apic.c | 49 +++ 2 files changed, 58 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 50680a5..4e5528c 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -774,6 +774,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. disable=[IPV6] See Documentation/networking/ipv6.txt. + disable_cpu_apicid= [X86,APIC,SMP] + Format: + The number of initial APIC ID for the + corresponding CPU to be disabled at boot, + mostly used for the kdump 2nd kernel to + disable BSP to wake up multiple CPUs without + causing system reset or hang due to sending + INIT from AP to BSP. + disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this if to workaround buggy firmware. diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index d278736..6c0b7d5 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -75,6 +75,13 @@ unsigned int max_physical_apicid; physid_mask_t phys_cpu_present_map; /* + * Processor to be disabled specified by kernel parameter + * disable_cpu_apicid=, mostly used for the kdump 2nd kernel to + * avoid undefined behaviour caused by sending INIT from AP to BSP. + */ +unsigned int disabled_cpu_apicid = BAD_APICID; + +/* * Map cpu index to physical APIC ID */ DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID); @@ -2115,6 +2122,39 @@ int generic_processor_info(int apicid, int version) phys_cpu_present_map); /* +* boot_cpu_physical_apicid is designed to have the apicid +* returned by read_apic_id(), i.e, the apicid of the +* currently booting-up processor. However, on some platforms, +* it is temporarilly modified by the apicid reported as BSP +* through MP table. Concretely: +* +* - arch/x86/kernel/mpparse.c: MP_processor_info() +* - arch/x86/mm/amdtopology.c: amd_numa_init() +* - arch/x86/platform/visws/visws_quirks.c: MP_processor_info() +* +* This function is executed with the modified +* boot_cpu_physical_apicid. So, disabled_cpu_apicid kernel +* parameter doesn't work to disable APs on kdump 2nd kernel. +* +* Since fixing handling of boot_cpu_physical_apicid requires +* another discussion and tests on each platform, we leave it +* for now and here we use read_apic_id() directly in this +* function, generic_processor_info(). +*/ + if (disabled_cpu_apicid != BAD_APICID && + disabled_cpu_apicid != read_apic_id() && + disabled_cpu_apicid == apicid) { + int thiscpu = num_processors + disabled_cpus; + +