RE: [PATCH 1/2 v5] kdump: add the vmcoreinfo documentation

2019-01-07 Thread Hatayama, Daisuke
Hi,

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Lianbo Jiang
> Sent: Monday, January 7, 2019 10:48 AM
> To: linux-kernel@vger.kernel.org
> Cc: ke...@lists.infradead.org; t...@linutronix.de; mi...@redhat.com;
> b...@alien8.de; x...@kernel.org; a...@linux-foundation.org; b...@redhat.com;
> dyo...@redhat.com; linux-...@vger.kernel.org; k-ha...@ab.jp.nec.com;
> ander...@redhat.com
> Subject: [PATCH 1/2 v5] kdump: add the vmcoreinfo documentation
> 
> This document lists some variables that export to vmcoreinfo, and briefly
> describles what these variables indicate. It should be instructive for
> many people who do not know the vmcoreinfo, and it also normalizes the

I agree to this part, but

> exported variables as a convention between kernel and use-space.

I don't agree to this part.

The meaning of each symbol is decided by each feature in the kernel,
not by vmcoreinfo. I suspect anyone mistakenly understand this document is
ABI enforcing each symbol works as described. We can change symbols and
their meaning regardless of this document.

Oh, I found this topic has already been discussed at v3, and
you removed "ABI" in the patch description at v4.

But it seems still confusing to me.
I think the explicit description saying that this is for user-land tools,
they treats each symbol as described,
and the document never affect implementation of each kernel components,
is necessary in e.g. "Purpose of this document" section?

> 
> Suggested-by: Borislav Petkov 
> Signed-off-by: Lianbo Jiang 
> ---
>  Documentation/kdump/vmcoreinfo.txt | 500 +
>  1 file changed, 500 insertions(+)
>  create mode 100644 Documentation/kdump/vmcoreinfo.txt
> 
> diff --git a/Documentation/kdump/vmcoreinfo.txt
> b/Documentation/kdump/vmcoreinfo.txt
> new file mode 100644
> index ..8e444586b87b
> --- /dev/null
> +++ b/Documentation/kdump/vmcoreinfo.txt
> @@ -0,0 +1,500 @@
> +
> + VMCOREINFO
> +
> +
> +===
> +What is the VMCOREINFO?
> +===
> +
> +VMCOREINFO is a special ELF note section. It contains various
> +information from the kernel like structure size, page size, symbol
> +values, field offsets, etc. These data are packed into an ELF note
> +section and used by user-space tools like crash and makedumpfile to
> +analyze a kernel's memory layout.
> +
> +
> +Common variables
> +
> +
> +init_uts_ns.name.release
> +
> +
> +The version of the Linux kernel. Used to find the corresponding source
> +code from which the kernel has been built.
> +
> +PAGE_SIZE
> +-
> +
> +The size of a page. It is the smallest unit of data for memory
> +management in kernel. It is usually 4096 bytes and a page is aligned
> +on 4096 bytes. Used for computing page addresses.
> +
> +init_uts_ns
> +---
> +
> +This is the UTS namespace, which is used to isolate two specific
> +elements of the system that relate to the uname(2) system call. The UTS
> +namespace is named after the data structure used to store information
> +returned by the uname(2) system call.
> +
> +User-space tools can get the kernel name, host name, kernel release
> +number, kernel version, architecture name and OS type from it.
> +
> +node_online_map
> +---
> +
> +An array node_states[N_ONLINE] which represents the set of online node
> +in a system, one bit position per node number. Used to keep track of
> +which nodes are in the system and online.
> +
> +swapper_pg_dir
> +-
> +
> +The global page directory pointer of the kernel. Used to translate
> +virtual to physical addresses.
> +
> +_stext
> +--
> +
> +Defines the beginning of the text section. In general, _stext indicates
> +the kernel start address. Used to convert a virtual address from the
> +direct kernel map to a physical address.
> +
> +vmap_area_list
> +--
> +
> +Stores the virtual area list. makedumpfile can get the vmalloc start
> +value from this variable. This value is necessary for vmalloc translation.
> +
> +mem_map
> +---
> +
> +Physical addresses are translated to struct pages by treating them as
> +an index into the mem_map array. Right-shifting a physical address
> +PAGE_SHIFT bits converts it into a page frame number which is an index
> +into that mem_map array.
> +
> +Used to map an address to the corresponding struct page.
> +
> +contig_page_data
> +
> +
> +Makedumpfile can get the pglist_data structure from this symbol, which
> +is used to describe the memory layout.
> +
> +User-space tools use this to exclude free pages when dumping memory.
> +
> +mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
> 

RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

2018-06-04 Thread Hatayama, Daisuke



> -Original Message-
> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> Sent: Tuesday, June 5, 2018 11:03 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ;
> 't...@kernel.org' ; Okajima, Toshiyuki 
> ; linux-kernel@vger.kernel.org;
> 'ebied...@aristanetworks.com' 
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> ebied...@xmission.com (Eric W. Biederman) writes:
> 
> > "Hatayama, Daisuke"  writes:
> >
> >>> Can you test this and please verify it fixes your issue?
> >>
> >> I tried this patch on top of v4.17 but the system fails to boot
> >> without detecting root disks by dracut like this:
> [snip]
> 
> >> OTOH, there's no issue on the pure v4.17 kernel.
> >>
> >> As above, ls /sys/module looks apparently good. But I guess any part of
> >> behavior of getdentries() on sysfs must have changed, affecting the disk
> >> detection...
> >
> > I haven't been able to reproduce this yet.  My test system boots fine.
> > Which fedora are you testing on?
> 
> I reproduced something similar and fedora 28.  So I think I have found
> and fixed the issue.  I believe I simply reversed the test at the end of
> kernfs_dir_pos. AKA "<" instead of ">".

Though too late, I used fedora 28 and RHEL7.5.

I applied this fix to your patch and the system boot was successfully done.

I'll start testing your patch from now on.

> 
> I am going to see if I can test my changes more throughly on this side
> and then repost.
> 
> 
> 
> >>>  fs/kernfs/dir.c | 109
> >>> ++--
> >>>  1 file changed, 67 insertions(+), 42 deletions(-)
> >>>
> >>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> >>> index 89d1dc19340b..8148b5fec48d 100644
> >>> --- a/fs/kernfs/dir.c
> >>> +++ b/fs/kernfs/dir.c
> >>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode
> *inode,
> >>> struct file *filp)
> >>>   return 0;
> >>>  }
> >>>
> >>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> >>> +{
> >>> + struct rb_node *node = rb_next(>rb);
> >>> + return node ? rb_to_kn(node) : NULL;
> >>> +}
> >>> +
> >>>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> >>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
> >>>  {
> >>> - if (pos) {
> >>> - int valid = kernfs_active(pos) &&
> >>> - pos->parent == parent && hash == pos->hash;
> >>> - kernfs_put(pos);
> >>> - if (!valid)
> >>> - pos = NULL;
> >>> - }
> >>> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
> >>> - struct rb_node *node = parent->dir.children.rb_node;
> >>> - while (node) {
> >>> - pos = rb_to_kn(node);
> >>> -
> >>> - if (hash < pos->hash)
> >>> - node = node->rb_left;
> >>> - else if (hash > pos->hash)
> >>> - node = node->rb_right;
> >>> - else
> >>> - break;
> >>> + struct kernfs_node *pos;
> >>> + struct rb_node *node;
> >>> + unsigned int hash;
> >>> + const char *name = "";
> >>> +
> >>> + /* Is off a valid name hash? */
> >>> + if ((off < 2) || (off >= INT_MAX))
> >>> + return NULL;
> >>> + hash = off;
> >>> +
> >>> + /* Is the saved position usable? */
> >>> + if (saved) {
> >>> + /* Proper parent and hash? */
> >>> + if ((parent != saved->parent) || (saved->hash != hash)) {
> >>> + saved = NULL;
> >>
> >> name is uninitialized in this path.
> >
> > It is.  name is initialized to "" see above.
> >
> >>> + } else {
> >>> + if (kernfs_active(saved))
> >>> + return saved;
> >>> + name = saved->name;
> >>>   }
> >>>   }
> >>&

RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

2018-06-04 Thread Hatayama, Daisuke



> -Original Message-
> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> Sent: Tuesday, June 5, 2018 11:03 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ;
> 't...@kernel.org' ; Okajima, Toshiyuki 
> ; linux-kernel@vger.kernel.org;
> 'ebied...@aristanetworks.com' 
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> ebied...@xmission.com (Eric W. Biederman) writes:
> 
> > "Hatayama, Daisuke"  writes:
> >
> >>> Can you test this and please verify it fixes your issue?
> >>
> >> I tried this patch on top of v4.17 but the system fails to boot
> >> without detecting root disks by dracut like this:
> [snip]
> 
> >> OTOH, there's no issue on the pure v4.17 kernel.
> >>
> >> As above, ls /sys/module looks apparently good. But I guess any part of
> >> behavior of getdentries() on sysfs must have changed, affecting the disk
> >> detection...
> >
> > I haven't been able to reproduce this yet.  My test system boots fine.
> > Which fedora are you testing on?
> 
> I reproduced something similar and fedora 28.  So I think I have found
> and fixed the issue.  I believe I simply reversed the test at the end of
> kernfs_dir_pos. AKA "<" instead of ">".

Though too late, I used fedora 28 and RHEL7.5.

I applied this fix to your patch and the system boot was successfully done.

I'll start testing your patch from now on.

> 
> I am going to see if I can test my changes more throughly on this side
> and then repost.
> 
> 
> 
> >>>  fs/kernfs/dir.c | 109
> >>> ++--
> >>>  1 file changed, 67 insertions(+), 42 deletions(-)
> >>>
> >>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> >>> index 89d1dc19340b..8148b5fec48d 100644
> >>> --- a/fs/kernfs/dir.c
> >>> +++ b/fs/kernfs/dir.c
> >>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode
> *inode,
> >>> struct file *filp)
> >>>   return 0;
> >>>  }
> >>>
> >>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> >>> +{
> >>> + struct rb_node *node = rb_next(>rb);
> >>> + return node ? rb_to_kn(node) : NULL;
> >>> +}
> >>> +
> >>>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> >>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
> >>>  {
> >>> - if (pos) {
> >>> - int valid = kernfs_active(pos) &&
> >>> - pos->parent == parent && hash == pos->hash;
> >>> - kernfs_put(pos);
> >>> - if (!valid)
> >>> - pos = NULL;
> >>> - }
> >>> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
> >>> - struct rb_node *node = parent->dir.children.rb_node;
> >>> - while (node) {
> >>> - pos = rb_to_kn(node);
> >>> -
> >>> - if (hash < pos->hash)
> >>> - node = node->rb_left;
> >>> - else if (hash > pos->hash)
> >>> - node = node->rb_right;
> >>> - else
> >>> - break;
> >>> + struct kernfs_node *pos;
> >>> + struct rb_node *node;
> >>> + unsigned int hash;
> >>> + const char *name = "";
> >>> +
> >>> + /* Is off a valid name hash? */
> >>> + if ((off < 2) || (off >= INT_MAX))
> >>> + return NULL;
> >>> + hash = off;
> >>> +
> >>> + /* Is the saved position usable? */
> >>> + if (saved) {
> >>> + /* Proper parent and hash? */
> >>> + if ((parent != saved->parent) || (saved->hash != hash)) {
> >>> + saved = NULL;
> >>
> >> name is uninitialized in this path.
> >
> > It is.  name is initialized to "" see above.
> >
> >>> + } else {
> >>> + if (kernfs_active(saved))
> >>> + return saved;
> >>> + name = saved->name;
> >>>   }
> >>>   }
> >>&

RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

2018-06-04 Thread Hatayama, Daisuke



> -Original Message-
> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> Sent: Monday, June 4, 2018 11:45 PM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ;
> 't...@kernel.org' ; Okajima, Toshiyuki 
> ; linux-kernel@vger.kernel.org;
> 'ebied...@aristanetworks.com' 
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> "Hatayama, Daisuke"  writes:
> 
> > Hello,
> >
> > Thanks for this patch, Eric.
> >
> >> -Original Message-
> >> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> >> Sent: Monday, June 4, 2018 3:51 AM
> >> To: Hatayama, Daisuke 
> >> Cc: 'gre...@linuxfoundation.org' ;
> >> 't...@kernel.org' ; Okajima, Toshiyuki
> >> ; linux-kernel@vger.kernel.org;
> >> 'ebied...@aristanetworks.com' 
> >> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> >>
> >>
> >> Over the years two bugs have crept into the code that handled the last
> >> returned kernfs directory being deleted, and handled general seeking
> >> in kernfs directories.  The result is that reading the /sys/module
> >> directory while moduled are loading and unloading will sometimes
> >> skip over a module that was present for the entire duration of
> >> the directory read.
> >>
> >> The first bug is that kernfs_dir_next_pos was advancing the position
> >> in the directory even when kernfs_dir_pos had found the starting
> >> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
> >> guaranteed to return the directory entry past the starting directory
> >> entry, making it so that advancing the directory position is wrong.
> >>
> >> The second bug is that kernfs_dir_pos when the saved kernfs_node
> >> did not validate, was not always returning a directory after
> >> the the target position, but was sometimes returning the directory
> >> entry just before the target position.
> >>
> >> To correct these issues and to make the code easily readable and
> >> comprehensible several cleanups are made.
> >>
> >> - A function kernfs_dir_next is factored out to make it straight-forward
> >>   to find the next node in a kernfs directory.
> >>
> >> - A function kernfs_dir_skip_pos is factored out to skip over directory
> >>   entries that should not be shown to userspace.  This function is called
> >>   from kernfs_fop_readdir directly removing the complication of calling
> >>   it from the other directory advancement functions.
> >>
> >> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
> >>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
> >>   and ensuring the directory position advancment functions can reference
> >>   the saved node without complications.
> >>
> >> - The function kernfs_dir_next_pos is modified to only advance the 
> >> directory
> >>   position in the common case when kernfs_dir_pos validates and returns
> >>   the starting kernfs node.  For all other cases kernfs_dir_pos is
> guaranteed
> >>   to return a directory position in advance of the starting directory
> position.
> >>
> >> - kernfs_dir_pos is given a significant make over.  It's essense remains
> the
> >>   same but some siginficant changes were made.
> >>
> >>   + The offset is validated to be a valid hash value at the very beginning.
> >> The validation is updated to handle the fact that neither 0 nor 1 are
> >> valid hash values.
> >>
> >>   + An extra test is added at the end of the rbtree lookup to ensure
> >> the returned position is at or beyond the target position.
> >>
> >>   + kernfs_name_compare is used during the rbtree lookup instead of just
> >> comparing
> >> the hash.  This allows the the passed namespace parameter to be used,
> and
> >> if it is available from the saved entry the old saved name as well.
> >>
> >>   + The validation of the saved kernfs node now checks for two cases.
> >> If the saved node is returnable.
> >> If the name of the saved node is usable during lookup.
> >>
> >> In net this should result in a easier to understand, and more correct
> >> version of directory traversal for kernfs directories.
> >>
> >> Reported-by: "Hatayama, Daisuke" 
> >> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number 

RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

2018-06-04 Thread Hatayama, Daisuke



> -Original Message-
> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> Sent: Monday, June 4, 2018 11:45 PM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ;
> 't...@kernel.org' ; Okajima, Toshiyuki 
> ; linux-kernel@vger.kernel.org;
> 'ebied...@aristanetworks.com' 
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> "Hatayama, Daisuke"  writes:
> 
> > Hello,
> >
> > Thanks for this patch, Eric.
> >
> >> -Original Message-
> >> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> >> Sent: Monday, June 4, 2018 3:51 AM
> >> To: Hatayama, Daisuke 
> >> Cc: 'gre...@linuxfoundation.org' ;
> >> 't...@kernel.org' ; Okajima, Toshiyuki
> >> ; linux-kernel@vger.kernel.org;
> >> 'ebied...@aristanetworks.com' 
> >> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> >>
> >>
> >> Over the years two bugs have crept into the code that handled the last
> >> returned kernfs directory being deleted, and handled general seeking
> >> in kernfs directories.  The result is that reading the /sys/module
> >> directory while moduled are loading and unloading will sometimes
> >> skip over a module that was present for the entire duration of
> >> the directory read.
> >>
> >> The first bug is that kernfs_dir_next_pos was advancing the position
> >> in the directory even when kernfs_dir_pos had found the starting
> >> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
> >> guaranteed to return the directory entry past the starting directory
> >> entry, making it so that advancing the directory position is wrong.
> >>
> >> The second bug is that kernfs_dir_pos when the saved kernfs_node
> >> did not validate, was not always returning a directory after
> >> the the target position, but was sometimes returning the directory
> >> entry just before the target position.
> >>
> >> To correct these issues and to make the code easily readable and
> >> comprehensible several cleanups are made.
> >>
> >> - A function kernfs_dir_next is factored out to make it straight-forward
> >>   to find the next node in a kernfs directory.
> >>
> >> - A function kernfs_dir_skip_pos is factored out to skip over directory
> >>   entries that should not be shown to userspace.  This function is called
> >>   from kernfs_fop_readdir directly removing the complication of calling
> >>   it from the other directory advancement functions.
> >>
> >> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
> >>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
> >>   and ensuring the directory position advancment functions can reference
> >>   the saved node without complications.
> >>
> >> - The function kernfs_dir_next_pos is modified to only advance the 
> >> directory
> >>   position in the common case when kernfs_dir_pos validates and returns
> >>   the starting kernfs node.  For all other cases kernfs_dir_pos is
> guaranteed
> >>   to return a directory position in advance of the starting directory
> position.
> >>
> >> - kernfs_dir_pos is given a significant make over.  It's essense remains
> the
> >>   same but some siginficant changes were made.
> >>
> >>   + The offset is validated to be a valid hash value at the very beginning.
> >> The validation is updated to handle the fact that neither 0 nor 1 are
> >> valid hash values.
> >>
> >>   + An extra test is added at the end of the rbtree lookup to ensure
> >> the returned position is at or beyond the target position.
> >>
> >>   + kernfs_name_compare is used during the rbtree lookup instead of just
> >> comparing
> >> the hash.  This allows the the passed namespace parameter to be used,
> and
> >> if it is available from the saved entry the old saved name as well.
> >>
> >>   + The validation of the saved kernfs node now checks for two cases.
> >> If the saved node is returnable.
> >> If the name of the saved node is usable during lookup.
> >>
> >> In net this should result in a easier to understand, and more correct
> >> version of directory traversal for kernfs directories.
> >>
> >> Reported-by: "Hatayama, Daisuke" 
> >> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number 

RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-06-04 Thread Hatayama, Daisuke



> -Original Message-
> From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org'
> Sent: Saturday, June 2, 2018 2:07 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ; Okajima,
> Toshiyuki ;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> 
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> Hello,
> 
> On Fri, Jun 01, 2018 at 09:25:32AM +, Hatayama, Daisuke wrote:
> > kernfs_dir_pos() checks if a kernfs_node object given as one of its
> > arguments is still active and if so returns it, or returns a
> > kernfs_node object that is most equal (possibly smaller and larger) to
> > the given object.
> 
> Sometimes they're duplicate operations tho, which is exactly the bug
> the posted patch is trying to fix.  What I'm suggesting is instead of
> leaving both instances and skipping one conditionally, put them in one
> place and trigger only when necessary.  The sequence of operations
> would be exactly the same.  The only difference is how the code is
> organized.
> 

I see and I think Eric's patch is written as you suggest very well.

> > kernfs_dir_next_pos() returns a kernfs_node object that is next to the
> > object given by kernfs_dir_pos().
> >
> > Two functions does different things and both need to skip inactive
> > nodes. I don't think it natural to remove the skip only from
> > kernfs_dir_pos().
> >
> > OTOH, throughout getdents(), there is no case that the kernfs_node
> > object given to kernfs_dir_pos() is used afterwards in the
> > processing. So, is it enough to provide kernfs_dir_next_pos() only?
> > Then, the skip code is now not duplicated.
> >
> > The patch below is my thought. How do you think?
> >
> > But note that this patch has some bug so that system boot get hang
> > without detecting root filesystem disk :) I'm debugging this now.
> 
> I haven't looked into the code that closely but given that we had
> cases where both skippings were fine and not, the condition is likely
> gonna be a bit tricker?

I agree to this version looks a bit tricker. But I think once the skipping
code is separated as Eric's patch, it would be resolved naturally.

> 
> Thanks.
> 
> --
> tejun
> 




RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-06-04 Thread Hatayama, Daisuke



> -Original Message-
> From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org'
> Sent: Saturday, June 2, 2018 2:07 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ; Okajima,
> Toshiyuki ;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> 
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> Hello,
> 
> On Fri, Jun 01, 2018 at 09:25:32AM +, Hatayama, Daisuke wrote:
> > kernfs_dir_pos() checks if a kernfs_node object given as one of its
> > arguments is still active and if so returns it, or returns a
> > kernfs_node object that is most equal (possibly smaller and larger) to
> > the given object.
> 
> Sometimes they're duplicate operations tho, which is exactly the bug
> the posted patch is trying to fix.  What I'm suggesting is instead of
> leaving both instances and skipping one conditionally, put them in one
> place and trigger only when necessary.  The sequence of operations
> would be exactly the same.  The only difference is how the code is
> organized.
> 

I see and I think Eric's patch is written as you suggest very well.

> > kernfs_dir_next_pos() returns a kernfs_node object that is next to the
> > object given by kernfs_dir_pos().
> >
> > Two functions does different things and both need to skip inactive
> > nodes. I don't think it natural to remove the skip only from
> > kernfs_dir_pos().
> >
> > OTOH, throughout getdents(), there is no case that the kernfs_node
> > object given to kernfs_dir_pos() is used afterwards in the
> > processing. So, is it enough to provide kernfs_dir_next_pos() only?
> > Then, the skip code is now not duplicated.
> >
> > The patch below is my thought. How do you think?
> >
> > But note that this patch has some bug so that system boot get hang
> > without detecting root filesystem disk :) I'm debugging this now.
> 
> I haven't looked into the code that closely but given that we had
> cases where both skippings were fine and not, the condition is likely
> gonna be a bit tricker?

I agree to this version looks a bit tricker. But I think once the skipping
code is separated as Eric's patch, it would be resolved naturally.

> 
> Thanks.
> 
> --
> tejun
> 




RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

2018-06-04 Thread Hatayama, Daisuke
Hello,

Thanks for this patch, Eric.

> -Original Message-
> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> Sent: Monday, June 4, 2018 3:51 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ;
> 't...@kernel.org' ; Okajima, Toshiyuki 
> ; linux-kernel@vger.kernel.org;
> 'ebied...@aristanetworks.com' 
> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> 
> Over the years two bugs have crept into the code that handled the last
> returned kernfs directory being deleted, and handled general seeking
> in kernfs directories.  The result is that reading the /sys/module
> directory while moduled are loading and unloading will sometimes
> skip over a module that was present for the entire duration of
> the directory read.
> 
> The first bug is that kernfs_dir_next_pos was advancing the position
> in the directory even when kernfs_dir_pos had found the starting
> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
> guaranteed to return the directory entry past the starting directory
> entry, making it so that advancing the directory position is wrong.
> 
> The second bug is that kernfs_dir_pos when the saved kernfs_node
> did not validate, was not always returning a directory after
> the the target position, but was sometimes returning the directory
> entry just before the target position.
> 
> To correct these issues and to make the code easily readable and
> comprehensible several cleanups are made.
> 
> - A function kernfs_dir_next is factored out to make it straight-forward
>   to find the next node in a kernfs directory.
> 
> - A function kernfs_dir_skip_pos is factored out to skip over directory
>   entries that should not be shown to userspace.  This function is called
>   from kernfs_fop_readdir directly removing the complication of calling
>   it from the other directory advancement functions.
> 
> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
>   and ensuring the directory position advancment functions can reference
>   the saved node without complications.
> 
> - The function kernfs_dir_next_pos is modified to only advance the directory
>   position in the common case when kernfs_dir_pos validates and returns
>   the starting kernfs node.  For all other cases kernfs_dir_pos is guaranteed
>   to return a directory position in advance of the starting directory 
> position.
> 
> - kernfs_dir_pos is given a significant make over.  It's essense remains the
>   same but some siginficant changes were made.
> 
>   + The offset is validated to be a valid hash value at the very beginning.
> The validation is updated to handle the fact that neither 0 nor 1 are
> valid hash values.
> 
>   + An extra test is added at the end of the rbtree lookup to ensure
> the returned position is at or beyond the target position.
> 
>   + kernfs_name_compare is used during the rbtree lookup instead of just
> comparing
> the hash.  This allows the the passed namespace parameter to be used, and
> if it is available from the saved entry the old saved name as well.
> 
>   + The validation of the saved kernfs node now checks for two cases.
> If the saved node is returnable.
>     If the name of the saved node is usable during lookup.
> 
> In net this should result in a easier to understand, and more correct
> version of directory traversal for kernfs directories.
>
> Reported-by: "Hatayama, Daisuke" 
> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
> scalability v2")
> Signed-off-by: "Eric W. Biederman" 
> ---
> 
> Can you test this and please verify it fixes your issue?
>

I tried this patch on top of v4.17 but the system fails to boot
without detecting root disks by dracut like this:

[  196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  199.945999] dracut-initqueue[324]: Warning: dracut-initque

RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

2018-06-04 Thread Hatayama, Daisuke
Hello,

Thanks for this patch, Eric.

> -Original Message-
> From: Eric W. Biederman [mailto:ebied...@xmission.com]
> Sent: Monday, June 4, 2018 3:51 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ;
> 't...@kernel.org' ; Okajima, Toshiyuki 
> ; linux-kernel@vger.kernel.org;
> 'ebied...@aristanetworks.com' 
> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> 
> Over the years two bugs have crept into the code that handled the last
> returned kernfs directory being deleted, and handled general seeking
> in kernfs directories.  The result is that reading the /sys/module
> directory while moduled are loading and unloading will sometimes
> skip over a module that was present for the entire duration of
> the directory read.
> 
> The first bug is that kernfs_dir_next_pos was advancing the position
> in the directory even when kernfs_dir_pos had found the starting
> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
> guaranteed to return the directory entry past the starting directory
> entry, making it so that advancing the directory position is wrong.
> 
> The second bug is that kernfs_dir_pos when the saved kernfs_node
> did not validate, was not always returning a directory after
> the the target position, but was sometimes returning the directory
> entry just before the target position.
> 
> To correct these issues and to make the code easily readable and
> comprehensible several cleanups are made.
> 
> - A function kernfs_dir_next is factored out to make it straight-forward
>   to find the next node in a kernfs directory.
> 
> - A function kernfs_dir_skip_pos is factored out to skip over directory
>   entries that should not be shown to userspace.  This function is called
>   from kernfs_fop_readdir directly removing the complication of calling
>   it from the other directory advancement functions.
> 
> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
>   and ensuring the directory position advancment functions can reference
>   the saved node without complications.
> 
> - The function kernfs_dir_next_pos is modified to only advance the directory
>   position in the common case when kernfs_dir_pos validates and returns
>   the starting kernfs node.  For all other cases kernfs_dir_pos is guaranteed
>   to return a directory position in advance of the starting directory 
> position.
> 
> - kernfs_dir_pos is given a significant make over.  It's essense remains the
>   same but some siginficant changes were made.
> 
>   + The offset is validated to be a valid hash value at the very beginning.
> The validation is updated to handle the fact that neither 0 nor 1 are
> valid hash values.
> 
>   + An extra test is added at the end of the rbtree lookup to ensure
> the returned position is at or beyond the target position.
> 
>   + kernfs_name_compare is used during the rbtree lookup instead of just
> comparing
> the hash.  This allows the the passed namespace parameter to be used, and
> if it is available from the saved entry the old saved name as well.
> 
>   + The validation of the saved kernfs node now checks for two cases.
> If the saved node is returnable.
>     If the name of the saved node is usable during lookup.
> 
> In net this should result in a easier to understand, and more correct
> version of directory traversal for kernfs directories.
>
> Reported-by: "Hatayama, Daisuke" 
> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
> scalability v2")
> Signed-off-by: "Eric W. Biederman" 
> ---
> 
> Can you test this and please verify it fixes your issue?
>

I tried this patch on top of v4.17 but the system fails to boot
without detecting root disks by dracut like this:

[  196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - 
starting timeout scripts
[  199.945999] dracut-initqueue[324]: Warning: dracut-initque

RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-06-01 Thread Hatayama, Daisuke
> -Original Message-
> From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org'
> Sent: Wednesday, May 30, 2018 1:26 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ; Okajima,
> Toshiyuki ;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> 
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> Hello,
> 
> On Mon, May 28, 2018 at 12:54:03PM +, Hatayama, Daisuke wrote:
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 89d1dc1..3aeeb7a 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode 
> > *inode,
> struct file *filp)
> >  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> >  {
> > +   struct kernfs_node *orig = pos;
> > +
> > pos = kernfs_dir_pos(ns, parent, ino, pos);
> > -   if (pos) {
> > +   if (pos && kernfs_sd_compare(pos, orig) <= 0) {
> 
> Hmm... the code seems a bit unintuitive to me and I wonder whether
> it's because there are two identical skipping loops in
> kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to
> selectively disable one of them.  Wouldn't it make more sense to get
> rid of it from kernfs_dir_pos() and skip explicitly only when
> necessary?
> 

kernfs_dir_pos() checks if a kernfs_node object given as one of its
arguments is still active and if so returns it, or returns a
kernfs_node object that is most equal (possibly smaller and larger) to
the given object.

kernfs_dir_next_pos() returns a kernfs_node object that is next to the
object given by kernfs_dir_pos().

Two functions does different things and both need to skip inactive
nodes. I don't think it natural to remove the skip only from
kernfs_dir_pos().

OTOH, throughout getdents(), there is no case that the kernfs_node
object given to kernfs_dir_pos() is used afterwards in the
processing. So, is it enough to provide kernfs_dir_next_pos() only?
Then, the skip code is now not duplicated.

The patch below is my thought. How do you think?

But note that this patch has some bug so that system boot get hang
without detecting root filesystem disk :) I'm debugging this now.

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..140706f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,9 +1584,11 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
return 0;
 }

-static struct kernfs_node *kernfs_dir_pos(const void *ns,
+static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+   struct kernfs_node *orig = pos;
+
if (pos) {
int valid = kernfs_active(pos) &&
pos->parent == parent && hash == pos->hash;
@@ -1608,7 +1610,9 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
}
}
/* Skip over entries which are dying/dead or in the wrong namespace */
-   while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+   while (pos && (!kernfs_active(pos) ||
+  (!orig || kernfs_sd_compare(pos, orig) <= 0) ||
+  pos->ns != ns)) {
struct rb_node *node = rb_next(>rb);
if (!node)
pos = NULL;
@@ -1618,22 +1622,6 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
return pos;
 }

-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
-   struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
-{
-   pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
-   do {
-   struct rb_node *node = rb_next(>rb);
-   if (!node)
-   pos = NULL;
-   else
-   pos = rb_to_kn(node);
-   } while (pos && (!kernfs_active(pos) || pos->ns != ns));
-   }
-   return pos;
-}
-
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
struct dentry *dentry = file->f_path.dentry;
@@ -1648,7 +1636,7 @@ static int kernfs_fop_readdir(struct file *file, struct 
dir_context *ctx)
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;

-   for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
+   for (pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos);
 pos;
 pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
const char *name = pos->name;

Thanks.
HATAYAMA, Daisuke




RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-06-01 Thread Hatayama, Daisuke
> -Original Message-
> From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of 't...@kernel.org'
> Sent: Wednesday, May 30, 2018 1:26 AM
> To: Hatayama, Daisuke 
> Cc: 'gre...@linuxfoundation.org' ; Okajima,
> Toshiyuki ;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> 
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> Hello,
> 
> On Mon, May 28, 2018 at 12:54:03PM +, Hatayama, Daisuke wrote:
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 89d1dc1..3aeeb7a 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode 
> > *inode,
> struct file *filp)
> >  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> >  {
> > +   struct kernfs_node *orig = pos;
> > +
> > pos = kernfs_dir_pos(ns, parent, ino, pos);
> > -   if (pos) {
> > +   if (pos && kernfs_sd_compare(pos, orig) <= 0) {
> 
> Hmm... the code seems a bit unintuitive to me and I wonder whether
> it's because there are two identical skipping loops in
> kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to
> selectively disable one of them.  Wouldn't it make more sense to get
> rid of it from kernfs_dir_pos() and skip explicitly only when
> necessary?
> 

kernfs_dir_pos() checks if a kernfs_node object given as one of its
arguments is still active and if so returns it, or returns a
kernfs_node object that is most equal (possibly smaller and larger) to
the given object.

kernfs_dir_next_pos() returns a kernfs_node object that is next to the
object given by kernfs_dir_pos().

Two functions does different things and both need to skip inactive
nodes. I don't think it natural to remove the skip only from
kernfs_dir_pos().

OTOH, throughout getdents(), there is no case that the kernfs_node
object given to kernfs_dir_pos() is used afterwards in the
processing. So, is it enough to provide kernfs_dir_next_pos() only?
Then, the skip code is now not duplicated.

The patch below is my thought. How do you think?

But note that this patch has some bug so that system boot get hang
without detecting root filesystem disk :) I'm debugging this now.

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..140706f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,9 +1584,11 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
return 0;
 }

-static struct kernfs_node *kernfs_dir_pos(const void *ns,
+static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+   struct kernfs_node *orig = pos;
+
if (pos) {
int valid = kernfs_active(pos) &&
pos->parent == parent && hash == pos->hash;
@@ -1608,7 +1610,9 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
}
}
/* Skip over entries which are dying/dead or in the wrong namespace */
-   while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+   while (pos && (!kernfs_active(pos) ||
+  (!orig || kernfs_sd_compare(pos, orig) <= 0) ||
+  pos->ns != ns)) {
struct rb_node *node = rb_next(>rb);
if (!node)
pos = NULL;
@@ -1618,22 +1622,6 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
return pos;
 }

-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
-   struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
-{
-   pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
-   do {
-   struct rb_node *node = rb_next(>rb);
-   if (!node)
-   pos = NULL;
-   else
-   pos = rb_to_kn(node);
-   } while (pos && (!kernfs_active(pos) || pos->ns != ns));
-   }
-   return pos;
-}
-
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
struct dentry *dentry = file->f_path.dentry;
@@ -1648,7 +1636,7 @@ static int kernfs_fop_readdir(struct file *file, struct 
dir_context *ctx)
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;

-   for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
+   for (pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos);
 pos;
 pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
const char *name = pos->name;

Thanks.
HATAYAMA, Daisuke




RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-05-28 Thread Hatayama, Daisuke
Hi,

I attach a reproducer for this issue.

Expand repro.tar.gz and run two commands:

  # make
  # ./repro.sh

Then, repro.sh will terminate when the issue is reproduced.

In this reproducer, we prepare AtvbAC0jwH.ko and U1cN9ZbARQ.ko having the same 
hash value.
Then, install U1cN9ZbARQ.ko and then repeating installing and removing 
AtvbAC0jwH.ko over and over.
Note that AtvbAC0jwH is smaller than U1cN9ZbARQ in the string comparison.
The issue is that output of ls -1 /sys/module contains NO U1cN9ZbARQ.ko.

I found a pair of AtvbAC0jwH and U1cN9ZbARQ using kernhash.c, retrieved from 
fs/kernfs/dirs.c,
contained in repro.tar.gz like:

# ./kernfshash
AtvbAC0jwH U1cN9ZbARQ 6b2609c5

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke
> Sent: Monday, May 28, 2018 9:54 PM
> To: 'gre...@linuxfoundation.org' <gre...@linuxfoundation.org>;
> 't...@kernel.org' <t...@kernel.org>
> Cc: Okajima, Toshiyuki <toshi.okaj...@jp.fujitsu.com>;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> <ebied...@aristanetworks.com>
> Subject: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
> 
> To fix this issue, try to find a next node only when the returned
> object is less than or equal to the original one.
> 
> Note that this implementation focuses on getting guarantee that the
> existing nodes are never skipped, not interested in the other nodes
> that are added or removed during the process.
> 
> We found this issue during a stress test that repeatedly reads
> /sys/module directory to get a list of the currently loaded kernel
> modules while repeatedly loading and unloading kernel modules
> simultaneously.
> 
> v2: Fix the case where nodes with the same hash but with the name
> larger than the original node could still be skipped. Use
> kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
> description.
> 
> Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
> Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com>
> Cc: Eric W. Biederman <ebied...@aristanetworks.com>
> ---
>  fs/kernfs/dir.c |4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..3aeeb7a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>   struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>  {
> + struct kernfs_node *orig = pos;
> +
>   pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> + if (pos && kernfs_sd_compare(pos, orig) <= 0) {
>   do {
>   struct rb_node *node = rb_next(>rb);
>   if (!node)
> --
> 1.7.1
> 
> 
> 
> 
> 
> 



repro.tar.gz
Description: repro.tar.gz


RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-05-28 Thread Hatayama, Daisuke
Hi,

I attach a reproducer for this issue.

Expand repro.tar.gz and run two commands:

  # make
  # ./repro.sh

Then, repro.sh will terminate when the issue is reproduced.

In this reproducer, we prepare AtvbAC0jwH.ko and U1cN9ZbARQ.ko having the same 
hash value.
Then, install U1cN9ZbARQ.ko and then repeating installing and removing 
AtvbAC0jwH.ko over and over.
Note that AtvbAC0jwH is smaller than U1cN9ZbARQ in the string comparison.
The issue is that output of ls -1 /sys/module contains NO U1cN9ZbARQ.ko.

I found a pair of AtvbAC0jwH and U1cN9ZbARQ using kernhash.c, retrieved from 
fs/kernfs/dirs.c,
contained in repro.tar.gz like:

# ./kernfshash
AtvbAC0jwH U1cN9ZbARQ 6b2609c5

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke
> Sent: Monday, May 28, 2018 9:54 PM
> To: 'gre...@linuxfoundation.org' ;
> 't...@kernel.org' 
> Cc: Okajima, Toshiyuki ;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> 
> Subject: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
> 
> To fix this issue, try to find a next node only when the returned
> object is less than or equal to the original one.
> 
> Note that this implementation focuses on getting guarantee that the
> existing nodes are never skipped, not interested in the other nodes
> that are added or removed during the process.
> 
> We found this issue during a stress test that repeatedly reads
> /sys/module directory to get a list of the currently loaded kernel
> modules while repeatedly loading and unloading kernel modules
> simultaneously.
> 
> v2: Fix the case where nodes with the same hash but with the name
> larger than the original node could still be skipped. Use
> kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
> description.
> 
> Signed-off-by: HATAYAMA Daisuke 
> Suggested-by: Toshiyuki Okajima 
> Cc: Eric W. Biederman 
> ---
>  fs/kernfs/dir.c |4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..3aeeb7a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>   struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>  {
> + struct kernfs_node *orig = pos;
> +
>   pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> + if (pos && kernfs_sd_compare(pos, orig) <= 0) {
>   do {
>   struct rb_node *node = rb_next(>rb);
>   if (!node)
> --
> 1.7.1
> 
> 
> 
> 
> 
> 



repro.tar.gz
Description: repro.tar.gz


[RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-05-28 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object is less than or equal to the original one.

Note that this implementation focuses on getting guarantee that the
existing nodes are never skipped, not interested in the other nodes
that are added or removed during the process.

We found this issue during a stress test that repeatedly reads
/sys/module directory to get a list of the currently loaded kernel
modules while repeatedly loading and unloading kernel modules
simultaneously.

v2: Fix the case where nodes with the same hash but with the name
larger than the original node could still be skipped. Use
kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
description.

Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com>
Cc: Eric W. Biederman <ebied...@aristanetworks.com>
---
 fs/kernfs/dir.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..3aeeb7a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
 static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
+   struct kernfs_node *orig = pos;
+
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && kernfs_sd_compare(pos, orig) <= 0) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1







[RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-05-28 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object is less than or equal to the original one.

Note that this implementation focuses on getting guarantee that the
existing nodes are never skipped, not interested in the other nodes
that are added or removed during the process.

We found this issue during a stress test that repeatedly reads
/sys/module directory to get a list of the currently loaded kernel
modules while repeatedly loading and unloading kernel modules
simultaneously.

v2: Fix the case where nodes with the same hash but with the name
larger than the original node could still be skipped. Use
kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
description.

Signed-off-by: HATAYAMA Daisuke 
Suggested-by: Toshiyuki Okajima 
Cc: Eric W. Biederman 
---
 fs/kernfs/dir.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..3aeeb7a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
 static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
+   struct kernfs_node *orig = pos;
+
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && kernfs_sd_compare(pos, orig) <= 0) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1







[PATCH v2] kernfs: fix dentry unexpected skip

2018-05-21 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object is less than or equal to the original one.

Note that this implementation focuses on getting guarantee that the
existing nodes are never skipped, not interested in the other nodes
that are added or removed during the process.

We found this issue during a stress test that repeatedly reads
/sys/module directory to get a list of the currently loaded kernel
modules while repeatedly loading and unloading kernel modules
simultaneously.

v2: Fix the case where nodes with the same hash but with the name
larger than the original node could still be skipped. Use
kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
description.

Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com>
Cc: Eric W. Biederman <ebied...@aristanetworks.com>
---
 fs/kernfs/dir.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..3aeeb7a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
 static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
+   struct kernfs_node *orig = pos;
+
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && kernfs_sd_compare(pos, orig) <= 0) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1




[PATCH v2] kernfs: fix dentry unexpected skip

2018-05-21 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object is less than or equal to the original one.

Note that this implementation focuses on getting guarantee that the
existing nodes are never skipped, not interested in the other nodes
that are added or removed during the process.

We found this issue during a stress test that repeatedly reads
/sys/module directory to get a list of the currently loaded kernel
modules while repeatedly loading and unloading kernel modules
simultaneously.

v2: Fix the case where nodes with the same hash but with the name
larger than the original node could still be skipped. Use
kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
description.

Signed-off-by: HATAYAMA Daisuke 
Suggested-by: Toshiyuki Okajima 
Cc: Eric W. Biederman 
---
 fs/kernfs/dir.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..3aeeb7a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
 static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
+   struct kernfs_node *orig = pos;
+
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && kernfs_sd_compare(pos, orig) <= 0) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1




RE: [PATCH] kernfs: fix dentry unexpected skip

2018-05-20 Thread Hatayama, Daisuke


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke
> Sent: Saturday, May 19, 2018 12:43 AM
> To: 'gre...@linuxfoundation.org' <gre...@linuxfoundation.org>
> Cc: Okajima, Toshiyuki/岡嶋 寿行 <toshi.okaj...@jp.fujitsu.com>;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> <ebied...@aristanetworks.com>
> Subject: [PATCH] kernfs: fix dentry unexpected skip
> 
> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
> 
> To fix this issue, try to find a next node only when the returned
> object has a hash value equal to or smaller than the original one.
> 
> Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
> Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com>
> Cc: Eric W. Biederman <ebied...@aristanetworks.com>
> ---
>  fs/kernfs/dir.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..8a2f49c 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
>   struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>  {
>   pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> + if (pos && pos->hash <= ino) {

I found this condition still misses the case that the returend pos with
the same hash value but with different name is skipped. I'll post
v2 patch.

>   do {
>   struct rb_node *node = rb_next(>rb);
>   if (!node)
> --
> 1.7.1
> 
> 
> 




RE: [PATCH] kernfs: fix dentry unexpected skip

2018-05-20 Thread Hatayama, Daisuke


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Hatayama, Daisuke
> Sent: Saturday, May 19, 2018 12:43 AM
> To: 'gre...@linuxfoundation.org' 
> Cc: Okajima, Toshiyuki/岡嶋 寿行 ;
> linux-kernel@vger.kernel.org; 'ebied...@aristanetworks.com'
> 
> Subject: [PATCH] kernfs: fix dentry unexpected skip
> 
> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
> 
> To fix this issue, try to find a next node only when the returned
> object has a hash value equal to or smaller than the original one.
> 
> Signed-off-by: HATAYAMA Daisuke 
> Suggested-by: Toshiyuki Okajima 
> Cc: Eric W. Biederman 
> ---
>  fs/kernfs/dir.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..8a2f49c 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
>   struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>  {
>   pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> + if (pos && pos->hash <= ino) {

I found this condition still misses the case that the returend pos with
the same hash value but with different name is skipped. I'll post
v2 patch.

>   do {
>   struct rb_node *node = rb_next(>rb);
>   if (!node)
> --
> 1.7.1
> 
> 
> 




[PATCH] kernfs: fix dentry unexpected skip

2018-05-18 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object has a hash value equal to or smaller than the original one.

Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com>
Cc: Eric W. Biederman <ebied...@aristanetworks.com>
---
 fs/kernfs/dir.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..8a2f49c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && pos->hash <= ino) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1




[PATCH] kernfs: fix dentry unexpected skip

2018-05-18 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object has a hash value equal to or smaller than the original one.

Signed-off-by: HATAYAMA Daisuke 
Suggested-by: Toshiyuki Okajima 
Cc: Eric W. Biederman 
---
 fs/kernfs/dir.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..8a2f49c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && pos->hash <= ino) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1




[PATCH] kernfs: fix dentry unexpected skip

2018-05-18 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object has a hash value equal to or smaller than the original one.

Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
Suggested-by: Toshiyuki Okajima <toshi.okaj...@jp.fujitsu.com>
Cc: Eric W. Biederman <ebied...@aristanetworks.com>
---
 fs/kernfs/dir.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..8a2f49c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && pos->hash <= ino) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1




[PATCH] kernfs: fix dentry unexpected skip

2018-05-18 Thread Hatayama, Daisuke
kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object has a hash value equal to or smaller than the original one.

Signed-off-by: HATAYAMA Daisuke 
Suggested-by: Toshiyuki Okajima 
Cc: Eric W. Biederman 
---
 fs/kernfs/dir.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..8a2f49c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1622,7 +1622,7 @@ static int kernfs_dir_fop_release(struct inode *inode, 
struct file *filp)
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
pos = kernfs_dir_pos(ns, parent, ino, pos);
-   if (pos) {
+   if (pos && pos->hash <= ino) {
do {
struct rb_node *node = rb_next(>rb);
if (!node)
-- 
1.7.1




RE: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support

2017-11-12 Thread Hatayama, Daisuke
Marc-Andre,

It looks to me that the 4th and 5th patches somehow has not been sent.
Could you send them, too?
I'd like them to actually build and run the kernel for testing.

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Marc-Andre Lureau
> Sent: Wednesday, November 8, 2017 1:24 AM
> To: linux-kernel@vger.kernel.org
> Cc: so...@cmu.edu; qemu-de...@nongnu.org; m...@redhat.com; Marc-André Lureau
> 
> Subject: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support
> 
> Hi,
> 
> This series adds DMA operations support to the qemu fw_cfg kernel
> module and populates "etc/vmcoreinfo" with vmcoreinfo location
> details.
> 
> Note: the support for this entry handling has been merged for next
> qemu release (2.11)
> 
> v5:
> - resent to CC kdump people on the paddr_vmcoreinfo_note() export patch
> 
> v4:
> - export paddr_vmcoreinfo_note() to fix fw_cfg.ko build
> - fix build with !CONFIG_CRASH_CORE
> - replace the unbounded yield() loop with a usleep_range() loop and a
>   200ms timeout
> - do not write vmcoreinfo entry when running the kdump kernel (D. Hatayama)
> - drop the experimental sysfs write support patch from this series
> 
> v3: (thanks kbuild)
> - add "fw_cfg: fix the command line module name" patch
> - fix build of "fw_cfg: add DMA register" with CONFIG_FW_CFG_SYSFS_CMDLINE=y
> - fix 'Wshift-count-overflow'
> 
> v2:
> - use platform device for dma mapping
> - add etc/vmcoreinfo patch
> - some code cleanups
> 
> Marc-André Lureau (5):
>   fw_cfg: fix the command line module name
>   fw_cfg: add DMA register
>   fw_cfg: do DMA read operation
>   crash: export paddr_vmcoreinfo_note()
>   fw_cfg: write vmcoreinfo details
> 
>  drivers/firmware/qemu_fw_cfg.c | 292
> +
>  kernel/crash_core.c|   1 +
>  2 files changed, 264 insertions(+), 29 deletions(-)
> 
> --
> 2.15.0.125.g8f49766d64
> 
> 



RE: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support

2017-11-12 Thread Hatayama, Daisuke
Marc-Andre,

It looks to me that the 4th and 5th patches somehow has not been sent.
Could you send them, too?
I'd like them to actually build and run the kernel for testing.

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Marc-Andre Lureau
> Sent: Wednesday, November 8, 2017 1:24 AM
> To: linux-kernel@vger.kernel.org
> Cc: so...@cmu.edu; qemu-de...@nongnu.org; m...@redhat.com; Marc-André Lureau
> 
> Subject: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support
> 
> Hi,
> 
> This series adds DMA operations support to the qemu fw_cfg kernel
> module and populates "etc/vmcoreinfo" with vmcoreinfo location
> details.
> 
> Note: the support for this entry handling has been merged for next
> qemu release (2.11)
> 
> v5:
> - resent to CC kdump people on the paddr_vmcoreinfo_note() export patch
> 
> v4:
> - export paddr_vmcoreinfo_note() to fix fw_cfg.ko build
> - fix build with !CONFIG_CRASH_CORE
> - replace the unbounded yield() loop with a usleep_range() loop and a
>   200ms timeout
> - do not write vmcoreinfo entry when running the kdump kernel (D. Hatayama)
> - drop the experimental sysfs write support patch from this series
> 
> v3: (thanks kbuild)
> - add "fw_cfg: fix the command line module name" patch
> - fix build of "fw_cfg: add DMA register" with CONFIG_FW_CFG_SYSFS_CMDLINE=y
> - fix 'Wshift-count-overflow'
> 
> v2:
> - use platform device for dma mapping
> - add etc/vmcoreinfo patch
> - some code cleanups
> 
> Marc-André Lureau (5):
>   fw_cfg: fix the command line module name
>   fw_cfg: add DMA register
>   fw_cfg: do DMA read operation
>   crash: export paddr_vmcoreinfo_note()
>   fw_cfg: write vmcoreinfo details
> 
>  drivers/firmware/qemu_fw_cfg.c | 292
> +
>  kernel/crash_core.c|   1 +
>  2 files changed, 264 insertions(+), 29 deletions(-)
> 
> --
> 2.15.0.125.g8f49766d64
> 
> 



RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs

2017-10-31 Thread Hatayama, Daisuke


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin
> Sent: Tuesday, October 31, 2017 1:19 AM
> To: Hatayama, Daisuke <d.hatay...@jp.fujitsu.com>
> Cc: 'marcandre.lur...@redhat.com' <marcandre.lur...@redhat.com>;
> 'so...@cmu.edu' <so...@cmu.edu>; linux-kernel@vger.kernel.org;
> 'qemu-de...@nongnu.org' <qemu-de...@nongnu.org>
> Subject: Re: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs
> 
> On Mon, Oct 30, 2017 at 11:07:20AM +, Hatayama, Daisuke wrote:
> > > From: Marc-André Lureau <marcandre.lur...@redhat.com>
> > >
> > > Since qemu 2.9, DMA write operations are allowed. However, usage of this
> > > interface from kernel or user-space is strongly discouraged by the
> > > maintainers. This patch is meant for experimentations for now.
> > >
> >
> > Could you (or maintainers?) tell me how experimental the DMA write
> > operations from kernel are?
> >
> > From some technical reason?
> > Or simply there has not been enough test yet so far?
> 
> The concern is security, talking from userspace to hypervisor
> might become a trivial DOS vector.
> 
> If there's need for a specific entry to be accessible from userspace,
> I'd rather white-list it.
> 

Thanks for the explanation.

If so, I guess the white-list is done in userspace qemu side,
and no additional change would be needed for this kernel-side patch set for the 
white-list feature.
Is this understanding correct?

> > > Signed-off-by: Marc-André Lureau <marcandre.lur...@redhat.com>
> > > ---
> > >  drivers/firmware/qemu_fw_cfg.c | 21 -
> > >  1 file changed, 20 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/firmware/qemu_fw_cfg.c
> b/drivers/firmware/qemu_fw_cfg.c
> > > index 54b569da3257..e2f2ad1c9c0c 100644
> > > --- a/drivers/firmware/qemu_fw_cfg.c
> > > +++ b/drivers/firmware/qemu_fw_cfg.c
> > > @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file 
> > > *filp,
> struct kobject *kobj,
> > >   return fw_cfg_read_blob(entry->f.select, buf, pos, count, true);
> > >  }
> > >
> > > +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject
> *kobj,
> > > +   struct bin_attribute *bin_attr,
> > > +   char *buf, loff_t pos, size_t count)
> > > +{
> > > + struct fw_cfg_sysfs_entry *entry = to_entry(kobj);
> > > +
> > > + if (!fw_cfg_dma_enabled())
> > > + return -ENOTSUPP;
> > > +
> > > + if (pos > entry->f.size)
> > > + return -EINVAL;
> > > +
> > > + if (count > entry->f.size - pos)
> > > + count = entry->f.size - pos;
> > > +
> > > + return fw_cfg_write_blob(entry->f.select, buf, pos, count);
> > > +}
> > > +
> > >  static struct bin_attribute fw_cfg_sysfs_attr_raw = {
> > > - .attr = { .name = "raw", .mode = S_IRUSR },
> > > + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR },
> > >   .read = fw_cfg_sysfs_read_raw,
> > > + .write = fw_cfg_sysfs_write_raw,
> > >  };
> > >
> > >  /*
> > > --
> > > 2.14.1.146.gd35faa819
> >
> > Thanks.
> > HATAYAMA, Daisuke
> >
> 



RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs

2017-10-31 Thread Hatayama, Daisuke


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin
> Sent: Tuesday, October 31, 2017 1:19 AM
> To: Hatayama, Daisuke 
> Cc: 'marcandre.lur...@redhat.com' ;
> 'so...@cmu.edu' ; linux-kernel@vger.kernel.org;
> 'qemu-de...@nongnu.org' 
> Subject: Re: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs
> 
> On Mon, Oct 30, 2017 at 11:07:20AM +, Hatayama, Daisuke wrote:
> > > From: Marc-André Lureau 
> > >
> > > Since qemu 2.9, DMA write operations are allowed. However, usage of this
> > > interface from kernel or user-space is strongly discouraged by the
> > > maintainers. This patch is meant for experimentations for now.
> > >
> >
> > Could you (or maintainers?) tell me how experimental the DMA write
> > operations from kernel are?
> >
> > From some technical reason?
> > Or simply there has not been enough test yet so far?
> 
> The concern is security, talking from userspace to hypervisor
> might become a trivial DOS vector.
> 
> If there's need for a specific entry to be accessible from userspace,
> I'd rather white-list it.
> 

Thanks for the explanation.

If so, I guess the white-list is done in userspace qemu side,
and no additional change would be needed for this kernel-side patch set for the 
white-list feature.
Is this understanding correct?

> > > Signed-off-by: Marc-André Lureau 
> > > ---
> > >  drivers/firmware/qemu_fw_cfg.c | 21 -
> > >  1 file changed, 20 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/firmware/qemu_fw_cfg.c
> b/drivers/firmware/qemu_fw_cfg.c
> > > index 54b569da3257..e2f2ad1c9c0c 100644
> > > --- a/drivers/firmware/qemu_fw_cfg.c
> > > +++ b/drivers/firmware/qemu_fw_cfg.c
> > > @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file 
> > > *filp,
> struct kobject *kobj,
> > >   return fw_cfg_read_blob(entry->f.select, buf, pos, count, true);
> > >  }
> > >
> > > +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject
> *kobj,
> > > +   struct bin_attribute *bin_attr,
> > > +   char *buf, loff_t pos, size_t count)
> > > +{
> > > + struct fw_cfg_sysfs_entry *entry = to_entry(kobj);
> > > +
> > > + if (!fw_cfg_dma_enabled())
> > > + return -ENOTSUPP;
> > > +
> > > + if (pos > entry->f.size)
> > > + return -EINVAL;
> > > +
> > > + if (count > entry->f.size - pos)
> > > + count = entry->f.size - pos;
> > > +
> > > +     return fw_cfg_write_blob(entry->f.select, buf, pos, count);
> > > +}
> > > +
> > >  static struct bin_attribute fw_cfg_sysfs_attr_raw = {
> > > - .attr = { .name = "raw", .mode = S_IRUSR },
> > > + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR },
> > >   .read = fw_cfg_sysfs_read_raw,
> > > + .write = fw_cfg_sysfs_write_raw,
> > >  };
> > >
> > >  /*
> > > --
> > > 2.14.1.146.gd35faa819
> >
> > Thanks.
> > HATAYAMA, Daisuke
> >
> 



RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs

2017-10-30 Thread Hatayama, Daisuke
> From: Marc-André Lureau <marcandre.lur...@redhat.com>
>
> Since qemu 2.9, DMA write operations are allowed. However, usage of this
> interface from kernel or user-space is strongly discouraged by the
> maintainers. This patch is meant for experimentations for now.
>

Could you (or maintainers?) tell me how experimental the DMA write
operations from kernel are?

From some technical reason?
Or simply there has not been enough test yet so far?

> Signed-off-by: Marc-André Lureau <marcandre.lur...@redhat.com>
> ---
>  drivers/firmware/qemu_fw_cfg.c | 21 -
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c
> index 54b569da3257..e2f2ad1c9c0c 100644
> --- a/drivers/firmware/qemu_fw_cfg.c
> +++ b/drivers/firmware/qemu_fw_cfg.c
> @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file *filp, 
> struct kobject *kobj,
>   return fw_cfg_read_blob(entry->f.select, buf, pos, count, true);
>  }
>
> +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject 
> *kobj,
> +   struct bin_attribute *bin_attr,
> +   char *buf, loff_t pos, size_t count)
> +{
> + struct fw_cfg_sysfs_entry *entry = to_entry(kobj);
> +
> + if (!fw_cfg_dma_enabled())
> + return -ENOTSUPP;
> +
> + if (pos > entry->f.size)
> + return -EINVAL;
> +
> + if (count > entry->f.size - pos)
> + count = entry->f.size - pos;
> +
> + return fw_cfg_write_blob(entry->f.select, buf, pos, count);
> +}
> +
>  static struct bin_attribute fw_cfg_sysfs_attr_raw = {
> - .attr = { .name = "raw", .mode = S_IRUSR },
> + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR },
>   .read = fw_cfg_sysfs_read_raw,
> + .write = fw_cfg_sysfs_write_raw,
>  };
>
>  /*
> --
> 2.14.1.146.gd35faa819

Thanks.
HATAYAMA, Daisuke



RE: [PATCH v3 5/5] RFC: fw_cfg: add DMA write operation in sysfs

2017-10-30 Thread Hatayama, Daisuke
> From: Marc-André Lureau 
>
> Since qemu 2.9, DMA write operations are allowed. However, usage of this
> interface from kernel or user-space is strongly discouraged by the
> maintainers. This patch is meant for experimentations for now.
>

Could you (or maintainers?) tell me how experimental the DMA write
operations from kernel are?

From some technical reason?
Or simply there has not been enough test yet so far?

> Signed-off-by: Marc-André Lureau 
> ---
>  drivers/firmware/qemu_fw_cfg.c | 21 -
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c
> index 54b569da3257..e2f2ad1c9c0c 100644
> --- a/drivers/firmware/qemu_fw_cfg.c
> +++ b/drivers/firmware/qemu_fw_cfg.c
> @@ -524,9 +524,28 @@ static ssize_t fw_cfg_sysfs_read_raw(struct file *filp, 
> struct kobject *kobj,
>   return fw_cfg_read_blob(entry->f.select, buf, pos, count, true);
>  }
>
> +static ssize_t fw_cfg_sysfs_write_raw(struct file *filp, struct kobject 
> *kobj,
> +   struct bin_attribute *bin_attr,
> +   char *buf, loff_t pos, size_t count)
> +{
> + struct fw_cfg_sysfs_entry *entry = to_entry(kobj);
> +
> + if (!fw_cfg_dma_enabled())
> + return -ENOTSUPP;
> +
> + if (pos > entry->f.size)
> + return -EINVAL;
> +
> + if (count > entry->f.size - pos)
> + count = entry->f.size - pos;
> +
> + return fw_cfg_write_blob(entry->f.select, buf, pos, count);
> +}
> +
>  static struct bin_attribute fw_cfg_sysfs_attr_raw = {
> - .attr = { .name = "raw", .mode = S_IRUSR },
> + .attr = { .name = "raw", .mode = S_IRUSR | S_IWUSR },
>   .read = fw_cfg_sysfs_read_raw,
> + .write = fw_cfg_sysfs_write_raw,
>  };
>
>  /*
> --
> 2.14.1.146.gd35faa819

Thanks.
HATAYAMA, Daisuke



Re: [PATCH V2] proc-vmcore: wrong data type casting fix

2016-03-13 Thread HATAYAMA Daisuke
From: Baoquan He <b...@redhat.com>
Subject: Re: [PATCH V2] proc-vmcore: wrong data type casting fix
Date: Mon, 14 Mar 2016 11:50:53 +0800

> On 03/14/16 at 11:47am, Baoquan He wrote:
>> On 03/14/16 at 12:25pm, HATAYAMA Daisuke wrote:
>> > From: Dave Young <dyo...@redhat.com>
>> > Subject: [PATCH V2] proc-vmcore: wrong data type casting fix
>> > Date: Fri, 11 Mar 2016 16:42:48 +0800
>> > 
>> > > On i686 PAE enabled machine the contiguous physical area could be large
>> > > and it can cause trimming down variables in below calculation in
>> > > read_vmcore() and mmap_vmcore():
>> > > 
>> > >  tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
>> > > 
>> > > Then the real size passed down is not correct any more.
>> > > Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then
>> > 
>> > That is, size_t and loff_t are defined as follows on i686:
>> > 
>> > (gdb) ptype size_t
>> > type = unsigned int
>> > (gdb) ptype loff_t
>> > type = long long int
>> > 
>> > So casting by size_t means truncating a given value by 4GB.
>> > 
>> > Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB,
>> > and is aligned with 4GB, the resulted value is 0, and
>> 
>> Truncating doesn't mean align. If (m->offset + m->size - *fpos) is
>> larger than 4G, E.g 0x1000f, the truncating result will be 0xf.
>>

Thanks for pointing out. I was confused with truncation and
alignment. I wanted to say here...

>> > 
>> > > we will get tsz = 0. It is of course not an expected result.
>> 
>> We won't always get "tsz=0", just get the lower 32 bit value.

truncating the upper 32 bits and just getting the lower 32 bits as you
say.

> 
> OK, didn't get you still have saying in next paragraph. Ignore this
> please.
>
>> 
>> > > 
>> > > During our tests there are two problems caused by it:
>> > > 1) read_vmcore will refuse to continue so makedumpfile fails.
>> > > 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().
>> > >
>> > 
>> > we reach these errors.
>> > 
>> > If (m->offset + m->size - *fpos) is not aligned with 4GB,
>> > read_vmcore() or mmap_vmcore() is performed with the truncated
>> > non-zero value as size (of course, this is also not expected value but
>> > the execution doesn't result in error). Then, fpos proceeds so that
>> > (m->offset + m->size - *fpos) is aligned with 4GB in the next loop,
>> > and we after all reach the errors.
>> > 
>> > I think your patch description needs a bit more detail.
>> > 
>> > It seems good to me that the patch itself.
>> > 
>> > > Use unsigned long long in min_t instead so that the variables are not
>> > > truncated.
>> > > 
>> > > Signed-off-by: Baoquan He <b...@redhat.com>
>> > > Signed-off-by: Dave Young <dyo...@redhat.com>
>> > > ---
>> > > v1->v2: spelling fix in patch log
>> > >  fs/proc/vmcore.c |7 +--
>> > >  1 file changed, 5 insertions(+), 2 deletions(-)
>> > > 
>> > > --- linux-x86.orig/fs/proc/vmcore.c
>> > > +++ linux-x86/fs/proc/vmcore.c
>> > > @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe
>> > >  
>> > >  list_for_each_entry(m, _list, list) {
>> > >  if (*fpos < m->offset + m->size) {
>> > > -tsz = min_t(size_t, m->offset + m->size - 
>> > > *fpos, buflen);
>> > > +tsz = (size_t)min_t(unsigned long long,
>> > > +m->offset + m->size - *fpos,
>> > > +buflen);
>> > >  start = m->paddr + *fpos - m->offset;
>> > >      tmp = read_from_oldmem(buffer, tsz, , 
>> > > userbuf);
>> > >  if (tmp < 0)
>> > > @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file
>> > >  if (start < m->offset + m->size) {
>> > >  u64 paddr = 0;
>> > >  
>> > > -tsz = min_t(size_t, m->offset + m->size - 
>> > > start, size);
>> > > +tsz = (size_t)min_t(unsigned long long,
>> > > +m->offset + m->size - 
>> > > start, size);
>> > >  paddr = m->paddr + start - m->offset;
>> > >  if (vmcore_remap_oldmem_pfn(vma, vma->vm_start 
>> > > + len,
>> > >  paddr >> 
>> > > PAGE_SHIFT, tsz,
>> > > 
>> > --
>> > Thanks.
>> > HATAYAMA, Daisuke
> 
--
Thanks.
HATAYAMA, Daisuke


Re: [PATCH V2] proc-vmcore: wrong data type casting fix

2016-03-13 Thread HATAYAMA Daisuke
From: Baoquan He 
Subject: Re: [PATCH V2] proc-vmcore: wrong data type casting fix
Date: Mon, 14 Mar 2016 11:50:53 +0800

> On 03/14/16 at 11:47am, Baoquan He wrote:
>> On 03/14/16 at 12:25pm, HATAYAMA Daisuke wrote:
>> > From: Dave Young 
>> > Subject: [PATCH V2] proc-vmcore: wrong data type casting fix
>> > Date: Fri, 11 Mar 2016 16:42:48 +0800
>> > 
>> > > On i686 PAE enabled machine the contiguous physical area could be large
>> > > and it can cause trimming down variables in below calculation in
>> > > read_vmcore() and mmap_vmcore():
>> > > 
>> > >  tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
>> > > 
>> > > Then the real size passed down is not correct any more.
>> > > Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then
>> > 
>> > That is, size_t and loff_t are defined as follows on i686:
>> > 
>> > (gdb) ptype size_t
>> > type = unsigned int
>> > (gdb) ptype loff_t
>> > type = long long int
>> > 
>> > So casting by size_t means truncating a given value by 4GB.
>> > 
>> > Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB,
>> > and is aligned with 4GB, the resulted value is 0, and
>> 
>> Truncating doesn't mean align. If (m->offset + m->size - *fpos) is
>> larger than 4G, E.g 0x1000f, the truncating result will be 0xf.
>>

Thanks for pointing out. I was confused with truncation and
alignment. I wanted to say here...

>> > 
>> > > we will get tsz = 0. It is of course not an expected result.
>> 
>> We won't always get "tsz=0", just get the lower 32 bit value.

truncating the upper 32 bits and just getting the lower 32 bits as you
say.

> 
> OK, didn't get you still have saying in next paragraph. Ignore this
> please.
>
>> 
>> > > 
>> > > During our tests there are two problems caused by it:
>> > > 1) read_vmcore will refuse to continue so makedumpfile fails.
>> > > 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().
>> > >
>> > 
>> > we reach these errors.
>> > 
>> > If (m->offset + m->size - *fpos) is not aligned with 4GB,
>> > read_vmcore() or mmap_vmcore() is performed with the truncated
>> > non-zero value as size (of course, this is also not expected value but
>> > the execution doesn't result in error). Then, fpos proceeds so that
>> > (m->offset + m->size - *fpos) is aligned with 4GB in the next loop,
>> > and we after all reach the errors.
>> > 
>> > I think your patch description needs a bit more detail.
>> > 
>> > It seems good to me that the patch itself.
>> > 
>> > > Use unsigned long long in min_t instead so that the variables are not
>> > > truncated.
>> > > 
>> > > Signed-off-by: Baoquan He 
>> > > Signed-off-by: Dave Young 
>> > > ---
>> > > v1->v2: spelling fix in patch log
>> > >  fs/proc/vmcore.c |7 +--
>> > >  1 file changed, 5 insertions(+), 2 deletions(-)
>> > > 
>> > > --- linux-x86.orig/fs/proc/vmcore.c
>> > > +++ linux-x86/fs/proc/vmcore.c
>> > > @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe
>> > >  
>> > >  list_for_each_entry(m, _list, list) {
>> > >  if (*fpos < m->offset + m->size) {
>> > > -tsz = min_t(size_t, m->offset + m->size - 
>> > > *fpos, buflen);
>> > > +tsz = (size_t)min_t(unsigned long long,
>> > > +m->offset + m->size - *fpos,
>> > > +buflen);
>> > >  start = m->paddr + *fpos - m->offset;
>> > >      tmp = read_from_oldmem(buffer, tsz, , 
>> > > userbuf);
>> > >  if (tmp < 0)
>> > > @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file
>> > >  if (start < m->offset + m->size) {
>> > >  u64 paddr = 0;
>> > >  
>> > > -tsz = min_t(size_t, m->offset + m->size - 
>> > > start, size);
>> > > +tsz = (size_t)min_t(unsigned long long,
>> > > +m->offset + m->size - 
>> > > start, size);
>> > >  paddr = m->paddr + start - m->offset;
>> > >  if (vmcore_remap_oldmem_pfn(vma, vma->vm_start 
>> > > + len,
>> > >  paddr >> 
>> > > PAGE_SHIFT, tsz,
>> > > 
>> > --
>> > Thanks.
>> > HATAYAMA, Daisuke
> 
--
Thanks.
HATAYAMA, Daisuke


Re: [PATCH V2] proc-vmcore: wrong data type casting fix

2016-03-13 Thread HATAYAMA Daisuke
From: Dave Young <dyo...@redhat.com>
Subject: [PATCH V2] proc-vmcore: wrong data type casting fix
Date: Fri, 11 Mar 2016 16:42:48 +0800

> On i686 PAE enabled machine the contiguous physical area could be large
> and it can cause trimming down variables in below calculation in
> read_vmcore() and mmap_vmcore():
> 
>   tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
> 
> Then the real size passed down is not correct any more.
> Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then

That is, size_t and loff_t are defined as follows on i686:

(gdb) ptype size_t
type = unsigned int
(gdb) ptype loff_t
type = long long int

So casting by size_t means truncating a given value by 4GB.

Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB,
and is aligned with 4GB, the resulted value is 0, and

> we will get tsz = 0. It is of course not an expected result.
> 
> During our tests there are two problems caused by it:
> 1) read_vmcore will refuse to continue so makedumpfile fails.
> 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().
>

we reach these errors.

If (m->offset + m->size - *fpos) is not aligned with 4GB,
read_vmcore() or mmap_vmcore() is performed with the truncated
non-zero value as size (of course, this is also not expected value but
the execution doesn't result in error). Then, fpos proceeds so that
(m->offset + m->size - *fpos) is aligned with 4GB in the next loop,
and we after all reach the errors.

I think your patch description needs a bit more detail.

It seems good to me that the patch itself.

> Use unsigned long long in min_t instead so that the variables are not
> truncated.
> 
> Signed-off-by: Baoquan He <b...@redhat.com>
> Signed-off-by: Dave Young <dyo...@redhat.com>
> ---
> v1->v2: spelling fix in patch log
>  fs/proc/vmcore.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> --- linux-x86.orig/fs/proc/vmcore.c
> +++ linux-x86/fs/proc/vmcore.c
> @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe
>  
>   list_for_each_entry(m, _list, list) {
>   if (*fpos < m->offset + m->size) {
> - tsz = min_t(size_t, m->offset + m->size - *fpos, 
> buflen);
> + tsz = (size_t)min_t(unsigned long long,
> + m->offset + m->size - *fpos,
> + buflen);
>   start = m->paddr + *fpos - m->offset;
>   tmp = read_from_oldmem(buffer, tsz, , userbuf);
>   if (tmp < 0)
> @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file
>   if (start < m->offset + m->size) {
>   u64 paddr = 0;
>  
> - tsz = min_t(size_t, m->offset + m->size - start, size);
> + tsz = (size_t)min_t(unsigned long long,
> + m->offset + m->size - start, size);
>   paddr = m->paddr + start - m->offset;
>   if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
>   paddr >> PAGE_SHIFT, tsz,
> 
--
Thanks.
HATAYAMA, Daisuke


Re: [PATCH V2] proc-vmcore: wrong data type casting fix

2016-03-13 Thread HATAYAMA Daisuke
From: Dave Young 
Subject: [PATCH V2] proc-vmcore: wrong data type casting fix
Date: Fri, 11 Mar 2016 16:42:48 +0800

> On i686 PAE enabled machine the contiguous physical area could be large
> and it can cause trimming down variables in below calculation in
> read_vmcore() and mmap_vmcore():
> 
>   tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
> 
> Then the real size passed down is not correct any more.
> Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then

That is, size_t and loff_t are defined as follows on i686:

(gdb) ptype size_t
type = unsigned int
(gdb) ptype loff_t
type = long long int

So casting by size_t means truncating a given value by 4GB.

Then, if (m->offset + m->size - *fpos) is equal to or larger than 4GB,
and is aligned with 4GB, the resulted value is 0, and

> we will get tsz = 0. It is of course not an expected result.
> 
> During our tests there are two problems caused by it:
> 1) read_vmcore will refuse to continue so makedumpfile fails.
> 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().
>

we reach these errors.

If (m->offset + m->size - *fpos) is not aligned with 4GB,
read_vmcore() or mmap_vmcore() is performed with the truncated
non-zero value as size (of course, this is also not expected value but
the execution doesn't result in error). Then, fpos proceeds so that
(m->offset + m->size - *fpos) is aligned with 4GB in the next loop,
and we after all reach the errors.

I think your patch description needs a bit more detail.

It seems good to me that the patch itself.

> Use unsigned long long in min_t instead so that the variables are not
> truncated.
> 
> Signed-off-by: Baoquan He 
> Signed-off-by: Dave Young 
> ---
> v1->v2: spelling fix in patch log
>  fs/proc/vmcore.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> --- linux-x86.orig/fs/proc/vmcore.c
> +++ linux-x86/fs/proc/vmcore.c
> @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe
>  
>   list_for_each_entry(m, _list, list) {
>   if (*fpos < m->offset + m->size) {
> - tsz = min_t(size_t, m->offset + m->size - *fpos, 
> buflen);
> + tsz = (size_t)min_t(unsigned long long,
> + m->offset + m->size - *fpos,
> + buflen);
>   start = m->paddr + *fpos - m->offset;
>   tmp = read_from_oldmem(buffer, tsz, , userbuf);
>   if (tmp < 0)
> @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file
>   if (start < m->offset + m->size) {
>   u64 paddr = 0;
>  
> - tsz = min_t(size_t, m->offset + m->size - start, size);
> + tsz = (size_t)min_t(unsigned long long,
> + m->offset + m->size - start, size);
>       paddr = m->paddr + start - m->offset;
>   if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
>   paddr >> PAGE_SHIFT, tsz,
> 
--
Thanks.
HATAYAMA, Daisuke


[PATCH v3 0/2] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-05-15 Thread HATAYAMA Daisuke
This patch set fixes a bug introduced in the commit
f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45.

For the v2 discussion, please follow the following thread:

  http://thread.gmane.org/gmane.linux.kernel/1902488

The v3 patch newly adds a patch suggested by Ingo Molnar.

---

HATAYAMA Daisuke (2):
  kernel/panic: call the 2nd crash_kexec() only if 
crash_kexec_post_notifiers is enabled
  kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops 
path


 include/linux/kernel.h |3 +++
 kernel/kexec.c |   11 +++
 kernel/panic.c |5 +++--
 3 files changed, 17 insertions(+), 2 deletions(-)

--
Thanks.
HATAYAMA, Daisuke
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 2/2] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-05-15 Thread HATAYAMA Daisuke
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
"crash_kexec_post_notifiers" kernel boot option, which toggles
wheather panic() calls crash_kexec() before panic_notifiers and dump
kmsg or after.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
"crash_kexec_post_notifiers" in the condition of kexec_should_crash().

Also, put a comment in kexec_should_crash() to explain not obvious
things on this patch.

Signed-off-by: HATAYAMA Daisuke 
Acked-by: Baoquan He 
Tested-by: Hidehiro Kawai 
Reviewed-by: Masami Hiramatsu 
---
 include/linux/kernel.h |3 +++
 kernel/kexec.c |   11 +++
 kernel/panic.c |2 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3a5b48e..bd5331c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -438,6 +438,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 7a36fdc..a785c10 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,17 @@ struct resource crashk_low_res = {
 
 int kexec_should_crash(struct task_struct *p)
 {
+   /*
+* If crash_kexec_post_notifiers is enabled, don't run
+* crash_kexec() here yet, which must be run after panic
+* notifiers in panic().
+*/
+   if (crash_kexec_post_notifiers)
+   return 0;
+   /*
+* There are 4 panic() calls in do_exit() path, each of which
+* corresponds to each of these 4 conditions.
+*/
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 774614f..04e91ff 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/2] kernel/panic: call the 2nd crash_kexec() only if crash_kexec_post_notifiers is enabled

2015-05-15 Thread HATAYAMA Daisuke
For compatibility with the behaviour before the commit
f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45, the 2nd crash_kexec() should
be called only if crash_kexec_post_notifiers is enabled.

Note that crash_kexec() returns immediately if kdump crash kernel is
not loaded, so in this case, this patch makes no functionality change,
but the point is to make it explicit, from the caller panic() side,
that the 2nd crash_kexec() does nothing.

Signed-off-by: HATAYAMA Daisuke 
Suggested-by: Ingo Molnar 
---
 kernel/panic.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index 8136ad7..774614f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -142,7 +142,8 @@ void panic(const char *fmt, ...)
 * Note: since some panic_notifiers can make crashed kernel
 * more unstable, it can increase risks of the kdump failure too.
 */
-   crash_kexec(NULL);
+   if (crash_kexec_post_notifiers)
+   crash_kexec(NULL);
 
bust_spinlocks(0);
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/2] kernel/panic: call the 2nd crash_kexec() only if crash_kexec_post_notifiers is enabled

2015-05-15 Thread HATAYAMA Daisuke
For compatibility with the behaviour before the commit
f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45, the 2nd crash_kexec() should
be called only if crash_kexec_post_notifiers is enabled.

Note that crash_kexec() returns immediately if kdump crash kernel is
not loaded, so in this case, this patch makes no functionality change,
but the point is to make it explicit, from the caller panic() side,
that the 2nd crash_kexec() does nothing.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Suggested-by: Ingo Molnar mi...@kernel.org
---
 kernel/panic.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index 8136ad7..774614f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -142,7 +142,8 @@ void panic(const char *fmt, ...)
 * Note: since some panic_notifiers can make crashed kernel
 * more unstable, it can increase risks of the kdump failure too.
 */
-   crash_kexec(NULL);
+   if (crash_kexec_post_notifiers)
+   crash_kexec(NULL);
 
bust_spinlocks(0);
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 2/2] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-05-15 Thread HATAYAMA Daisuke
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
crash_kexec_post_notifiers kernel boot option, which toggles
wheather panic() calls crash_kexec() before panic_notifiers and dump
kmsg or after.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
crash_kexec_post_notifiers in the condition of kexec_should_crash().

Also, put a comment in kexec_should_crash() to explain not obvious
things on this patch.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Acked-by: Baoquan He b...@redhat.com
Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com
Reviewed-by: Masami Hiramatsu masami.hiramatsu...@hitachi.com
---
 include/linux/kernel.h |3 +++
 kernel/kexec.c |   11 +++
 kernel/panic.c |2 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3a5b48e..bd5331c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -438,6 +438,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 7a36fdc..a785c10 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,17 @@ struct resource crashk_low_res = {
 
 int kexec_should_crash(struct task_struct *p)
 {
+   /*
+* If crash_kexec_post_notifiers is enabled, don't run
+* crash_kexec() here yet, which must be run after panic
+* notifiers in panic().
+*/
+   if (crash_kexec_post_notifiers)
+   return 0;
+   /*
+* There are 4 panic() calls in do_exit() path, each of which
+* corresponds to each of these 4 conditions.
+*/
if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 774614f..04e91ff 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 0/2] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-05-15 Thread HATAYAMA Daisuke
This patch set fixes a bug introduced in the commit
f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45.

For the v2 discussion, please follow the following thread:

  http://thread.gmane.org/gmane.linux.kernel/1902488

The v3 patch newly adds a patch suggested by Ingo Molnar.

---

HATAYAMA Daisuke (2):
  kernel/panic: call the 2nd crash_kexec() only if 
crash_kexec_post_notifiers is enabled
  kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops 
path


 include/linux/kernel.h |3 +++
 kernel/kexec.c |   11 +++
 kernel/panic.c |5 +++--
 3 files changed, 17 insertions(+), 2 deletions(-)

--
Thanks.
HATAYAMA, Daisuke
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-03-06 Thread Hatayama, Daisuke/畑山 大輔
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
"crash_kexec_post_notifiers" kernel boot option, which toggles
wheather panic() calls crash_kexec() before panic_notifiers and dump
kmsg or after.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
"crash_kexec_post_notifiers" in the condition of kexec_should_crash().

Also, put a comment in kexec_should_crash() to explain not obvious
things on this patch.

Signed-off-by: HATAYAMA Daisuke 
Acked-by: Baoquan He 
Tested-by: Hidehiro Kawai 
Reviewed-by: Masami Hiramatsu 
---
 include/linux/kernel.h |  3 +++
 kernel/kexec.c | 11 +++
 kernel/panic.c |  2 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index d6d630d..07483c7 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 38c25b1..5bf6077 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,17 @@ struct resource crashk_low_res = {

 int kexec_should_crash(struct task_struct *p)
 {
+   /*
+* If crash_kexec_post_notifiers is enabled, don't run
+* crash_kexec() here yet, which must be run after panic
+* notifiers in panic().
+*/
+   if (crash_kexec_post_notifiers)
+   return 0;
+   /*
+* There are 4 panic() calls in do_exit() path, each of which
+* calls corresponds to each of these 4 conditions.
+*/
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 8136ad7..79ca912 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;

 int panic_timeout = CONFIG_PANIC_TIMEOUT;
-- 
1.9.3


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-03-06 Thread Hatayama, Daisuke/畑山 大輔
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
crash_kexec_post_notifiers kernel boot option, which toggles
wheather panic() calls crash_kexec() before panic_notifiers and dump
kmsg or after.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
crash_kexec_post_notifiers in the condition of kexec_should_crash().

Also, put a comment in kexec_should_crash() to explain not obvious
things on this patch.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Acked-by: Baoquan He b...@redhat.com
Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com
Reviewed-by: Masami Hiramatsu masami.hiramatsu...@hitachi.com
---
 include/linux/kernel.h |  3 +++
 kernel/kexec.c | 11 +++
 kernel/panic.c |  2 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index d6d630d..07483c7 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 38c25b1..5bf6077 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,17 @@ struct resource crashk_low_res = {

 int kexec_should_crash(struct task_struct *p)
 {
+   /*
+* If crash_kexec_post_notifiers is enabled, don't run
+* crash_kexec() here yet, which must be run after panic
+* notifiers in panic().
+*/
+   if (crash_kexec_post_notifiers)
+   return 0;
+   /*
+* There are 4 panic() calls in do_exit() path, each of which
+* calls corresponds to each of these 4 conditions.
+*/
if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 8136ad7..79ca912 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;

 int panic_timeout = CONFIG_PANIC_TIMEOUT;
-- 
1.9.3


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-03-05 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [RESEND PATCH] kernel/panic/kexec: fix 
"crash_kexec_post_notifiers" option issue in oops path
Date: Thu, 5 Mar 2015 17:22:04 -0500

> On Thu, Mar 05, 2015 at 05:19:30PM -0500, Vivek Goyal wrote:
>> On Wed, Mar 04, 2015 at 05:56:48PM +0900, HATAYAMA Daisuke wrote:
>> > The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
>> > "crash_kexec_post_notifiers" kernel boot option, which toggles
>> > wheather panic() calls crash_kexec() before or after panic_notifiers
>> > and dump kmsg.
>> > 
>> > The problem is that the commit overlooks panic_on_oops kernel boot
>> > option. If it is enabled, crash_kexec() is called directly without
>> > going through panic() in oops path.
>> > 
>> > To fix this issue, this patch adds a check to
>> > "crash_kexec_post_notifiers" in the condition of kexec_should_crash().
>> > 
>> > Signed-off-by: HATAYAMA Daisuke 
>> > Acked-by: Baoquan He 
>> > Tested-by: Hidehiro Kawai 
>> > ---
>> >  include/linux/kernel.h | 3 +++
>> >  kernel/kexec.c | 2 ++
>> >  kernel/panic.c | 2 +-
>> >  3 files changed, 6 insertions(+), 1 deletion(-)
>> > 
>> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
>> > index 64ce58b..f47379f 100644
>> > --- a/include/linux/kernel.h
>> > +++ b/include/linux/kernel.h
>> > @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
>> >  extern int panic_on_io_nmi;
>> >  extern int panic_on_warn;
>> >  extern int sysctl_panic_on_stackoverflow;
>> > +
>> > +extern bool crash_kexec_post_notifiers;
>> > +
>> >  /*
>> >   * Only to be used by arch init code. If the user over-wrote the default
>> >   * CONFIG_PANIC_TIMEOUT, honor it.
>> > diff --git a/kernel/kexec.c b/kernel/kexec.c
>> > index 9a8a01a..0ecf252 100644
>> > --- a/kernel/kexec.c
>> > +++ b/kernel/kexec.c
>> > @@ -84,6 +84,8 @@ struct resource crashk_low_res = {
>> >  
>> >  int kexec_should_crash(struct task_struct *p)
>> >  {
>> > +  if (crash_kexec_post_notifiers)
>> > +  return 0;
>> 
>> This is little confusing. So if crash_kexec_post_notifiers is set but
>> panic_on_oops is not set, still we will return?
>> 
>> Should we do this only if panic_on_oops is set? IOW, how about following
>> 
>>  if (panic_on_oops && crash_kexec_post_notifiers)
>>  return 0;
>> 
>> And then also put a comment explaining the rationale.
> 
> Ok, I went through the previous version of patch and discussion there
> which says that all the 4 conditions lead to panic. So putting above
> code should be fine.
> 
> Can you please atleast put a comment here to explain it as it was not
> obvious. Just mention that all the checks below lead to panic hence
> if user wants to run panic notifiers then don't run crash_kexec() yet.
> It will be run after panic notifiers.
> 

Thanks for your reviewing.

Yes, I'll put such new comment in the patch of next version.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-03-05 Thread HATAYAMA Daisuke
From: Vivek Goyal vgo...@redhat.com
Subject: Re: [RESEND PATCH] kernel/panic/kexec: fix 
crash_kexec_post_notifiers option issue in oops path
Date: Thu, 5 Mar 2015 17:22:04 -0500

 On Thu, Mar 05, 2015 at 05:19:30PM -0500, Vivek Goyal wrote:
 On Wed, Mar 04, 2015 at 05:56:48PM +0900, HATAYAMA Daisuke wrote:
  The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
  crash_kexec_post_notifiers kernel boot option, which toggles
  wheather panic() calls crash_kexec() before or after panic_notifiers
  and dump kmsg.
  
  The problem is that the commit overlooks panic_on_oops kernel boot
  option. If it is enabled, crash_kexec() is called directly without
  going through panic() in oops path.
  
  To fix this issue, this patch adds a check to
  crash_kexec_post_notifiers in the condition of kexec_should_crash().
  
  Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
  Acked-by: Baoquan He b...@redhat.com
  Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com
  ---
   include/linux/kernel.h | 3 +++
   kernel/kexec.c | 2 ++
   kernel/panic.c | 2 +-
   3 files changed, 6 insertions(+), 1 deletion(-)
  
  diff --git a/include/linux/kernel.h b/include/linux/kernel.h
  index 64ce58b..f47379f 100644
  --- a/include/linux/kernel.h
  +++ b/include/linux/kernel.h
  @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
   extern int panic_on_io_nmi;
   extern int panic_on_warn;
   extern int sysctl_panic_on_stackoverflow;
  +
  +extern bool crash_kexec_post_notifiers;
  +
   /*
* Only to be used by arch init code. If the user over-wrote the default
* CONFIG_PANIC_TIMEOUT, honor it.
  diff --git a/kernel/kexec.c b/kernel/kexec.c
  index 9a8a01a..0ecf252 100644
  --- a/kernel/kexec.c
  +++ b/kernel/kexec.c
  @@ -84,6 +84,8 @@ struct resource crashk_low_res = {
   
   int kexec_should_crash(struct task_struct *p)
   {
  +  if (crash_kexec_post_notifiers)
  +  return 0;
 
 This is little confusing. So if crash_kexec_post_notifiers is set but
 panic_on_oops is not set, still we will return?
 
 Should we do this only if panic_on_oops is set? IOW, how about following
 
  if (panic_on_oops  crash_kexec_post_notifiers)
  return 0;
 
 And then also put a comment explaining the rationale.
 
 Ok, I went through the previous version of patch and discussion there
 which says that all the 4 conditions lead to panic. So putting above
 code should be fine.
 
 Can you please atleast put a comment here to explain it as it was not
 obvious. Just mention that all the checks below lead to panic hence
 if user wants to run panic notifiers then don't run crash_kexec() yet.
 It will be run after panic notifiers.
 

Thanks for your reviewing.

Yes, I'll put such new comment in the patch of next version.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RESEND PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-03-04 Thread HATAYAMA Daisuke
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
"crash_kexec_post_notifiers" kernel boot option, which toggles
wheather panic() calls crash_kexec() before or after panic_notifiers
and dump kmsg.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
"crash_kexec_post_notifiers" in the condition of kexec_should_crash().

Signed-off-by: HATAYAMA Daisuke 
Acked-by: Baoquan He 
Tested-by: Hidehiro Kawai 
---
 include/linux/kernel.h | 3 +++
 kernel/kexec.c | 2 ++
 kernel/panic.c | 2 +-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 64ce58b..f47379f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 9a8a01a..0ecf252 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,8 @@ struct resource crashk_low_res = {
 
 int kexec_should_crash(struct task_struct *p)
 {
+   if (crash_kexec_post_notifiers)
+   return 0;
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..6582546 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
-- 
1.9.3


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RESEND PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-03-04 Thread HATAYAMA Daisuke
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
crash_kexec_post_notifiers kernel boot option, which toggles
wheather panic() calls crash_kexec() before or after panic_notifiers
and dump kmsg.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
crash_kexec_post_notifiers in the condition of kexec_should_crash().

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Acked-by: Baoquan He b...@redhat.com
Tested-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com
---
 include/linux/kernel.h | 3 +++
 kernel/kexec.c | 2 ++
 kernel/panic.c | 2 +-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 64ce58b..f47379f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 9a8a01a..0ecf252 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,8 @@ struct resource crashk_low_res = {
 
 int kexec_should_crash(struct task_struct *p)
 {
+   if (crash_kexec_post_notifiers)
+   return 0;
if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..6582546 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
-- 
1.9.3


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-02-20 Thread HATAYAMA, Daisuke

Hello Eric, Vivek,

Do you have any comment to this patch?

On 2015/02/05 17:59, HATAYAMA Daisuke wrote:

The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
"crash_kexec_post_notifiers" kernel boot option, which toggles
wheather panic() calls crash_kexec() before or after panic_notifiers
and dump kmsg.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
"crash_kexec_post_notifiers" in the condition of kexec_should_crash().

Signed-off-by: HATAYAMA Daisuke 
---
  include/linux/kernel.h | 3 +++
  kernel/kexec.c | 2 ++
  kernel/panic.c | 2 +-
  3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 64ce58b..f47379f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
  extern int panic_on_io_nmi;
  extern int panic_on_warn;
  extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
  /*
   * Only to be used by arch init code. If the user over-wrote the default
   * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 9a8a01a..0ecf252 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,8 @@ struct resource crashk_low_res = {

  int kexec_should_crash(struct task_struct *p)
  {
+   if (crash_kexec_post_notifiers)
+   return 0;
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..6582546 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
  static int pause_on_oops;
  static int pause_on_oops_flag;
  static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
  int panic_on_warn __read_mostly;

  int panic_timeout = CONFIG_PANIC_TIMEOUT;



--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-02-20 Thread HATAYAMA, Daisuke

Hello Eric, Vivek,

Do you have any comment to this patch?

On 2015/02/05 17:59, HATAYAMA Daisuke wrote:

The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
crash_kexec_post_notifiers kernel boot option, which toggles
wheather panic() calls crash_kexec() before or after panic_notifiers
and dump kmsg.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
crash_kexec_post_notifiers in the condition of kexec_should_crash().

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
  include/linux/kernel.h | 3 +++
  kernel/kexec.c | 2 ++
  kernel/panic.c | 2 +-
  3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 64ce58b..f47379f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
  extern int panic_on_io_nmi;
  extern int panic_on_warn;
  extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
  /*
   * Only to be used by arch init code. If the user over-wrote the default
   * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 9a8a01a..0ecf252 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,8 @@ struct resource crashk_low_res = {

  int kexec_should_crash(struct task_struct *p)
  {
+   if (crash_kexec_post_notifiers)
+   return 0;
if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..6582546 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
  static int pause_on_oops;
  static int pause_on_oops_flag;
  static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
  int panic_on_warn __read_mostly;

  int panic_timeout = CONFIG_PANIC_TIMEOUT;



--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-02-08 Thread HATAYAMA Daisuke
Hello,

From: Baoquan He 
Subject: Re: [PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" 
option issue in oops path
Date: Mon, 9 Feb 2015 10:40:30 +0800

> On 02/05/15 at 05:59pm, HATAYAMA Daisuke wrote:
>> The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
>> "crash_kexec_post_notifiers" kernel boot option, which toggles
>> wheather panic() calls crash_kexec() before or after panic_notifiers
>> and dump kmsg.
>> 
>> The problem is that the commit overlooks panic_on_oops kernel boot
>> option. If it is enabled, crash_kexec() is called directly without
>> going through panic() in oops path.
>> 
>> To fix this issue, this patch adds a check to
>> "crash_kexec_post_notifiers" in the condition of kexec_should_crash().
>> 
>> Signed-off-by: HATAYAMA Daisuke 
>> ---
>>  include/linux/kernel.h | 3 +++
>>  kernel/kexec.c | 2 ++
>>  kernel/panic.c | 2 +-
>>  3 files changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
>> index 64ce58b..f47379f 100644
>> --- a/include/linux/kernel.h
>> +++ b/include/linux/kernel.h
>> @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
>>  extern int panic_on_io_nmi;
>>  extern int panic_on_warn;
>>  extern int sysctl_panic_on_stackoverflow;
>> +
>> +extern bool crash_kexec_post_notifiers;
>> +
>>  /*
>>   * Only to be used by arch init code. If the user over-wrote the default
>>   * CONFIG_PANIC_TIMEOUT, honor it.
>> diff --git a/kernel/kexec.c b/kernel/kexec.c
>> index 9a8a01a..0ecf252 100644
>> --- a/kernel/kexec.c
>> +++ b/kernel/kexec.c
>> @@ -84,6 +84,8 @@ struct resource crashk_low_res = {
>>  
>>  int kexec_should_crash(struct task_struct *p)
>>  {
>> +if (crash_kexec_post_notifiers)
>> +return 0;
>>  if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
>>  return 1;
> 
> What if these two conditions !p->pid || is_global_init(p) are satisfied?
> Seems the behavious is changed.
> 

Please further follow do_exit() path. For each condition, there are
the corresponding panic() calls. In summary:

  oops_end
1) panic() for in_interrupt()
2) panic() for panic_on_oops
do_exit
  3) panic() for !p->pid (idle task)
  exit_notify
forget_original_parent
  find_child_reaper
4) panic() for p->pid == 1 (init task)

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-02-08 Thread HATAYAMA Daisuke
Hello,

From: Baoquan He b...@redhat.com
Subject: Re: [PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers 
option issue in oops path
Date: Mon, 9 Feb 2015 10:40:30 +0800

 On 02/05/15 at 05:59pm, HATAYAMA Daisuke wrote:
 The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
 crash_kexec_post_notifiers kernel boot option, which toggles
 wheather panic() calls crash_kexec() before or after panic_notifiers
 and dump kmsg.
 
 The problem is that the commit overlooks panic_on_oops kernel boot
 option. If it is enabled, crash_kexec() is called directly without
 going through panic() in oops path.
 
 To fix this issue, this patch adds a check to
 crash_kexec_post_notifiers in the condition of kexec_should_crash().
 
 Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
 ---
  include/linux/kernel.h | 3 +++
  kernel/kexec.c | 2 ++
  kernel/panic.c | 2 +-
  3 files changed, 6 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/kernel.h b/include/linux/kernel.h
 index 64ce58b..f47379f 100644
 --- a/include/linux/kernel.h
 +++ b/include/linux/kernel.h
 @@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
  extern int panic_on_io_nmi;
  extern int panic_on_warn;
  extern int sysctl_panic_on_stackoverflow;
 +
 +extern bool crash_kexec_post_notifiers;
 +
  /*
   * Only to be used by arch init code. If the user over-wrote the default
   * CONFIG_PANIC_TIMEOUT, honor it.
 diff --git a/kernel/kexec.c b/kernel/kexec.c
 index 9a8a01a..0ecf252 100644
 --- a/kernel/kexec.c
 +++ b/kernel/kexec.c
 @@ -84,6 +84,8 @@ struct resource crashk_low_res = {
  
  int kexec_should_crash(struct task_struct *p)
  {
 +if (crash_kexec_post_notifiers)
 +return 0;
  if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops)
  return 1;
 
 What if these two conditions !p-pid || is_global_init(p) are satisfied?
 Seems the behavious is changed.
 

Please further follow do_exit() path. For each condition, there are
the corresponding panic() calls. In summary:

  oops_end
1) panic() for in_interrupt()
2) panic() for panic_on_oops
do_exit
  3) panic() for !p-pid (idle task)
  exit_notify
forget_original_parent
  find_child_reaper
4) panic() for p-pid == 1 (init task)

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path

2015-02-05 Thread HATAYAMA Daisuke
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
"crash_kexec_post_notifiers" kernel boot option, which toggles
wheather panic() calls crash_kexec() before or after panic_notifiers
and dump kmsg.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
"crash_kexec_post_notifiers" in the condition of kexec_should_crash().

Signed-off-by: HATAYAMA Daisuke 
---
 include/linux/kernel.h | 3 +++
 kernel/kexec.c | 2 ++
 kernel/panic.c | 2 +-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 64ce58b..f47379f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 9a8a01a..0ecf252 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,8 @@ struct resource crashk_low_res = {
 
 int kexec_should_crash(struct task_struct *p)
 {
+   if (crash_kexec_post_notifiers)
+   return 0;
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..6582546 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
-- 
1.9.3


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] kernel/panic/kexec: fix crash_kexec_post_notifiers option issue in oops path

2015-02-05 Thread HATAYAMA Daisuke
The commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 introduced
crash_kexec_post_notifiers kernel boot option, which toggles
wheather panic() calls crash_kexec() before or after panic_notifiers
and dump kmsg.

The problem is that the commit overlooks panic_on_oops kernel boot
option. If it is enabled, crash_kexec() is called directly without
going through panic() in oops path.

To fix this issue, this patch adds a check to
crash_kexec_post_notifiers in the condition of kexec_should_crash().

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
 include/linux/kernel.h | 3 +++
 kernel/kexec.c | 2 ++
 kernel/panic.c | 2 +-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 64ce58b..f47379f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -426,6 +426,9 @@ extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 9a8a01a..0ecf252 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -84,6 +84,8 @@ struct resource crashk_low_res = {
 
 int kexec_should_crash(struct task_struct *p)
 {
+   if (crash_kexec_post_notifiers)
+   return 0;
if (in_interrupt() || !p-pid || is_global_init(p) || panic_on_oops)
return 1;
return 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..6582546 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -32,7 +32,7 @@ static unsigned long tainted_mask;
 static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
-static bool crash_kexec_post_notifiers;
+bool crash_kexec_post_notifiers;
 int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
-- 
1.9.3


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-16 Thread HATAYAMA Daisuke
From: Petr Tesarik 
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Fri, 14 Nov 2014 13:36:10 +0100

> On Fri, 14 Nov 2014 18:54:23 +0900 (JST)
> HATAYAMA Daisuke  wrote:
> 
>> From: Petr Tesarik 
>> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
>> VMCOREINFO
>> Date: Fri, 14 Nov 2014 09:31:45 +0100
>> 
>> > On Fri, 14 Nov 2014 10:42:35 +0900 (JST)
>> > HATAYAMA Daisuke  wrote:
>> > 
>> >> From: Petr Tesarik 
>> >> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
>> >> VMCOREINFO
>> >> Date: Thu, 13 Nov 2014 15:48:10 +0100
>> >> 
>> >> > On Thu, 13 Nov 2014 09:25:48 -0500
>> >> > Vivek Goyal  wrote:
>> >> > 
>> >> >> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
>> >> >> > 
>> >> >> > (2014/11/13 17:06), Petr Tesarik wrote:
>> >> >> > >On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
>> >> >> > >HATAYAMA Daisuke  wrote:
>> >> >> > >
>> >> >> > >>From: Vivek Goyal 
>> >> >> > >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base 
>> >> >> > >>in VMCOREINFO
>> >> >> > >>Date: Wed, 12 Nov 2014 17:12:05 -0500
>> >> >> > >>
>> >> >> > >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
>> >> >> > >>>>Currently, VMCOREINFO note information reports the virtual 
>> >> >> > >>>>address of
>> >> >> > >>>>phys_base that is assigned to symbol phys_base. But this doesn't 
>> >> >> > >>>>make
>> >> >> > >>>>sense because to refer to value of the phys_base, it's necessary 
>> >> >> > >>>>to
>> >> >> > >>>>get the value of phys_base itself we are now about to refer to.
>> >> >> > >>>>
>> >> >> > >>>
>> >> >> > >>>Hi Hatayama,
>> >> >> > >>>
>> >> >> > >>>/proc/vmcore ELF headers have virtual address information and 
>> >> >> > >>>using
>> >> >> > >>>that you should be able to read actual value of phys_base. gdb 
>> >> >> > >>>deals
>> >> >> > >>>with virtual addresses all the time and can read value of any 
>> >> >> > >>>symbol
>> >> >> > >>>using those headers.
>> >> >> > >>>
>> >> >> > >>>So I am not sure what's the need for exporting actual value of
>> >> >> > >>>phys_base.
>> >> >> > >>>
>> >> >> > >>
>> >> >> > >>Sorry, my logic in the patch description was wrong. For 
>> >> >> > >>/proc/vmcore,
>> >> >> > >>there's enough information for makedumpdile to get phys_base. It's
>> >> >> > >>correct. The problem here is that other crash dump mechanisms that 
>> >> >> > >>run
>> >> >> > >>outside Linux kernel independently don't have information to get
>> >> >> > >>phys_base.
>> >> >> > >
>> >> >> > >Yes, but these mechanisms won't be able to read VMCOREINFO either, 
>> >> >> > >will
>> >> >> > >they?
>> >> >> > >
>> >> >> > 
>> >> >> > I don't intend such sophisticated function only by VMCOREINFO.
>> >> >> > Search vmcore for VMCOREINFO using strings + grep before opening it 
>> >> >> > by crash.
>> >> >> > I intend that only here.
>> >> >> 
>> >> >> I think this is very crude and not proper way to get to vmcoreinfo.
>> >> > 
>> >> > Same here. If VMCOREINFO must be locatable without communicating any
>> >> > information to the hypervisor, then I would rather go for something
>> >> > similar to what s390(x) folks do - a well-known location in physical
>> >> > memory that contains a pointer to a checksummed OS info struct

Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-16 Thread HATAYAMA Daisuke
From: Petr Tesarik ptesa...@suse.cz
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Fri, 14 Nov 2014 13:36:10 +0100

 On Fri, 14 Nov 2014 18:54:23 +0900 (JST)
 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
 
 From: Petr Tesarik ptesa...@suse.cz
 Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
 VMCOREINFO
 Date: Fri, 14 Nov 2014 09:31:45 +0100
 
  On Fri, 14 Nov 2014 10:42:35 +0900 (JST)
  HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
  
  From: Petr Tesarik ptesa...@suse.cz
  Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
  VMCOREINFO
  Date: Thu, 13 Nov 2014 15:48:10 +0100
  
   On Thu, 13 Nov 2014 09:25:48 -0500
   Vivek Goyal vgo...@redhat.com wrote:
   
   On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:

(2014/11/13 17:06), Petr Tesarik wrote:
On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:

From: Vivek Goyal vgo...@redhat.com
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base 
in VMCOREINFO
Date: Wed, 12 Nov 2014 17:12:05 -0500

On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
Currently, VMCOREINFO note information reports the virtual 
address of
phys_base that is assigned to symbol phys_base. But this doesn't 
make
sense because to refer to value of the phys_base, it's necessary 
to
get the value of phys_base itself we are now about to refer to.


Hi Hatayama,

/proc/vmcore ELF headers have virtual address information and 
using
that you should be able to read actual value of phys_base. gdb 
deals
with virtual addresses all the time and can read value of any 
symbol
using those headers.

So I am not sure what's the need for exporting actual value of
phys_base.


Sorry, my logic in the patch description was wrong. For 
/proc/vmcore,
there's enough information for makedumpdile to get phys_base. It's
correct. The problem here is that other crash dump mechanisms that 
run
outside Linux kernel independently don't have information to get
phys_base.

Yes, but these mechanisms won't be able to read VMCOREINFO either, 
will
they?


I don't intend such sophisticated function only by VMCOREINFO.
Search vmcore for VMCOREINFO using strings + grep before opening it 
by crash.
I intend that only here.
   
   I think this is very crude and not proper way to get to vmcoreinfo.
   
   Same here. If VMCOREINFO must be locatable without communicating any
   information to the hypervisor, then I would rather go for something
   similar to what s390(x) folks do - a well-known location in physical
   memory that contains a pointer to a checksummed OS info structure,
   which in turn contains the VMCOREINFO pointers.
   
   I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP.
   Or is that part of the current plan, Daisuke?
   
  
  It's useful if there is. I don't plan now. For now, the idea of this
  patch is enough for me.
  
  BTW, for the above idea, I suspect that if the location in the
  physical memory is unique, it cannot deal with the kdump 2nd kernel
  case.
  
  No, not at all. The low 640K are copied away to a pre-allocated area by
  kexec purgatory code on x86_64, so it's safe to overwrite any location
  in there. The copy is needed, because BIOS already uses some hardcoded
  addresses in that range. I think the Linux kernel may safely use part of
  PFN 0 starting at physical address 0x0500. This area was originally
  used by MS-DOS, so chances are high that no broken BIOS out there
  corrupts this part of RAM...
  
 
 In fact, I didn't consider in such deep way... I had forgot back up
 region at all. But it's hard to use the low 640K area. Then, it's hard
 to get phys_base of the kdump 1st kernel that is assumed to be saved
 in thw low 640K now. Because externally running mechanism can run
 after kdump 2nd kernel has booted up, crash utility needs to convert a
 read request to the low 640K area into the corresponding part of the
 pre-allocated area. See kdump_backup_region_init() in crash utility,
 which tries to find the pre-allocated area via ELF header, where
 symbol kexec_crash_image is read to find ELF header. This means we
 need phys_base to find the pre-allocated area.
 
 Wrong again, I'm afraid.
 
 So, first of all, an admin should make up your mind if you want to use
 kexec-based dumping, or stand-alone dumping. OK, you seem to address
 a corner case when s/he configures both. But in that case, the

It's a never corner case. We usually use both. There's difference in
data reliability between kdump and others in that kdump can do cleanup
in kernel logic level at the end of the kdump 1st kernel prior to
kdump 2nd kernel, and difference in dumping feature that there's
makedumpfile that can filter memory

Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-14 Thread HATAYAMA Daisuke
From: Petr Tesarik 
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Fri, 14 Nov 2014 09:31:45 +0100

> On Fri, 14 Nov 2014 10:42:35 +0900 (JST)
> HATAYAMA Daisuke  wrote:
> 
>> From: Petr Tesarik 
>> Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
>> VMCOREINFO
>> Date: Thu, 13 Nov 2014 15:48:10 +0100
>> 
>> > On Thu, 13 Nov 2014 09:25:48 -0500
>> > Vivek Goyal  wrote:
>> > 
>> >> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
>> >> > 
>> >> > (2014/11/13 17:06), Petr Tesarik wrote:
>> >> > >On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
>> >> > >HATAYAMA Daisuke  wrote:
>> >> > >
>> >> > >>From: Vivek Goyal 
>> >> > >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
>> >> > >>VMCOREINFO
>> >> > >>Date: Wed, 12 Nov 2014 17:12:05 -0500
>> >> > >>
>> >> > >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
>> >> > >>>>Currently, VMCOREINFO note information reports the virtual address 
>> >> > >>>>of
>> >> > >>>>phys_base that is assigned to symbol phys_base. But this doesn't 
>> >> > >>>>make
>> >> > >>>>sense because to refer to value of the phys_base, it's necessary to
>> >> > >>>>get the value of phys_base itself we are now about to refer to.
>> >> > >>>>
>> >> > >>>
>> >> > >>>Hi Hatayama,
>> >> > >>>
>> >> > >>>/proc/vmcore ELF headers have virtual address information and using
>> >> > >>>that you should be able to read actual value of phys_base. gdb deals
>> >> > >>>with virtual addresses all the time and can read value of any symbol
>> >> > >>>using those headers.
>> >> > >>>
>> >> > >>>So I am not sure what's the need for exporting actual value of
>> >> > >>>phys_base.
>> >> > >>>
>> >> > >>
>> >> > >>Sorry, my logic in the patch description was wrong. For /proc/vmcore,
>> >> > >>there's enough information for makedumpdile to get phys_base. It's
>> >> > >>correct. The problem here is that other crash dump mechanisms that run
>> >> > >>outside Linux kernel independently don't have information to get
>> >> > >>phys_base.
>> >> > >
>> >> > >Yes, but these mechanisms won't be able to read VMCOREINFO either, will
>> >> > >they?
>> >> > >
>> >> > 
>> >> > I don't intend such sophisticated function only by VMCOREINFO.
>> >> > Search vmcore for VMCOREINFO using strings + grep before opening it by 
>> >> > crash.
>> >> > I intend that only here.
>> >> 
>> >> I think this is very crude and not proper way to get to vmcoreinfo.
>> > 
>> > Same here. If VMCOREINFO must be locatable without communicating any
>> > information to the hypervisor, then I would rather go for something
>> > similar to what s390(x) folks do - a well-known location in physical
>> > memory that contains a pointer to a checksummed OS info structure,
>> > which in turn contains the VMCOREINFO pointers.
>> > 
>> > I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP.
>> > Or is that part of the current plan, Daisuke?
>> > 
>> 
>> It's useful if there is. I don't plan now. For now, the idea of this
>> patch is enough for me.
>> 
>> BTW, for the above idea, I suspect that if the location in the
>> physical memory is unique, it cannot deal with the kdump 2nd kernel
>> case.
> 
> No, not at all. The low 640K are copied away to a pre-allocated area by
> kexec purgatory code on x86_64, so it's safe to overwrite any location
> in there. The copy is needed, because BIOS already uses some hardcoded
> addresses in that range. I think the Linux kernel may safely use part of
> PFN 0 starting at physical address 0x0500. This area was originally
> used by MS-DOS, so chances are high that no broken BIOS out there
> corrupts this part of RAM...
> 

In fact, I didn't consider in such deep way... I had forgot back up
region at all. But it's hard to use the low 640K area. Then, it's hard
to get phys_base of the kdump 1st kernel that is assumed to be saved
in thw low 640K now. Because externally running mechanism can run
after kdump 2nd kernel has booted up, crash utility needs to convert a
read request to the low 640K area into the corresponding part of the
pre-allocated area. See kdump_backup_region_init() in crash utility,
which tries to find the pre-allocated area via ELF header, where
symbol kexec_crash_image is read to find ELF header. This means we
need phys_base to find the pre-allocated area.

> Anyway, I'm not going to implement it right now for lack of time. I'm
> adding it to my TODO list, but if anybody wants to post a patch, I
> won't be offended.
> 
> Petr T

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-14 Thread HATAYAMA Daisuke
From: Petr Tesarik ptesa...@suse.cz
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Fri, 14 Nov 2014 09:31:45 +0100

 On Fri, 14 Nov 2014 10:42:35 +0900 (JST)
 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
 
 From: Petr Tesarik ptesa...@suse.cz
 Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
 VMCOREINFO
 Date: Thu, 13 Nov 2014 15:48:10 +0100
 
  On Thu, 13 Nov 2014 09:25:48 -0500
  Vivek Goyal vgo...@redhat.com wrote:
  
  On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
   
   (2014/11/13 17:06), Petr Tesarik wrote:
   On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
   HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
   
   From: Vivek Goyal vgo...@redhat.com
   Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
   VMCOREINFO
   Date: Wed, 12 Nov 2014 17:12:05 -0500
   
   On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
   Currently, VMCOREINFO note information reports the virtual address 
   of
   phys_base that is assigned to symbol phys_base. But this doesn't 
   make
   sense because to refer to value of the phys_base, it's necessary to
   get the value of phys_base itself we are now about to refer to.
   
   
   Hi Hatayama,
   
   /proc/vmcore ELF headers have virtual address information and using
   that you should be able to read actual value of phys_base. gdb deals
   with virtual addresses all the time and can read value of any symbol
   using those headers.
   
   So I am not sure what's the need for exporting actual value of
   phys_base.
   
   
   Sorry, my logic in the patch description was wrong. For /proc/vmcore,
   there's enough information for makedumpdile to get phys_base. It's
   correct. The problem here is that other crash dump mechanisms that run
   outside Linux kernel independently don't have information to get
   phys_base.
   
   Yes, but these mechanisms won't be able to read VMCOREINFO either, will
   they?
   
   
   I don't intend such sophisticated function only by VMCOREINFO.
   Search vmcore for VMCOREINFO using strings + grep before opening it by 
   crash.
   I intend that only here.
  
  I think this is very crude and not proper way to get to vmcoreinfo.
  
  Same here. If VMCOREINFO must be locatable without communicating any
  information to the hypervisor, then I would rather go for something
  similar to what s390(x) folks do - a well-known location in physical
  memory that contains a pointer to a checksummed OS info structure,
  which in turn contains the VMCOREINFO pointers.
  
  I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP.
  Or is that part of the current plan, Daisuke?
  
 
 It's useful if there is. I don't plan now. For now, the idea of this
 patch is enough for me.
 
 BTW, for the above idea, I suspect that if the location in the
 physical memory is unique, it cannot deal with the kdump 2nd kernel
 case.
 
 No, not at all. The low 640K are copied away to a pre-allocated area by
 kexec purgatory code on x86_64, so it's safe to overwrite any location
 in there. The copy is needed, because BIOS already uses some hardcoded
 addresses in that range. I think the Linux kernel may safely use part of
 PFN 0 starting at physical address 0x0500. This area was originally
 used by MS-DOS, so chances are high that no broken BIOS out there
 corrupts this part of RAM...
 

In fact, I didn't consider in such deep way... I had forgot back up
region at all. But it's hard to use the low 640K area. Then, it's hard
to get phys_base of the kdump 1st kernel that is assumed to be saved
in thw low 640K now. Because externally running mechanism can run
after kdump 2nd kernel has booted up, crash utility needs to convert a
read request to the low 640K area into the corresponding part of the
pre-allocated area. See kdump_backup_region_init() in crash utility,
which tries to find the pre-allocated area via ELF header, where
symbol kexec_crash_image is read to find ELF header. This means we
need phys_base to find the pre-allocated area.

 Anyway, I'm not going to implement it right now for lack of time. I'm
 adding it to my TODO list, but if anybody wants to post a patch, I
 won't be offended.
 
 Petr T

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-13 Thread HATAYAMA Daisuke
From: Petr Tesarik 
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Thu, 13 Nov 2014 15:48:10 +0100

> On Thu, 13 Nov 2014 09:25:48 -0500
> Vivek Goyal  wrote:
> 
>> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
>> > 
>> > (2014/11/13 17:06), Petr Tesarik wrote:
>> > >On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
>> > >HATAYAMA Daisuke  wrote:
>> > >
>> > >>From: Vivek Goyal 
>> > >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
>> > >>VMCOREINFO
>> > >>Date: Wed, 12 Nov 2014 17:12:05 -0500
>> > >>
>> > >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
>> > >>>>Currently, VMCOREINFO note information reports the virtual address of
>> > >>>>phys_base that is assigned to symbol phys_base. But this doesn't make
>> > >>>>sense because to refer to value of the phys_base, it's necessary to
>> > >>>>get the value of phys_base itself we are now about to refer to.
>> > >>>>
>> > >>>
>> > >>>Hi Hatayama,
>> > >>>
>> > >>>/proc/vmcore ELF headers have virtual address information and using
>> > >>>that you should be able to read actual value of phys_base. gdb deals
>> > >>>with virtual addresses all the time and can read value of any symbol
>> > >>>using those headers.
>> > >>>
>> > >>>So I am not sure what's the need for exporting actual value of
>> > >>>phys_base.
>> > >>>
>> > >>
>> > >>Sorry, my logic in the patch description was wrong. For /proc/vmcore,
>> > >>there's enough information for makedumpdile to get phys_base. It's
>> > >>correct. The problem here is that other crash dump mechanisms that run
>> > >>outside Linux kernel independently don't have information to get
>> > >>phys_base.
>> > >
>> > >Yes, but these mechanisms won't be able to read VMCOREINFO either, will
>> > >they?
>> > >
>> > 
>> > I don't intend such sophisticated function only by VMCOREINFO.
>> > Search vmcore for VMCOREINFO using strings + grep before opening it by 
>> > crash.
>> > I intend that only here.
>> 
>> I think this is very crude and not proper way to get to vmcoreinfo.
> 
> Same here. If VMCOREINFO must be locatable without communicating any
> information to the hypervisor, then I would rather go for something
> similar to what s390(x) folks do - a well-known location in physical
> memory that contains a pointer to a checksummed OS info structure,
> which in turn contains the VMCOREINFO pointers.
> 
> I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP.
> Or is that part of the current plan, Daisuke?
> 

It's useful if there is. I don't plan now. For now, the idea of this
patch is enough for me.

BTW, for the above idea, I suspect that if the location in the
physical memory is unique, it cannot deal with the kdump 2nd kernel
case. I think it better for the idea to be able to represent multiple
kernel information.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-13 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Thu, 13 Nov 2014 09:25:48 -0500

> On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
>> 
>> 
>> (2014/11/13 17:06), Petr Tesarik wrote:
>> >On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
>> >HATAYAMA Daisuke  wrote:
>> >
>> >>From: Vivek Goyal 
>> >>Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
>> >>VMCOREINFO
>> >>Date: Wed, 12 Nov 2014 17:12:05 -0500
>> >>
>> >>>On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
>> >>>>Currently, VMCOREINFO note information reports the virtual address of
>> >>>>phys_base that is assigned to symbol phys_base. But this doesn't make
>> >>>>sense because to refer to value of the phys_base, it's necessary to
>> >>>>get the value of phys_base itself we are now about to refer to.
>> >>>>
>> >>>
>> >>>Hi Hatayama,
>> >>>
>> >>>/proc/vmcore ELF headers have virtual address information and using
>> >>>that you should be able to read actual value of phys_base. gdb deals
>> >>>with virtual addresses all the time and can read value of any symbol
>> >>>using those headers.
>> >>>
>> >>>So I am not sure what's the need for exporting actual value of
>> >>>phys_base.
>> >>>
>> >>
>> >>Sorry, my logic in the patch description was wrong. For /proc/vmcore,
>> >>there's enough information for makedumpdile to get phys_base. It's
>> >>correct. The problem here is that other crash dump mechanisms that run
>> >>outside Linux kernel independently don't have information to get
>> >>phys_base.
>> >
>> >Yes, but these mechanisms won't be able to read VMCOREINFO either, will
>> >they?
>> >
>> 
>> I don't intend such sophisticated function only by VMCOREINFO.
>> Search vmcore for VMCOREINFO using strings + grep before opening it by crash.
>> I intend that only here.
> 
> I think this is very crude and not proper way to get to vmcoreinfo. Can

I agree it's crude, but it's useful enough for my usecase.

> you give more context. What are those mechanisms and what are you trying
> to do.
> 

I after all write the same thing in the patch description... I mean
qemu dump, xendump (and other hypervisor dumps), firmware dumps
implemented on each vendor system for the crash dump mechanism.

But there's no prepared information now. So, crash utility has made
effort to calculate phys_base from the existing information. One
example is to check if how far linux_banner string is apart from where
it actually is in vmcore, and another is to walk page tables via CR4
register if it is included in note information of the crash dump
mechanism's environment.

There's no gurantee that these always work well. For example, because
these crash dump mechanisms run independently of Linux kernel, it
could be when CPU moves to some firmware code. Then, CR4 could point
at differnet page table.

Looking at VMCOREINFO, there's already phys_base information but it's
symbol value only and it doesn't make sense. So it should be
corrected. I think this is natural.

Also, I think it important to keep externally running crash dump
mechansms as independent as possible. If they depend on linux kernel,
it could reduce robustoness of their functionality. In case of
phys_base, if we added a mechanism to get phys_base, it would mean
that the crash dump mechisms doesn't work well until they got
phys_base. For robustness, I think it best to make Linux kernel put
necessary information anywhere and to make users to use them
independently.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-13 Thread HATAYAMA, Daisuke



(2014/11/13 17:06), Petr Tesarik wrote:

On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
HATAYAMA Daisuke  wrote:


From: Vivek Goyal 
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Wed, 12 Nov 2014 17:12:05 -0500


On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:

Currently, VMCOREINFO note information reports the virtual address of
phys_base that is assigned to symbol phys_base. But this doesn't make
sense because to refer to value of the phys_base, it's necessary to
get the value of phys_base itself we are now about to refer to.



Hi Hatayama,

/proc/vmcore ELF headers have virtual address information and using
that you should be able to read actual value of phys_base. gdb deals
with virtual addresses all the time and can read value of any symbol
using those headers.

So I am not sure what's the need for exporting actual value of
phys_base.



Sorry, my logic in the patch description was wrong. For /proc/vmcore,
there's enough information for makedumpdile to get phys_base. It's
correct. The problem here is that other crash dump mechanisms that run
outside Linux kernel independently don't have information to get
phys_base.


Yes, but these mechanisms won't be able to read VMCOREINFO either, will
they?



I don't intend such sophisticated function only by VMCOREINFO.
Search vmcore for VMCOREINFO using strings + grep before opening it by crash.
I intend that only here.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-13 Thread HATAYAMA, Daisuke



(2014/11/13 17:06), Petr Tesarik wrote:

On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:


From: Vivek Goyal vgo...@redhat.com
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Wed, 12 Nov 2014 17:12:05 -0500


On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:

Currently, VMCOREINFO note information reports the virtual address of
phys_base that is assigned to symbol phys_base. But this doesn't make
sense because to refer to value of the phys_base, it's necessary to
get the value of phys_base itself we are now about to refer to.



Hi Hatayama,

/proc/vmcore ELF headers have virtual address information and using
that you should be able to read actual value of phys_base. gdb deals
with virtual addresses all the time and can read value of any symbol
using those headers.

So I am not sure what's the need for exporting actual value of
phys_base.



Sorry, my logic in the patch description was wrong. For /proc/vmcore,
there's enough information for makedumpdile to get phys_base. It's
correct. The problem here is that other crash dump mechanisms that run
outside Linux kernel independently don't have information to get
phys_base.


Yes, but these mechanisms won't be able to read VMCOREINFO either, will
they?



I don't intend such sophisticated function only by VMCOREINFO.
Search vmcore for VMCOREINFO using strings + grep before opening it by crash.
I intend that only here.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-13 Thread HATAYAMA Daisuke
From: Vivek Goyal vgo...@redhat.com
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Thu, 13 Nov 2014 09:25:48 -0500

 On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
 
 
 (2014/11/13 17:06), Petr Tesarik wrote:
 On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
 
 From: Vivek Goyal vgo...@redhat.com
 Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
 VMCOREINFO
 Date: Wed, 12 Nov 2014 17:12:05 -0500
 
 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
 Currently, VMCOREINFO note information reports the virtual address of
 phys_base that is assigned to symbol phys_base. But this doesn't make
 sense because to refer to value of the phys_base, it's necessary to
 get the value of phys_base itself we are now about to refer to.
 
 
 Hi Hatayama,
 
 /proc/vmcore ELF headers have virtual address information and using
 that you should be able to read actual value of phys_base. gdb deals
 with virtual addresses all the time and can read value of any symbol
 using those headers.
 
 So I am not sure what's the need for exporting actual value of
 phys_base.
 
 
 Sorry, my logic in the patch description was wrong. For /proc/vmcore,
 there's enough information for makedumpdile to get phys_base. It's
 correct. The problem here is that other crash dump mechanisms that run
 outside Linux kernel independently don't have information to get
 phys_base.
 
 Yes, but these mechanisms won't be able to read VMCOREINFO either, will
 they?
 
 
 I don't intend such sophisticated function only by VMCOREINFO.
 Search vmcore for VMCOREINFO using strings + grep before opening it by crash.
 I intend that only here.
 
 I think this is very crude and not proper way to get to vmcoreinfo. Can

I agree it's crude, but it's useful enough for my usecase.

 you give more context. What are those mechanisms and what are you trying
 to do.
 

I after all write the same thing in the patch description... I mean
qemu dump, xendump (and other hypervisor dumps), firmware dumps
implemented on each vendor system for the crash dump mechanism.

But there's no prepared information now. So, crash utility has made
effort to calculate phys_base from the existing information. One
example is to check if how far linux_banner string is apart from where
it actually is in vmcore, and another is to walk page tables via CR4
register if it is included in note information of the crash dump
mechanism's environment.

There's no gurantee that these always work well. For example, because
these crash dump mechanisms run independently of Linux kernel, it
could be when CPU moves to some firmware code. Then, CR4 could point
at differnet page table.

Looking at VMCOREINFO, there's already phys_base information but it's
symbol value only and it doesn't make sense. So it should be
corrected. I think this is natural.

Also, I think it important to keep externally running crash dump
mechansms as independent as possible. If they depend on linux kernel,
it could reduce robustoness of their functionality. In case of
phys_base, if we added a mechanism to get phys_base, it would mean
that the crash dump mechisms doesn't work well until they got
phys_base. For robustness, I think it best to make Linux kernel put
necessary information anywhere and to make users to use them
independently.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-13 Thread HATAYAMA Daisuke
From: Petr Tesarik ptesa...@suse.cz
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Thu, 13 Nov 2014 15:48:10 +0100

 On Thu, 13 Nov 2014 09:25:48 -0500
 Vivek Goyal vgo...@redhat.com wrote:
 
 On Thu, Nov 13, 2014 at 05:30:21PM +0900, HATAYAMA, Daisuke wrote:
  
  (2014/11/13 17:06), Petr Tesarik wrote:
  On Thu, 13 Nov 2014 09:17:09 +0900 (JST)
  HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
  
  From: Vivek Goyal vgo...@redhat.com
  Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in 
  VMCOREINFO
  Date: Wed, 12 Nov 2014 17:12:05 -0500
  
  On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
  Currently, VMCOREINFO note information reports the virtual address of
  phys_base that is assigned to symbol phys_base. But this doesn't make
  sense because to refer to value of the phys_base, it's necessary to
  get the value of phys_base itself we are now about to refer to.
  
  
  Hi Hatayama,
  
  /proc/vmcore ELF headers have virtual address information and using
  that you should be able to read actual value of phys_base. gdb deals
  with virtual addresses all the time and can read value of any symbol
  using those headers.
  
  So I am not sure what's the need for exporting actual value of
  phys_base.
  
  
  Sorry, my logic in the patch description was wrong. For /proc/vmcore,
  there's enough information for makedumpdile to get phys_base. It's
  correct. The problem here is that other crash dump mechanisms that run
  outside Linux kernel independently don't have information to get
  phys_base.
  
  Yes, but these mechanisms won't be able to read VMCOREINFO either, will
  they?
  
  
  I don't intend such sophisticated function only by VMCOREINFO.
  Search vmcore for VMCOREINFO using strings + grep before opening it by 
  crash.
  I intend that only here.
 
 I think this is very crude and not proper way to get to vmcoreinfo.
 
 Same here. If VMCOREINFO must be locatable without communicating any
 information to the hypervisor, then I would rather go for something
 similar to what s390(x) folks do - a well-known location in physical
 memory that contains a pointer to a checksummed OS info structure,
 which in turn contains the VMCOREINFO pointers.
 
 I'm a bit surprised such mechanism is not needed by Fujitsu SADUMP.
 Or is that part of the current plan, Daisuke?
 

It's useful if there is. I don't plan now. For now, the idea of this
patch is enough for me.

BTW, for the above idea, I suspect that if the location in the
physical memory is unique, it cannot deal with the kdump 2nd kernel
case. I think it better for the idea to be able to represent multiple
kernel information.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] kdump, vmcoreinfo: report actual value of phys_base

2014-11-12 Thread HATAYAMA Daisuke
Currently, VMCOREINFO note information reports the virtual address of
phys_base that is assigned to symbol phys_base. But this doesn't make
sense because to refer to phys_base, it's necessary to get the value
of phys_base itself we are now about to refer to.

Userland tools related to kdump such as makedumpfile and crash utility
so far have made some efforts to calculate phys_base on crash dump
formats generated by mechanisms running outside Linux kernel, such as
virtual machine hypervisor such as qemu dump, which ordinary users use
via virsh dump, or ones implemented on vendor specific firmware.

That is, find a kernel data whose virtual and physical addresses are
available via its note information and calculate phys_base from
it. However, such data structure is not the one prepared for phys_base
purpose. There's no guarantee that other crash dump mechanisms include
such information that can be used to calculate phys_base similarly.

To get VMCOREINFO in vmcore, it's easy to use strings and grep
commands like this; VMCOREINFO consists of simple string:

$ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100
VMCOREINFO
OSRELEASE=3.10.0-121.el7.x86_64
PAGESIZE=4096
...

This is also useful to get value of phys_base in kdump 2nd kernel
contained in vmcore using the above-mentioned external crash dump
mechanism; kdump 2nd kernel is an inherently relocated kernel.

This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because
makedumpfile refers to it and if removing it, old versions
makedumpfile doesn't work well.

ChangeLog:
v2:
- Introduce VMCOREINFO_PHYS_BASE().
- Correct patch description: this work is primarily for mechanisms
  running outside (kdump 1st) Linux kernel.

Signed-off-by: HATAYAMA Daisuke 
---
 arch/x86/kernel/machine_kexec_64.c | 1 +
 include/linux/kexec.h  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 4859810..47cc835 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
  (unsigned long)&_text - __START_KERNEL);
+   VMCOREINFO_PHYS_BASE(phys_base);
 }
 
 /* arch-dependent functionality related to kexec file-based syscall */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 9d957b7..bee3c5b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -258,6 +258,8 @@ unsigned long paddr_vmcoreinfo_note(void);
vmcoreinfo_append_str("NUMBER(%s)=%ld\n", #name, (long)name)
 #define VMCOREINFO_CONFIG(name) \
vmcoreinfo_append_str("CONFIG_%s=y\n", #name)
+#define VMCOREINFO_PHYS_BASE(value) \
+   vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
-- 
1.9.3


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-12 Thread HATAYAMA Daisuke
From: Petr Tesarik 
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Wed, 12 Nov 2014 09:14:34 +0100

> On Wed, 12 Nov 2014 15:40:42 +0900 (JST)
> HATAYAMA Daisuke  wrote:
> 
>> Currently, VMCOREINFO note information reports the virtual address of
>> phys_base that is assigned to symbol phys_base. But this doesn't make
>> sense because to refer to value of the phys_base, it's necessary to
>> get the value of phys_base itself we are now about to refer to.
>> 
>> Userland tools related to kdump such as makedumpfile and crash utility
>> so far have made some efforts to calculate phys_base from memory
>> mapping information on a variety of crash dump formats. But there's no
>> guarantee to keep maintaining it in the future.
>> 
>> This is also useful for crash dump mechanism running outside Linux
>> kernel such as virtual machine hypervisor such as qemu dump, which
>> ordinary users use via virsh dump, or ones implemented on vendor
>> specific firmware. They cannot get phys_base without special mechanism
>> because phys_base is kernel information.
>> 
>> To get VMCOREINFO in vmcore, it's easy to use strings and grep
>> commands like this; VMCOREINFO consists of simple string:
>> 
>> $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100
>> VMCOREINFO
>> OSRELEASE=3.10.0-121.el7.x86_64
>> PAGESIZE=4096
>> ...
>> 
>> Similarly, this is also useful to get value of phys_base in kdump 2nd
>> kernel contained in vmcore using the above-mentioned external crash
>> dump mechanism; kdump 2nd kernel is an inherently relocated kernel.
>> 
>> This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because
>> makedumpfile refers to it and if removing it, old versions
>> makedumpfile doesn't work well.
>> 
>> Signed-off-by: HATAYAMA Daisuke 
>> ---
>>  arch/x86/kernel/machine_kexec_64.c | 1 +
>>  1 file changed, 1 insertion(+)
>> 
>> diff --git a/arch/x86/kernel/machine_kexec_64.c 
>> b/arch/x86/kernel/machine_kexec_64.c
>> index 4859810..e6d00a4 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void)
>>  #endif
>>  vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
>>(unsigned long)&_text - __START_KERNEL);
>> +VMCOREINFO_LENGTH(phys_base, phys_base);
> 
> While I fully agree with the concept, I don't like the use of
> VMCOREINFO_LENGTH. LENGTH(symbol) has been used to store array length
> in VMCOREINFO.
> 
> OTOH there is currently no good syntax for storing a value (short of
> VMCOREINFO_NUMBER, but that one is signed). I think it would be best to
> extend the VMCOREINFO syntax for storing variables, preferably in
> hexadecimal, e.g.:
> 
> #define VMCOREINFO_VALUE(name, value) \
>   vmcoreinfo_append_str("VALUE(%s)=%lx\n", #name, (unsigned long) value)
> 
> This interface is somewhat suboptimal, because it can only store a
> single long value. So, maybe we should dump the complete variable in
> hex, or something similar...
> 
> Anyone has a better idea?
> 
> Petr T

Just as you say, it's natural to write value of phys_base in
hexadecimal format.

For VMCOREINFO_VALUE, looking at the current helper macros, it seems
to me more natural to make it have more specific name:

#define VMCOREINFO_PHYS_BASE(value) \
vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long) value)

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-12 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Wed, 12 Nov 2014 17:12:05 -0500

> On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
>> Currently, VMCOREINFO note information reports the virtual address of
>> phys_base that is assigned to symbol phys_base. But this doesn't make
>> sense because to refer to value of the phys_base, it's necessary to
>> get the value of phys_base itself we are now about to refer to.
>> 
> 
> Hi Hatayama,
> 
> /proc/vmcore ELF headers have virtual address information and using
> that you should be able to read actual value of phys_base. gdb deals
> with virtual addresses all the time and can read value of any symbol
> using those headers.
> 
> So I am not sure what's the need for exporting actual value of
> phys_base.
> 

Sorry, my logic in the patch description was wrong. For /proc/vmcore,
there's enough information for makedumpdile to get phys_base. It's
correct. The problem here is that other crash dump mechanisms that run
outside Linux kernel independently don't have information to get
phys_base.

Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-12 Thread HATAYAMA Daisuke
From: Vivek Goyal vgo...@redhat.com
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Wed, 12 Nov 2014 17:12:05 -0500

 On Wed, Nov 12, 2014 at 03:40:42PM +0900, HATAYAMA Daisuke wrote:
 Currently, VMCOREINFO note information reports the virtual address of
 phys_base that is assigned to symbol phys_base. But this doesn't make
 sense because to refer to value of the phys_base, it's necessary to
 get the value of phys_base itself we are now about to refer to.
 
 
 Hi Hatayama,
 
 /proc/vmcore ELF headers have virtual address information and using
 that you should be able to read actual value of phys_base. gdb deals
 with virtual addresses all the time and can read value of any symbol
 using those headers.
 
 So I am not sure what's the need for exporting actual value of
 phys_base.
 

Sorry, my logic in the patch description was wrong. For /proc/vmcore,
there's enough information for makedumpdile to get phys_base. It's
correct. The problem here is that other crash dump mechanisms that run
outside Linux kernel independently don't have information to get
phys_base.

Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-12 Thread HATAYAMA Daisuke
From: Petr Tesarik ptesa...@suse.cz
Subject: Re: [PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO
Date: Wed, 12 Nov 2014 09:14:34 +0100

 On Wed, 12 Nov 2014 15:40:42 +0900 (JST)
 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
 
 Currently, VMCOREINFO note information reports the virtual address of
 phys_base that is assigned to symbol phys_base. But this doesn't make
 sense because to refer to value of the phys_base, it's necessary to
 get the value of phys_base itself we are now about to refer to.
 
 Userland tools related to kdump such as makedumpfile and crash utility
 so far have made some efforts to calculate phys_base from memory
 mapping information on a variety of crash dump formats. But there's no
 guarantee to keep maintaining it in the future.
 
 This is also useful for crash dump mechanism running outside Linux
 kernel such as virtual machine hypervisor such as qemu dump, which
 ordinary users use via virsh dump, or ones implemented on vendor
 specific firmware. They cannot get phys_base without special mechanism
 because phys_base is kernel information.
 
 To get VMCOREINFO in vmcore, it's easy to use strings and grep
 commands like this; VMCOREINFO consists of simple string:
 
 $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E .*VMCOREINFO.* -A 100
 VMCOREINFO
 OSRELEASE=3.10.0-121.el7.x86_64
 PAGESIZE=4096
 ...
 
 Similarly, this is also useful to get value of phys_base in kdump 2nd
 kernel contained in vmcore using the above-mentioned external crash
 dump mechanism; kdump 2nd kernel is an inherently relocated kernel.
 
 This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because
 makedumpfile refers to it and if removing it, old versions
 makedumpfile doesn't work well.
 
 Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
 ---
  arch/x86/kernel/machine_kexec_64.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/arch/x86/kernel/machine_kexec_64.c 
 b/arch/x86/kernel/machine_kexec_64.c
 index 4859810..e6d00a4 100644
 --- a/arch/x86/kernel/machine_kexec_64.c
 +++ b/arch/x86/kernel/machine_kexec_64.c
 @@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void)
  #endif
  vmcoreinfo_append_str(KERNELOFFSET=%lx\n,
(unsigned long)_text - __START_KERNEL);
 +VMCOREINFO_LENGTH(phys_base, phys_base);
 
 While I fully agree with the concept, I don't like the use of
 VMCOREINFO_LENGTH. LENGTH(symbol) has been used to store array length
 in VMCOREINFO.
 
 OTOH there is currently no good syntax for storing a value (short of
 VMCOREINFO_NUMBER, but that one is signed). I think it would be best to
 extend the VMCOREINFO syntax for storing variables, preferably in
 hexadecimal, e.g.:
 
 #define VMCOREINFO_VALUE(name, value) \
   vmcoreinfo_append_str(VALUE(%s)=%lx\n, #name, (unsigned long) value)
 
 This interface is somewhat suboptimal, because it can only store a
 single long value. So, maybe we should dump the complete variable in
 hex, or something similar...
 
 Anyone has a better idea?
 
 Petr T

Just as you say, it's natural to write value of phys_base in
hexadecimal format.

For VMCOREINFO_VALUE, looking at the current helper macros, it seems
to me more natural to make it have more specific name:

#define VMCOREINFO_PHYS_BASE(value) \
vmcoreinfo_append_str(PHYS_BASE=%lx\n, (unsigned long) value)

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] kdump, vmcoreinfo: report actual value of phys_base

2014-11-12 Thread HATAYAMA Daisuke
Currently, VMCOREINFO note information reports the virtual address of
phys_base that is assigned to symbol phys_base. But this doesn't make
sense because to refer to phys_base, it's necessary to get the value
of phys_base itself we are now about to refer to.

Userland tools related to kdump such as makedumpfile and crash utility
so far have made some efforts to calculate phys_base on crash dump
formats generated by mechanisms running outside Linux kernel, such as
virtual machine hypervisor such as qemu dump, which ordinary users use
via virsh dump, or ones implemented on vendor specific firmware.

That is, find a kernel data whose virtual and physical addresses are
available via its note information and calculate phys_base from
it. However, such data structure is not the one prepared for phys_base
purpose. There's no guarantee that other crash dump mechanisms include
such information that can be used to calculate phys_base similarly.

To get VMCOREINFO in vmcore, it's easy to use strings and grep
commands like this; VMCOREINFO consists of simple string:

$ strings vmcore-3.10.0-121.el7.x86_64 | grep -E .*VMCOREINFO.* -A 100
VMCOREINFO
OSRELEASE=3.10.0-121.el7.x86_64
PAGESIZE=4096
...

This is also useful to get value of phys_base in kdump 2nd kernel
contained in vmcore using the above-mentioned external crash dump
mechanism; kdump 2nd kernel is an inherently relocated kernel.

This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because
makedumpfile refers to it and if removing it, old versions
makedumpfile doesn't work well.

ChangeLog:
v2:
- Introduce VMCOREINFO_PHYS_BASE().
- Correct patch description: this work is primarily for mechanisms
  running outside (kdump 1st) Linux kernel.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
 arch/x86/kernel/machine_kexec_64.c | 1 +
 include/linux/kexec.h  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 4859810..47cc835 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
vmcoreinfo_append_str(KERNELOFFSET=%lx\n,
  (unsigned long)_text - __START_KERNEL);
+   VMCOREINFO_PHYS_BASE(phys_base);
 }
 
 /* arch-dependent functionality related to kexec file-based syscall */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 9d957b7..bee3c5b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -258,6 +258,8 @@ unsigned long paddr_vmcoreinfo_note(void);
vmcoreinfo_append_str(NUMBER(%s)=%ld\n, #name, (long)name)
 #define VMCOREINFO_CONFIG(name) \
vmcoreinfo_append_str(CONFIG_%s=y\n, #name)
+#define VMCOREINFO_PHYS_BASE(value) \
+   vmcoreinfo_append_str(PHYS_BASE=%lx\n, (unsigned long)value)
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
-- 
1.9.3


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-11 Thread HATAYAMA Daisuke
Currently, VMCOREINFO note information reports the virtual address of
phys_base that is assigned to symbol phys_base. But this doesn't make
sense because to refer to value of the phys_base, it's necessary to
get the value of phys_base itself we are now about to refer to.

Userland tools related to kdump such as makedumpfile and crash utility
so far have made some efforts to calculate phys_base from memory
mapping information on a variety of crash dump formats. But there's no
guarantee to keep maintaining it in the future.

This is also useful for crash dump mechanism running outside Linux
kernel such as virtual machine hypervisor such as qemu dump, which
ordinary users use via virsh dump, or ones implemented on vendor
specific firmware. They cannot get phys_base without special mechanism
because phys_base is kernel information.

To get VMCOREINFO in vmcore, it's easy to use strings and grep
commands like this; VMCOREINFO consists of simple string:

$ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100
VMCOREINFO
OSRELEASE=3.10.0-121.el7.x86_64
PAGESIZE=4096
...

Similarly, this is also useful to get value of phys_base in kdump 2nd
kernel contained in vmcore using the above-mentioned external crash
dump mechanism; kdump 2nd kernel is an inherently relocated kernel.

This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because
makedumpfile refers to it and if removing it, old versions
makedumpfile doesn't work well.

Signed-off-by: HATAYAMA Daisuke 
---
 arch/x86/kernel/machine_kexec_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 4859810..e6d00a4 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
  (unsigned long)&_text - __START_KERNEL);
+   VMCOREINFO_LENGTH(phys_base, phys_base);
 }
 
 /* arch-dependent functionality related to kexec file-based syscall */
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] kdump, x86: report actual value of phys_base in VMCOREINFO

2014-11-11 Thread HATAYAMA Daisuke
Currently, VMCOREINFO note information reports the virtual address of
phys_base that is assigned to symbol phys_base. But this doesn't make
sense because to refer to value of the phys_base, it's necessary to
get the value of phys_base itself we are now about to refer to.

Userland tools related to kdump such as makedumpfile and crash utility
so far have made some efforts to calculate phys_base from memory
mapping information on a variety of crash dump formats. But there's no
guarantee to keep maintaining it in the future.

This is also useful for crash dump mechanism running outside Linux
kernel such as virtual machine hypervisor such as qemu dump, which
ordinary users use via virsh dump, or ones implemented on vendor
specific firmware. They cannot get phys_base without special mechanism
because phys_base is kernel information.

To get VMCOREINFO in vmcore, it's easy to use strings and grep
commands like this; VMCOREINFO consists of simple string:

$ strings vmcore-3.10.0-121.el7.x86_64 | grep -E .*VMCOREINFO.* -A 100
VMCOREINFO
OSRELEASE=3.10.0-121.el7.x86_64
PAGESIZE=4096
...

Similarly, this is also useful to get value of phys_base in kdump 2nd
kernel contained in vmcore using the above-mentioned external crash
dump mechanism; kdump 2nd kernel is an inherently relocated kernel.

This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because
makedumpfile refers to it and if removing it, old versions
makedumpfile doesn't work well.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
 arch/x86/kernel/machine_kexec_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 4859810..e6d00a4 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
vmcoreinfo_append_str(KERNELOFFSET=%lx\n,
  (unsigned long)_text - __START_KERNEL);
+   VMCOREINFO_LENGTH(phys_base, phys_base);
 }
 
 /* arch-dependent functionality related to kexec file-based syscall */
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-14 Thread HATAYAMA, Daisuke
_start = pos + 1;
> + }
> + }
> + if (pos > pos_start) {
> + /* Remap the rest */
> + map_size = (pos - pos_start) << PAGE_SHIFT;
> + if (remap_oldmem_pfn_range(vma, from + len, pos_start,
> +map_size, prot))
> + goto fail;
> + len += map_size;
> + }
> + return 0;
> +fail:
> + do_munmap(vma->vm_mm, from, len);
> + return -EAGAIN;
> +}
> +
> +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
> + unsigned long from, unsigned long pfn,
> + unsigned long size, pgprot_t prot)
> +{
> + /*
> +  * Check if oldmem_pfn_is_ram was registered to avoid
> +  * looping over all pages without a reason.
> +  */
> + if (oldmem_pfn_is_ram)
> + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
> + else
> + return remap_oldmem_pfn_range(vma, from, pfn, size, prot);
> +}
> +
>   static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>   {
>   size_t size = vma->vm_end - vma->vm_start;
> @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct 
> vm_area_struct *vma)
>   
>   tsz = min_t(size_t, m->offset + m->size - start, size);
>   paddr = m->paddr + start - m->offset;
> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> -        paddr >> PAGE_SHIFT, tsz,
> -vma->vm_page_prot))
> + if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
> + paddr >> PAGE_SHIFT, tsz,
> + vma->vm_page_prot))
>   goto fail;
>   size -= tsz;
>   start += tsz;
> 

-- 
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-14 Thread HATAYAMA, Daisuke
:
 + do_munmap(vma-vm_mm, from, len);
 + return -EAGAIN;
 +}
 +
 +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
 + unsigned long from, unsigned long pfn,
 + unsigned long size, pgprot_t prot)
 +{
 + /*
 +  * Check if oldmem_pfn_is_ram was registered to avoid
 +  * looping over all pages without a reason.
 +  */
 + if (oldmem_pfn_is_ram)
 + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
 + else
 + return remap_oldmem_pfn_range(vma, from, pfn, size, prot);
 +}
 +
   static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
   {
   size_t size = vma-vm_end - vma-vm_start;
 @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct 
 vm_area_struct *vma)
   
   tsz = min_t(size_t, m-offset + m-size - start, size);
   paddr = m-paddr + start - m-offset;
 - if (remap_oldmem_pfn_range(vma, vma-vm_start + len,
 -paddr  PAGE_SHIFT, tsz,
 -vma-vm_page_prot))
 + if (vmcore_remap_oldmem_pfn(vma, vma-vm_start + len,
 + paddr  PAGE_SHIFT, tsz,
 + vma-vm_page_prot))
   goto fail;
   size -= tsz;
   start += tsz;
 

-- 
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-13 Thread Hatayama, Daisuke/畑山 大輔
range(vma, from + len, pos_start,
> +map_size, vma->vm_page_prot))

  Is prot correct here?

> + goto fail;
> + len += map_size;
> + }
> + return 0;
> +fail:
> + do_munmap(vma->vm_mm, from, len);
> + return -EAGAIN;
> +}
> +
> +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
> + unsigned long from, unsigned long pfn,
> + unsigned long size, pgprot_t prot)
> +{
> + /*
> +  * Check if oldmem_pfn_is_ram was registered to avoid
> +  * looping over all pages without a reason.
> +  */
> + if (oldmem_pfn_is_ram)
> + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
> + else
> + return remap_oldmem_pfn_range(vma, from, pfn, size, prot);
> +}
> +
>   static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>   {
>   size_t size = vma->vm_end - vma->vm_start;
> @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct 
> vm_area_struct *vma)
>   
>   tsz = min_t(size_t, m->offset + m->size - start, size);
>   paddr = m->paddr + start - m->offset;
> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> -paddr >> PAGE_SHIFT, tsz,
> -        vma->vm_page_prot))
> + if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
> + paddr >> PAGE_SHIFT, tsz,
> + vma->vm_page_prot))
>   goto fail;
>   size -= tsz;
>   start += tsz;
> 

It looks to me good other than the above two comments.

Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-13 Thread Hatayama, Daisuke/畑山 大輔
,
 + unsigned long from, unsigned long pfn,
 + unsigned long size, pgprot_t prot)
 +{
 + /*
 +  * Check if oldmem_pfn_is_ram was registered to avoid
 +  * looping over all pages without a reason.
 +  */
 + if (oldmem_pfn_is_ram)
 + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
 + else
 + return remap_oldmem_pfn_range(vma, from, pfn, size, prot);
 +}
 +
   static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
   {
   size_t size = vma-vm_end - vma-vm_start;
 @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct 
 vm_area_struct *vma)
   
   tsz = min_t(size_t, m-offset + m-size - start, size);
   paddr = m-paddr + start - m-offset;
 - if (remap_oldmem_pfn_range(vma, vma-vm_start + len,
 -paddr  PAGE_SHIFT, tsz,
 -vma-vm_page_prot))
 + if (vmcore_remap_oldmem_pfn(vma, vma-vm_start + len,
 + paddr  PAGE_SHIFT, tsz,
 + vma-vm_page_prot))
   goto fail;
   size -= tsz;
   start += tsz;
 

It looks to me good other than the above two comments.

Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:perf/urgent] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-07-02 Thread tip-bot for HATAYAMA Daisuke
Commit-ID:  b292d7a10487aee6e74b1c18b8d95b92f40d4a4f
Gitweb: http://git.kernel.org/tip/b292d7a10487aee6e74b1c18b8d95b92f40d4a4f
Author: HATAYAMA Daisuke 
AuthorDate: Wed, 25 Jun 2014 10:09:07 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 2 Jul 2014 08:35:55 +0200

perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

Currently, any NMI is falsely handled by a NMI handler of NMI watchdog
if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

For example, we use external NMI to make system panic to get crash
dump, but in this case, the external NMI is falsely handled do to the
issue.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.

As a result, if CondChgd bit is set, any NMI is falsely handled by the
NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
action is never performed until CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

  In 18.9.1:

  "The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to
   indicate changes to the state of performancmonitoring hardware"

  In Table 35-2 IA-32 Architectural MSRs

  63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.

Signed-off-by: HATAYAMA Daisuke 
Acked-by: Don Zickus 
Signed-off-by: Peter Zijlstra 
Cc: 
Cc: Arnaldo Carvalho de Melo 
Cc: Linus Torvalds 
Cc: linux-kernel@vger.kernel.org
Link: 
http://lkml.kernel.org/r/20140625.103503.409316067.d.hatay...@jp.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *))) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *))) {
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:perf/urgent] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-07-02 Thread tip-bot for HATAYAMA Daisuke
Commit-ID:  b292d7a10487aee6e74b1c18b8d95b92f40d4a4f
Gitweb: http://git.kernel.org/tip/b292d7a10487aee6e74b1c18b8d95b92f40d4a4f
Author: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
AuthorDate: Wed, 25 Jun 2014 10:09:07 +0900
Committer:  Ingo Molnar mi...@kernel.org
CommitDate: Wed, 2 Jul 2014 08:35:55 +0200

perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

Currently, any NMI is falsely handled by a NMI handler of NMI watchdog
if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

For example, we use external NMI to make system panic to get crash
dump, but in this case, the external NMI is falsely handled do to the
issue.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.

As a result, if CondChgd bit is set, any NMI is falsely handled by the
NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
action is never performed until CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

  In 18.9.1:

  The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to
   indicate changes to the state of performancmonitoring hardware

  In Table 35-2 IA-32 Architectural MSRs

  63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Acked-by: Don Zickus dzic...@redhat.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
Cc: sta...@vger.kernel.org
Cc: Arnaldo Carvalho de Melo a...@kernel.org
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Link: 
http://lkml.kernel.org/r/20140625.103503.409316067.d.hatay...@jp.fujitsu.com
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 arch/x86/kernel/cpu/perf_event_intel.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *)status)) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *)status)) {
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-07-01 Thread HATAYAMA Daisuke
From: Andi Kleen 
Subject: Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI 
handling
Date: Mon, 30 Jun 2014 15:22:24 -0700

>> 
>> I'm also interested in the behaviour of CondChgd bit on Ivy Bridge 
>> processors.
> 
> The intended meaning of CondChgd is that a hardware debugger has taken over 
> the
> PMU. It shouldn't really be set in other circumstances.
> 

Thanks for your explanation.

The hardware debugger you mean is a kind of ITP?

> I think right now for perf it would be best to just ignore it.
> 
> In theory could stop using the PMU, but if some BIOS set it it would
> completely disable perf there. So better to just ignore it.
> 

I'll reflect this information in the patch description.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-07-01 Thread HATAYAMA Daisuke
From: Andi Kleen a...@linux.intel.com
Subject: Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI 
handling
Date: Mon, 30 Jun 2014 15:22:24 -0700

 
 I'm also interested in the behaviour of CondChgd bit on Ivy Bridge 
 processors.
 
 The intended meaning of CondChgd is that a hardware debugger has taken over 
 the
 PMU. It shouldn't really be set in other circumstances.
 

Thanks for your explanation.

The hardware debugger you mean is a kind of ITP?

 I think right now for perf it would be best to just ignore it.
 
 In theory could stop using the PMU, but if some BIOS set it it would
 completely disable perf there. So better to just ignore it.
 

I'll reflect this information in the patch description.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-30 Thread HATAYAMA, Daisuke

Hello,

(2014/06/17 0:30), Don Zickus wrote:

On Fri, Jun 13, 2014 at 05:44:37PM +0900, HATAYAMA Daisuke wrote:

Currently, a NMI handler for NMI watchdog may falsely handle any NMI
signaled for different purpose if CondChgd bit in
MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's. As a result, if CondChgd bit is set, any NMI is
falsely handled by the NMI handler of NMI watchdog. Also, if type of
the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or
NMI_IO_CHECK, the corresponding action is never performed until
CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

   In 18.9.1:

   "The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
indicate changes to the state of performancmonitoring hardware"

   In Table 35-2 IA-32 Architectural MSRs

   63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.


As I said in a previous email, I ran into a similar problem and was going
to solve it by zeroing out all the registers on init (which probably would
have upset Peter :-) ).  This is a smaller solution and seems ok.  The
only downside is it is called in the nmi handler.


I am working with our customer to try and talk with Intel why this bit is
set to begin with.  Our customer says their BIOS doesn't use the PMU
during boot so it wasn't clear why this is now set on IVBs (though I don't
see them on Intel whitebox IVBs).



I'm also interested in the behaviour of CondChgd bit on Ivy Bridge processors.

Do you know something about this behaviour?

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-30 Thread HATAYAMA, Daisuke

Hello,

(2014/06/17 0:30), Don Zickus wrote:

On Fri, Jun 13, 2014 at 05:44:37PM +0900, HATAYAMA Daisuke wrote:

Currently, a NMI handler for NMI watchdog may falsely handle any NMI
signaled for different purpose if CondChgd bit in
MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's. As a result, if CondChgd bit is set, any NMI is
falsely handled by the NMI handler of NMI watchdog. Also, if type of
the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or
NMI_IO_CHECK, the corresponding action is never performed until
CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

   In 18.9.1:

   The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
indicate changes to the state of performancmonitoring hardware

   In Table 35-2 IA-32 Architectural MSRs

   63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.


As I said in a previous email, I ran into a similar problem and was going
to solve it by zeroing out all the registers on init (which probably would
have upset Peter :-) ).  This is a smaller solution and seems ok.  The
only downside is it is called in the nmi handler.


I am working with our customer to try and talk with Intel why this bit is
set to begin with.  Our customer says their BIOS doesn't use the PMU
during boot so it wasn't clear why this is now set on IVBs (though I don't
see them on Intel whitebox IVBs).



I'm also interested in the behaviour of CondChgd bit on Ivy Bridge processors.

Do you know something about this behaviour?

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-24 Thread HATAYAMA Daisuke
This is a resend patch from the previous v2 patch since there's no
response except for Don.

ChangeLog

  v2 => v3
  - Add Acked-by by Don Zickus
  - Rebase the patch based on v3.16-rc2
  - Rewrite patch description a little

  v1 => v2)
  - Add reason and the fact I know on CondChgd in patch description

From 63d6fa2a3dfa9ff7d1710c85d8d020cdc37f3b49 Mon Sep 17 00:00:00 2001
From: HATAYAMA Daisuke 
Date: Wed, 25 Jun 2014 10:09:07 +0900
Subject: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI
 handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, any NMI is falsely handled by a NMI handler of NMI watchdog
if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

For example, we use external NMI to make system panic to get crash
dump, but in this case, the external NMI is falsely handled do to the
issue.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.

As a result, if CondChgd bit is set, any NMI is falsely handled by the
NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
action is never performed until CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

  In 18.9.1:

  "The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
   indicate changes to the state of performancmonitoring hardware"

  In Table 35-2 IA-32 Architectural MSRs

  63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.

Signed-off-by: HATAYAMA Daisuke 
Acked-by: Don Zickus 
---
 arch/x86/kernel/cpu/perf_event_intel.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *))) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *))) {
-- 
1.9.3
N‹§²ζμrΈ›yϊθšΨb²X¬ΆΗ§vΨ^–)ήΊ{.nΗ+‰·₯Š{±‘κηzX§Ά›‘ά¨}©ž²Ζ 
zΪ:+v‰¨Ύ«‘κηzZ+€Κ+zf£’·hšˆ§~†­†Ϋi�ϋΰzΉ�w₯’Έ?™¨θ­Ϊ&’)ί’f”ω^jΗ«y§m…α@A«aΆΪ�
0Άμh�ε’i

[PATCH v3] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-24 Thread HATAYAMA Daisuke
This is a resend patch from the previous v2 patch since there's no
response except for Don.

ChangeLog

  v2 = v3
  - Add Acked-by by Don Zickus
  - Rebase the patch based on v3.16-rc2
  - Rewrite patch description a little

  v1 = v2)
  - Add reason and the fact I know on CondChgd in patch description

From 63d6fa2a3dfa9ff7d1710c85d8d020cdc37f3b49 Mon Sep 17 00:00:00 2001
From: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Date: Wed, 25 Jun 2014 10:09:07 +0900
Subject: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI
 handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, any NMI is falsely handled by a NMI handler of NMI watchdog
if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

For example, we use external NMI to make system panic to get crash
dump, but in this case, the external NMI is falsely handled do to the
issue.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.

As a result, if CondChgd bit is set, any NMI is falsely handled by the
NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
action is never performed until CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

  In 18.9.1:

  The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
   indicate changes to the state of performancmonitoring hardware

  In Table 35-2 IA-32 Architectural MSRs

  63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Acked-by: Don Zickus dzic...@redhat.com
---
 arch/x86/kernel/cpu/perf_event_intel.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *)status)) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *)status)) {
-- 
1.9.3
N‹§²ζμrΈ›yϊθšΨb²X¬ΆΗ§vΨ^–)ήΊ{.nΗ+‰·₯Š{±‘κηzX§Ά›‘ά¨}©ž²Ζ 
zΪj:+v‰¨Ύ«‘κηzZ+€Κ+zf£’·hšˆ§~†­†Ϋi�ϋΰzΉ�w₯’Έ?™¨θ­Ϊ’)ί’f”ω^jΗ«y§m…α@A«aΆΪ�
0Άμh�ε’i

[PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-13 Thread HATAYAMA Daisuke
Currently, a NMI handler for NMI watchdog may falsely handle any NMI
signaled for different purpose if CondChgd bit in
MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's. As a result, if CondChgd bit is set, any NMI is
falsely handled by the NMI handler of NMI watchdog. Also, if type of
the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or
NMI_IO_CHECK, the corresponding action is never performed until
CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

  In 18.9.1:

  "The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
   indicate changes to the state of performancmonitoring hardware"

  In Table 35-2 IA-32 Architectural MSRs

  63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.

Signed-off-by: HATAYAMA Daisuke 
---
 arch/x86/kernel/cpu/perf_event_intel.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *))) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *))) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-13 Thread HATAYAMA Daisuke
Currently, a NMI handler for NMI watchdog may falsely handle any NMI
signaled for different purpose if CondChgd bit in
MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
owner of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that the intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's. As a result, if CondChgd bit is set, any NMI is
falsely handled by the NMI handler of NMI watchdog. Also, if type of
the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or
NMI_IO_CHECK, the corresponding action is never performed until
CondChgd bit is cleared.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. Then the CondChgd bit is immediately cleared
by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem, Xeon E7540,
CondChgd bit is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out that, the descriptions I found are:

  In 18.9.1:

  The MSR_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
   indicate changes to the state of performancmonitoring hardware

  In Table 35-2 IA-32 Architectural MSRs

  63 CondChg: status bits of this register has changed.

These are different from the bahviour I see on the actual system as I
explained above.

At least, I think ignoring CondChgd bit should be enough for NMI
watchdog perspective.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
 arch/x86/kernel/cpu/perf_event_intel.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *)status)) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *)status)) {

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-12 Thread HATAYAMA Daisuke
From: Peter Zijlstra 
Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI 
handling
Date: Wed, 11 Jun 2014 13:54:13 +0200

> On Wed, Jun 11, 2014 at 10:54:48AM +0200, Peter Zijlstra wrote:
>> > I'm not sure about exact behavior of CondChgd bit, in particular when
>> > this bit is set. Although I read Intel System Programmer's Manual to
>> > figure out but I have yet completed that. At least, I think ignoring
>> > CondChgd bit should be enough for NMI watchdog perspective.
>> 
>> So yes, the SDM lists the bit as existing but never once mentions it
>> outside of that, and its been doing that at least back to 2008.
>> 
>> Ooh, I found it:
>> 
>>   "The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
>> indicate changes to the state of performance monitoring hardware (see
>> Figure 18-29)."
>> 
>> Which is of course completely useless, not to mention inconsistent with
>> the later CondChgd name.
>> 
>> HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
>> like having that set on boot? 
> 
> Matt found in the MSR listing for GLOBAL_STATUS:
> 
> 63 CondChg: status bits of this register has changed.  If CPUID.0AH: EAX[7:0] 
> > 0
> 
> Which brings us to a grand total of 3 different names for this bit.
> 
> If it indeed does what it says on the tin, set every time the status
> changes its like the most useless bit ever and I wonder why people
> bothered to spend silicon on it.
> 

Yes, I didn't mention in patch description this, but I reached the
same conclusion. The description confuses me because the desciption
and the behaviour of CondChg bit I see on the actual system is not
coincide.

Also, I checked cpuid on the system with Neharlem processor where I
have never seen CondChg bit is set.

[root@localhost ~]# ./cpuid -r
CPU 0:
   0x 0x00: eax=0x000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x0001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff

   0x000a 0x00: eax=0x07300403 ebx=0x0044 ecx=0x edx=0x0603
^^
So, cpuid tells that CondChg bit is supported on this processor, too.

> In any case, the proposed patch seems fine, just needs a better
> changelog.
> 

I see.

I'll write that the problem is that any NMI could be robbed by NMI
watchdog explicitly. Now only patch title says this explicitly. This
is your first comment.

About CondChgd bit, I cannot write more than I see on actual
system. If it's necessary to describe more about CondChgd bit, it
would be appreciated if someone tell me more information about it.

Thanks.
HATAYAMA, Daisuke
N‹§²ζμrΈ›yϊθšΨb²X¬ΆΗ§vΨ^–)ήΊ{.nΗ+‰·₯Š{±‘κηzX§Ά›‘ά¨}©ž²Ζ 
zΪ:+v‰¨Ύ«‘κηzZ+€Κ+zf£’·hšˆ§~†­†Ϋi�ϋΰzΉ�w₯’Έ?™¨θ­Ϊ&’)ί’f”ω^jΗ«y§m…α@A«aΆΪ�
0Άμh�ε’i

Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-12 Thread HATAYAMA Daisuke
From: Peter Zijlstra 
Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI 
handling
Date: Wed, 11 Jun 2014 10:54:48 +0200

> On Wed, Jun 11, 2014 at 04:30:28PM +0900, HATAYAMA Daisuke wrote:
>> Currently, a NMI handler for NMI watchdog may falsely handle any NMI
>> signaled for different purpose if CondChgd bit in
>> MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
>> 
>> This commit deals with the issue simply by ignoring CondChgd bit.
>> 
>> Here is explanation in detail.
>> 
>> On x86 NMI watchdog uses performance monitoring feature to
>> periodically signal NMI each time performance counter gets overflowed.
>> 
>> intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
>> handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
>> of a given NMI by looking at overflow status bits in
>> MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
>> handles the given NMI as its own NMI.
>> 
>> The problem is that intel_pmu_handle_irq() doesn't distinguish
>> CondChgd bit from other bits. Unlike the other status bits, CondChgd
>> bit doesn't represent overflow status for performance counters. Thus,
>> CondChgd bit cannot be thought of as a mark indicating a given NMI is
>> NMI watchdog's.
> 
> So what was the problem? It ate another NMI?
> 

Yes.

The handler for NMI watchdog is called in NMI_LOCAL. This is done in
earlier timing than NMI_UNKNOWN, and if some of the handlers in
NMI_LOCAL handles at least one NMI, then handlers in NMI_UNKNOWN are
not called. Like this:

handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
if (handled) {
/*
 * There are cases when a NMI handler handles multiple
 * events in the current NMI.  One of these events may
 * be queued for in the next NMI.  Because the event is
 * already handled, the next NMI will result in an unknown
 * NMI.  Instead lets flag this for a potential NMI to
 * swallow.
 */
if (handled > 1)
__this_cpu_write(swallow_nmi, true);
return;
}

For example, we use unknown NMI to make system panic by enabling
kernel.unkown_nmi_panic. Without this fix, unknown NMI is robbed by
NMI handler for NMI watchdog.

>> I noticed this behavior on systems with Ivy Bridge processors: Intel
>> Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
>> CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
>> in the beginning at boot. (then the CondChgd bit is cleared by next
>> wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)
>> 
>> On the other hand, on older processors such as Nehalem, CondChgd bit
>> is not set in the beginning at boot.
>> 
>> I'm not sure about exact behavior of CondChgd bit, in particular when
>> this bit is set. Although I read Intel System Programmer's Manual to
>> figure out but I have yet completed that. At least, I think ignoring
>> CondChgd bit should be enough for NMI watchdog perspective.
> 
> So yes, the SDM lists the bit as existing but never once mentions it
> outside of that, and its been doing that at least back to 2008.
> 
> Ooh, I found it:
> 
>   "The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
> indicate changes to the state of performance monitoring hardware (see
> Figure 18-29)."
> 
> Which is of course completely useless, not to mention inconsistent with
> the later CondChgd name.
> 
> HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
> like having that set on boot? 

--
Thanks.
HATAYAMA, Daisuke


Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-12 Thread HATAYAMA Daisuke
From: Peter Zijlstra pet...@infradead.org
Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI 
handling
Date: Wed, 11 Jun 2014 10:54:48 +0200

 On Wed, Jun 11, 2014 at 04:30:28PM +0900, HATAYAMA Daisuke wrote:
 Currently, a NMI handler for NMI watchdog may falsely handle any NMI
 signaled for different purpose if CondChgd bit in
 MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
 
 This commit deals with the issue simply by ignoring CondChgd bit.
 
 Here is explanation in detail.
 
 On x86 NMI watchdog uses performance monitoring feature to
 periodically signal NMI each time performance counter gets overflowed.
 
 intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
 handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
 of a given NMI by looking at overflow status bits in
 MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
 handles the given NMI as its own NMI.
 
 The problem is that intel_pmu_handle_irq() doesn't distinguish
 CondChgd bit from other bits. Unlike the other status bits, CondChgd
 bit doesn't represent overflow status for performance counters. Thus,
 CondChgd bit cannot be thought of as a mark indicating a given NMI is
 NMI watchdog's.
 
 So what was the problem? It ate another NMI?
 

Yes.

The handler for NMI watchdog is called in NMI_LOCAL. This is done in
earlier timing than NMI_UNKNOWN, and if some of the handlers in
NMI_LOCAL handles at least one NMI, then handlers in NMI_UNKNOWN are
not called. Like this:

handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
if (handled) {
/*
 * There are cases when a NMI handler handles multiple
 * events in the current NMI.  One of these events may
 * be queued for in the next NMI.  Because the event is
 * already handled, the next NMI will result in an unknown
 * NMI.  Instead lets flag this for a potential NMI to
 * swallow.
 */
if (handled  1)
__this_cpu_write(swallow_nmi, true);
return;
}

For example, we use unknown NMI to make system panic by enabling
kernel.unkown_nmi_panic. Without this fix, unknown NMI is robbed by
NMI handler for NMI watchdog.

 I noticed this behavior on systems with Ivy Bridge processors: Intel
 Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
 CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
 in the beginning at boot. (then the CondChgd bit is cleared by next
 wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)
 
 On the other hand, on older processors such as Nehalem, CondChgd bit
 is not set in the beginning at boot.
 
 I'm not sure about exact behavior of CondChgd bit, in particular when
 this bit is set. Although I read Intel System Programmer's Manual to
 figure out but I have yet completed that. At least, I think ignoring
 CondChgd bit should be enough for NMI watchdog perspective.
 
 So yes, the SDM lists the bit as existing but never once mentions it
 outside of that, and its been doing that at least back to 2008.
 
 Ooh, I found it:
 
   The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
 indicate changes to the state of performance monitoring hardware (see
 Figure 18-29).
 
 Which is of course completely useless, not to mention inconsistent with
 the later CondChgd name.
 
 HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
 like having that set on boot? 

--
Thanks.
HATAYAMA, Daisuke


Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-12 Thread HATAYAMA Daisuke
From: Peter Zijlstra pet...@infradead.org
Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI 
handling
Date: Wed, 11 Jun 2014 13:54:13 +0200

 On Wed, Jun 11, 2014 at 10:54:48AM +0200, Peter Zijlstra wrote:
  I'm not sure about exact behavior of CondChgd bit, in particular when
  this bit is set. Although I read Intel System Programmer's Manual to
  figure out but I have yet completed that. At least, I think ignoring
  CondChgd bit should be enough for NMI watchdog perspective.
 
 So yes, the SDM lists the bit as existing but never once mentions it
 outside of that, and its been doing that at least back to 2008.
 
 Ooh, I found it:
 
   The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
 indicate changes to the state of performance monitoring hardware (see
 Figure 18-29).
 
 Which is of course completely useless, not to mention inconsistent with
 the later CondChgd name.
 
 HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
 like having that set on boot? 
 
 Matt found in the MSR listing for GLOBAL_STATUS:
 
 63 CondChg: status bits of this register has changed.  If CPUID.0AH: EAX[7:0] 
  0
 
 Which brings us to a grand total of 3 different names for this bit.
 
 If it indeed does what it says on the tin, set every time the status
 changes its like the most useless bit ever and I wonder why people
 bothered to spend silicon on it.
 

Yes, I didn't mention in patch description this, but I reached the
same conclusion. The description confuses me because the desciption
and the behaviour of CondChg bit I see on the actual system is not
coincide.

Also, I checked cpuid on the system with Neharlem processor where I
have never seen CondChg bit is set.

[root@localhost ~]# ./cpuid -r
CPU 0:
   0x 0x00: eax=0x000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x0001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff
snip
   0x000a 0x00: eax=0x07300403 ebx=0x0044 ecx=0x edx=0x0603
^^
So, cpuid tells that CondChg bit is supported on this processor, too.

 In any case, the proposed patch seems fine, just needs a better
 changelog.
 

I see.

I'll write that the problem is that any NMI could be robbed by NMI
watchdog explicitly. Now only patch title says this explicitly. This
is your first comment.

About CondChgd bit, I cannot write more than I see on actual
system. If it's necessary to describe more about CondChgd bit, it
would be appreciated if someone tell me more information about it.

Thanks.
HATAYAMA, Daisuke
N‹§²ζμrΈ›yϊθšΨb²X¬ΆΗ§vΨ^–)ήΊ{.nΗ+‰·₯Š{±‘κηzX§Ά›‘ά¨}©ž²Ζ 
zΪj:+v‰¨Ύ«‘κηzZ+€Κ+zf£’·hšˆ§~†­†Ϋi�ϋΰzΉ�w₯’Έ?™¨θ­Ϊ’)ί’f”ω^jΗ«y§m…α@A«aΆΪ�
0Άμh�ε’i

[PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-11 Thread HATAYAMA Daisuke
Currently, a NMI handler for NMI watchdog may falsely handle any NMI
signaled for different purpose if CondChgd bit in
MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. (then the CondChgd bit is cleared by next
wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)

On the other hand, on older processors such as Nehalem, CondChgd bit
is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out but I have yet completed that. At least, I think ignoring
CondChgd bit should be enough for NMI watchdog perspective.

Signed-off-by: HATAYAMA Daisuke 
---
 arch/x86/kernel/cpu/perf_event_intel.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *))) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *))) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

2014-06-11 Thread HATAYAMA Daisuke
Currently, a NMI handler for NMI watchdog may falsely handle any NMI
signaled for different purpose if CondChgd bit in
MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is explanation in detail.

On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.

intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.

The problem is that intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. (then the CondChgd bit is cleared by next
wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)

On the other hand, on older processors such as Nehalem, CondChgd bit
is not set in the beginning at boot.

I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out but I have yet completed that. At least, I think ignoring
CondChgd bit should be enough for NMI watchdog perspective.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
 arch/x86/kernel/cpu/perf_event_intel.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
 
/*
+* CondChgd bit 63 doesn't mean any overflow status. Ignore
+* and clear the bit.
+*/
+   if (__test_and_clear_bit(63, (unsigned long *)status)) {
+   if (!status)
+   goto done;
+   }
+
+   /*
 * PEBS overflow sets bit 62 in the global status register
 */
if (__test_and_clear_bit(62, (unsigned long *)status)) {

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process

2014-01-26 Thread HATAYAMA Daisuke
(2014/01/26 13:02), David Rientjes wrote:
> On Thu, 16 Jan 2014, HATAYAMA Daisuke wrote:
> 
>> Hello,
>>
>> This patch deals with the issue of handling boot_cpu_physical_apicid
>> in boot process I avoided in disable_cpu_apicid patch because I
>> cannot guess how long it needs to take for the review of this fix.
>>
>> This patch is made on top of today's x86/apic branch of tip tree.
>> Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb
>>
> 
> This breaks with SGI 320/540:
> 
> arch/x86/platform/visws/visws_quirks.c: In function ‘MP_processor_info’:
> arch/x86/platform/visws/visws_quirks.c:169:3: error: 
> ‘bios_cpu_physical_apicid’ undeclared (first use in this function)
> arch/x86/platform/visws/visws_quirks.c:169:3: note: each undeclared 
> identifier is reported only once for each function it appears in
> 

Thanks. I'll retry with CONFIG_X86_VISWS...

obj-$(CONFIG_X86_VISWS) += visws_quirks.o

> It makes it pretty apparent that you want a different name for the new
> variable, bios_bsp_physical_apicid is just too close to
> boot_cpu_physical_apicid that even the author of the patch missed a
> conversion.
> 

``bsp'' and ``boot cpu'' are different, but certainly they are close
for most of people who don't need to focus the difference.

One idea is not to add bios_bsp_physical_apicid, just removing
assignment to boot_cpu_physical_apicid from MP_processor_info().
I guess currently no one uses apicid provided by MP table
since boot_cpu_physical_apicid is finally initialized in
init_apic_mappings().

> (It's also really sad that we can't make it __initdata since it's only
> valid at boot because of the hotplug stuff.)
> 


-- 
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process

2014-01-26 Thread HATAYAMA Daisuke
(2014/01/26 13:02), David Rientjes wrote:
 On Thu, 16 Jan 2014, HATAYAMA Daisuke wrote:
 
 Hello,

 This patch deals with the issue of handling boot_cpu_physical_apicid
 in boot process I avoided in disable_cpu_apicid patch because I
 cannot guess how long it needs to take for the review of this fix.

 This patch is made on top of today's x86/apic branch of tip tree.
 Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb

 
 This breaks with SGI 320/540:
 
 arch/x86/platform/visws/visws_quirks.c: In function ‘MP_processor_info’:
 arch/x86/platform/visws/visws_quirks.c:169:3: error: 
 ‘bios_cpu_physical_apicid’ undeclared (first use in this function)
 arch/x86/platform/visws/visws_quirks.c:169:3: note: each undeclared 
 identifier is reported only once for each function it appears in
 

Thanks. I'll retry with CONFIG_X86_VISWS...

obj-$(CONFIG_X86_VISWS) += visws_quirks.o

 It makes it pretty apparent that you want a different name for the new
 variable, bios_bsp_physical_apicid is just too close to
 boot_cpu_physical_apicid that even the author of the patch missed a
 conversion.
 

``bsp'' and ``boot cpu'' are different, but certainly they are close
for most of people who don't need to focus the difference.

One idea is not to add bios_bsp_physical_apicid, just removing
assignment to boot_cpu_physical_apicid from MP_processor_info().
I guess currently no one uses apicid provided by MP table
since boot_cpu_physical_apicid is finally initialized in
init_apic_mappings().

 (It's also really sad that we can't make it __initdata since it's only
 valid at boot because of the hotplug stuff.)
 


-- 
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process

2014-01-16 Thread HATAYAMA Daisuke
Hello,

This patch deals with the issue of handling boot_cpu_physical_apicid
in boot process I avoided in disable_cpu_apicid patch because I
cannot guess how long it needs to take for the review of this fix.

This patch is made on top of today's x86/apic branch of tip tree.
Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb

I tested this patch in each combination of the following table:

  BIOS tableboot cpu
-+--
  ACPI MADT AP
  MP Table  BSP


>From 10946038252fc91f3ae2740953b2484ea0076ed4 Mon Sep 17 00:00:00 2001
From: HATAYAMA Daisuke 
Date: Thu, 16 Jan 2014 15:47:25 +0900
Subject: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in
 boot process

boot_cpu_physical_apicid variable is intended to have the apicid of
the processor that is doing the boot up by read_apic_id() that picks
up the value from APIC_ID register of local APIC.

boot_cpu_physical_apicid variable could be initialized with the value
from MP table if there's no acpi support on the system or acpi=off is
manually specified.

The first issue is that if kexec boots the 2nd kernel on some AP, the
processor that is doing the boot up of the 2nd kernel is no longer
BSP. This means that boot_cpu_physical_apicid is modified to
unintended BSP's apicid.

The second issue is that in the current code, once the
boot_cpu_physical_apicid variable is modified by the value from MP
table, it is again modified back to read_apic_id().

Concretely, they are processed as follows in setup_arch():

  1) boot_cpu_physical_apicid is first initialized by read_apic_id()
 in acpi_boot_init(), then
  2) re-initialized by get_smp_config() and finally
  3) modified back by read_apic_id() in init_apic_mappings().

In source code:

/*
 * Read APIC and some other early information from ACPI tables.
 */
acpi_boot_init();   <== 1)
sfi_init();
x86_dtb_init();

/*
 * get boot-time SMP configuration:
 */
if (smp_found_config)
get_smp_config();   <== 2)

prefill_possible_map();

init_cpu_to_node();

init_apic_mappings();   <== 3)

It is confusing that meaning of the variable changes in the middle of
boot up.

To fix these issues, this patch adds bios_bsp_physical_apicid variable
to store the apicid value reported by BIOS via ACPI MADT or MP table,
by which boot_cpu_physical_apicid variable is cleanly handled in boot
up.

This patch also fixes the workaround done in generic_processor_info()
that directly calls read_apic_id() to avoid the first issue.

Signed-off-by: HATAYAMA Daisuke 
---
 arch/x86/include/asm/mpspec.h  |  1 +
 arch/x86/kernel/acpi/boot.c|  8 
 arch/x86/kernel/apic/apic.c| 31 ++-
 arch/x86/kernel/mpparse.c  |  2 +-
 arch/x86/platform/visws/visws_quirks.c |  2 +-
 5 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 3142a94..724853f 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,6 +47,7 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
 extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
 
 extern unsigned int boot_cpu_physical_apicid;
+extern unsigned int bios_bsp_physical_apicid;
 extern unsigned int max_physical_apicid;
 extern int mpc_default_type;
 extern unsigned long mp_lapic_addr;
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 6c0b43b..4f460e2 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -213,6 +213,14 @@ static int acpi_register_lapic(int id, u8 enabled)
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
+   /*
+* ACPI specification describes that platform firmware should
+* list the BSP processor as the first LAPIC entry in the
+* MADT.
+*/
+   if (!num_processors && !disabled_cpus)
+   bios_bsp_physical_apicid = id;
+
return generic_processor_info(id, ver);
 }
 
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 7f26c9a..107614f 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -64,6 +64,15 @@ unsigned disabled_cpus;
 unsigned int boot_cpu_physical_apicid = -1U;
 EXPORT_SYMBOL_GPL(boot_cpu_physical_apicid);
 
+/* Processor with BP flag on IP32_APIC_BASE MSR to be initialized with
+ * the value reported by BIOS tables such as ACPI MADT or MP table.
+ *
+ * If kexec boots the 2nd kernel on some AP, this differs from the
+ * processor that is doing the boot up: boot_cpu_physical_apicid !=
+ * bios_bsp_physical_apicid.
+ */
+unsigned int bios_bsp_physical_apicid = BAD_APICID;
+
 /*
  * The highest APIC ID seen during enumeration.
  */
@@ -2120,28 +2129,8 @@ int generic_processor_info(int

[PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in boot process

2014-01-16 Thread HATAYAMA Daisuke
Hello,

This patch deals with the issue of handling boot_cpu_physical_apicid
in boot process I avoided in disable_cpu_apicid patch because I
cannot guess how long it needs to take for the review of this fix.

This patch is made on top of today's x86/apic branch of tip tree.
Its commit hash is 5b4d1dbc24bb6fd7179ada0f47be34e27e64decb

I tested this patch in each combination of the following table:

  BIOS tableboot cpu
-+--
  ACPI MADT AP
  MP Table  BSP


From 10946038252fc91f3ae2740953b2484ea0076ed4 Mon Sep 17 00:00:00 2001
From: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Date: Thu, 16 Jan 2014 15:47:25 +0900
Subject: [PATCH] x86, apic: clean up handling of boot_cpu_physical_apicid in
 boot process

boot_cpu_physical_apicid variable is intended to have the apicid of
the processor that is doing the boot up by read_apic_id() that picks
up the value from APIC_ID register of local APIC.

boot_cpu_physical_apicid variable could be initialized with the value
from MP table if there's no acpi support on the system or acpi=off is
manually specified.

The first issue is that if kexec boots the 2nd kernel on some AP, the
processor that is doing the boot up of the 2nd kernel is no longer
BSP. This means that boot_cpu_physical_apicid is modified to
unintended BSP's apicid.

The second issue is that in the current code, once the
boot_cpu_physical_apicid variable is modified by the value from MP
table, it is again modified back to read_apic_id().

Concretely, they are processed as follows in setup_arch():

  1) boot_cpu_physical_apicid is first initialized by read_apic_id()
 in acpi_boot_init(), then
  2) re-initialized by get_smp_config() and finally
  3) modified back by read_apic_id() in init_apic_mappings().

In source code:

/*
 * Read APIC and some other early information from ACPI tables.
 */
acpi_boot_init();   == 1)
sfi_init();
x86_dtb_init();

/*
 * get boot-time SMP configuration:
 */
if (smp_found_config)
get_smp_config();   == 2)

prefill_possible_map();

init_cpu_to_node();

init_apic_mappings();   == 3)

It is confusing that meaning of the variable changes in the middle of
boot up.

To fix these issues, this patch adds bios_bsp_physical_apicid variable
to store the apicid value reported by BIOS via ACPI MADT or MP table,
by which boot_cpu_physical_apicid variable is cleanly handled in boot
up.

This patch also fixes the workaround done in generic_processor_info()
that directly calls read_apic_id() to avoid the first issue.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
 arch/x86/include/asm/mpspec.h  |  1 +
 arch/x86/kernel/acpi/boot.c|  8 
 arch/x86/kernel/apic/apic.c| 31 ++-
 arch/x86/kernel/mpparse.c  |  2 +-
 arch/x86/platform/visws/visws_quirks.c |  2 +-
 5 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 3142a94..724853f 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,6 +47,7 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
 extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
 
 extern unsigned int boot_cpu_physical_apicid;
+extern unsigned int bios_bsp_physical_apicid;
 extern unsigned int max_physical_apicid;
 extern int mpc_default_type;
 extern unsigned long mp_lapic_addr;
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 6c0b43b..4f460e2 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -213,6 +213,14 @@ static int acpi_register_lapic(int id, u8 enabled)
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
+   /*
+* ACPI specification describes that platform firmware should
+* list the BSP processor as the first LAPIC entry in the
+* MADT.
+*/
+   if (!num_processors  !disabled_cpus)
+   bios_bsp_physical_apicid = id;
+
return generic_processor_info(id, ver);
 }
 
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 7f26c9a..107614f 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -64,6 +64,15 @@ unsigned disabled_cpus;
 unsigned int boot_cpu_physical_apicid = -1U;
 EXPORT_SYMBOL_GPL(boot_cpu_physical_apicid);
 
+/* Processor with BP flag on IP32_APIC_BASE MSR to be initialized with
+ * the value reported by BIOS tables such as ACPI MADT or MP table.
+ *
+ * If kexec boots the 2nd kernel on some AP, this differs from the
+ * processor that is doing the boot up: boot_cpu_physical_apicid !=
+ * bios_bsp_physical_apicid.
+ */
+unsigned int bios_bsp_physical_apicid = BAD_APICID;
+
 /*
  * The highest APIC ID seen during enumeration.
  */
@@ -2120,28 +2129,8 @@ int

Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos

2014-01-15 Thread HATAYAMA Daisuke

(2014/01/16 14:53), H. Peter Anvin wrote:

On 01/15/2014 08:44 PM, HATAYAMA Daisuke wrote:


This is not typo in my intention.

generic_processor_info() has two more cases where it ignores cpus.
In either cases, printed messages are tagged with "ACPI" because this
function is called when parsing ACPI MADT table in acpi_boot_init();
this function is also being used to parse other kind of tables but
the "ACPI" tag would mean that the function was first for ACPI only.



But it has nothing to do with ACPI -- it is an APIC ID from the command
line -- so that would be actively misleading.

-hpa



I never disagree to the fix itself. I wanted to explain why I wrote so.

I thought it was better to unify tags in the same function because they
should bleong to the same component, here I mean ACPI, but it's better
to avoid the confusion, which is bigger impact.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos

2014-01-15 Thread HATAYAMA Daisuke

(2014/01/16 6:09), tip-bot for H. Peter Anvin wrote:

Commit-ID:  5b4d1dbc24bb6fd7179ada0f47be34e27e64decb
Gitweb: http://git.kernel.org/tip/5b4d1dbc24bb6fd7179ada0f47be34e27e64decb
Author: H. Peter Anvin 
AuthorDate: Wed, 15 Jan 2014 13:02:08 -0800
Committer:  H. Peter Anvin 
CommitDate: Wed, 15 Jan 2014 13:02:08 -0800

x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos

Make disabled_cpu_apicid static and read_mostly, and fix a couple of
typos.

Reported-by: Ingo Molnar 
Link: http://lkml.kernel.org/r/20140115182511.ga22...@gmail.com
Signed-off-by: H. Peter Anvin 
Cc: HATAYAMA Daisuke 
---
  arch/x86/kernel/apic/apic.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e78ab8c..7f26c9a 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -79,7 +79,7 @@ physid_mask_t phys_cpu_present_map;
   * disable_cpu_apicid=, mostly used for the kdump 2nd kernel to
   * avoid undefined behaviour caused by sending INIT from AP to BSP.
   */
-unsigned int disabled_cpu_apicid = BAD_APICID;
+static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;

  /*
   * Map cpu index to physical APIC ID
@@ -2124,7 +2124,7 @@ int generic_processor_info(int apicid, int version)
 * boot_cpu_physical_apicid is designed to have the apicid
 * returned by read_apic_id(), i.e, the apicid of the
 * currently booting-up processor. However, on some platforms,
-* it is temporarilly modified by the apicid reported as BSP
+* it is temporarily modified by the apicid reported as BSP
 * through MP table. Concretely:
 *
 * - arch/x86/kernel/mpparse.c: MP_processor_info()
@@ -2145,7 +2145,7 @@ int generic_processor_info(int apicid, int version)
disabled_cpu_apicid == apicid) {
int thiscpu = num_processors + disabled_cpus;

-   pr_warning("ACPI: Disabling requested cpu."
+   pr_warning("APIC: Disabling requested cpu."
   " Processor %d/0x%x ignored.\n",
   thiscpu, apicid);


This is not typo in my intention.

generic_processor_info() has two more cases where it ignores cpus.
In either cases, printed messages are tagged with "ACPI" because this
function is called when parsing ACPI MADT table in acpi_boot_init();
this function is also being used to parse other kind of tables but
the "ACPI" tag would mean that the function was first for ACPI only.

int generic_processor_info(int apicid, int version)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
phys_cpu_present_map);

/*
 * If boot cpu has not been detected yet, then only allow upto
 * nr_cpu_ids - 1 processors and keep one slot free for boot cpu
 */
if (!boot_cpu_detected && num_processors >= nr_cpu_ids - 1 &&
apicid != boot_cpu_physical_apicid) {
int thiscpu = max + disabled_cpus - 1;

pr_warning(
"ACPI: NR_CPUS/possible_cpus limit of %i almost"
" reached. Keeping one slot for boot cpu."
"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);

disabled_cpus++;
return -ENODEV;
}

if (num_processors >= nr_cpu_ids) {
int thiscpu = max + disabled_cpus;

pr_warning(
"ACPI: NR_CPUS/possible_cpus limit of %i reached."
"  Processor %d/0x%x ignored.\n", max, thiscpu, apicid);

disabled_cpus++;
return -EINVAL;
}


--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/apic] x86, apic, kexec: Add disable_cpu_apicid kernel parameter

2014-01-15 Thread tip-bot for HATAYAMA Daisuke
Commit-ID:  151e0c7de616310f95393d9306903900fcd8b277
Gitweb: http://git.kernel.org/tip/151e0c7de616310f95393d9306903900fcd8b277
Author: HATAYAMA Daisuke 
AuthorDate: Wed, 15 Jan 2014 15:44:58 +0900
Committer:  H. Peter Anvin 
CommitDate: Wed, 15 Jan 2014 09:19:20 -0800

x86, apic, kexec: Add disable_cpu_apicid kernel parameter

Add disable_cpu_apicid kernel parameter. To use this kernel parameter,
specify an initial APIC ID of the corresponding CPU you want to
disable.

This is mostly used for the kdump 2nd kernel to disable BSP to wake up
multiple CPUs without causing system reset or hang due to sending INIT
from AP to BSP.

Kdump users first figure out initial APIC ID of the BSP, CPU0 in the
1st kernel, for example from /proc/cpuinfo and then set up this kernel
parameter for the 2nd kernel using the obtained APIC ID.

However, doing this procedure at each boot time manually is awkward,
which should be automatically done by user-land service scripts, for
example, kexec-tools on fedora/RHEL distributions.

This design is more flexible than disabling BSP in kernel boot time
automatically in that in kernel boot time we have no choice but
referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning
that the method is not applicable to the systems without such BIOS
tables.

One assumption behind this design is that users get initial APIC ID of
the BSP in still healthy state and so BSP is uniquely kept in
CPU0. Thus, through the kernel parameter, only one initial APIC ID can
be specified.

In a comparison with disabled_cpu_apicid, we use read_apic_id(), not
boot_cpu_physical_apicid, because on some platforms, the variable is
modified to the apicid reported as BSP through MP table and this
function is executed with the temporarily modified
boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel
parameter doesn't work well for apicids of APs.

Fixing the wrong handling of boot_cpu_physical_apicid requires some
reviews and tests beyond some platforms and it could take some
time. The fix here is a kind of workaround to focus on the main topic
of this patch.

Signed-off-by: HATAYAMA Daisuke 
Link: 
http://lkml.kernel.org/r/20140115064458.1545.38775.stgit@localhost6.localdomain6
Signed-off-by: H. Peter Anvin 
---
 Documentation/kernel-parameters.txt |  9 +++
 arch/x86/kernel/apic/apic.c | 49 +
 2 files changed, 58 insertions(+)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index b9e9bd8..6fdbf8c 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -774,6 +774,15 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
disable=[IPV6]
See Documentation/networking/ipv6.txt.
 
+   disable_cpu_apicid= [X86,APIC,SMP]
+   Format: 
+   The number of initial APIC ID for the
+   corresponding CPU to be disabled at boot,
+   mostly used for the kdump 2nd kernel to
+   disable BSP to wake up multiple CPUs without
+   causing system reset or hang due to sending
+   INIT from AP to BSP.
+
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 4ec1dd6..e78ab8c 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -75,6 +75,13 @@ unsigned int max_physical_apicid;
 physid_mask_t phys_cpu_present_map;
 
 /*
+ * Processor to be disabled specified by kernel parameter
+ * disable_cpu_apicid=, mostly used for the kdump 2nd kernel to
+ * avoid undefined behaviour caused by sending INIT from AP to BSP.
+ */
+unsigned int disabled_cpu_apicid = BAD_APICID;
+
+/*
  * Map cpu index to physical APIC ID
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -2114,6 +2121,39 @@ int generic_processor_info(int apicid, int version)
phys_cpu_present_map);
 
/*
+* boot_cpu_physical_apicid is designed to have the apicid
+* returned by read_apic_id(), i.e, the apicid of the
+* currently booting-up processor. However, on some platforms,
+* it is temporarilly modified by the apicid reported as BSP
+* through MP table. Concretely:
+*
+* - arch/x86/kernel/mpparse.c: MP_processor_info()
+* - arch/x86/mm/amdtopology.c: amd_numa_init()
+* - arch/x86/platform/visws/visws_quirks.c: MP_processor_info()
+*
+* This function is executed with the modified
+* boot_cpu_physical_apicid. So, disabled_cpu_apicid kernel
+* parameter doesn't work to disable APs on kdump 2nd kernel.
+*
+* Since fixing handling

[tip:x86/apic] x86, apic, kexec: Add disable_cpu_apicid kernel parameter

2014-01-15 Thread tip-bot for HATAYAMA Daisuke
Commit-ID:  151e0c7de616310f95393d9306903900fcd8b277
Gitweb: http://git.kernel.org/tip/151e0c7de616310f95393d9306903900fcd8b277
Author: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
AuthorDate: Wed, 15 Jan 2014 15:44:58 +0900
Committer:  H. Peter Anvin h...@zytor.com
CommitDate: Wed, 15 Jan 2014 09:19:20 -0800

x86, apic, kexec: Add disable_cpu_apicid kernel parameter

Add disable_cpu_apicid kernel parameter. To use this kernel parameter,
specify an initial APIC ID of the corresponding CPU you want to
disable.

This is mostly used for the kdump 2nd kernel to disable BSP to wake up
multiple CPUs without causing system reset or hang due to sending INIT
from AP to BSP.

Kdump users first figure out initial APIC ID of the BSP, CPU0 in the
1st kernel, for example from /proc/cpuinfo and then set up this kernel
parameter for the 2nd kernel using the obtained APIC ID.

However, doing this procedure at each boot time manually is awkward,
which should be automatically done by user-land service scripts, for
example, kexec-tools on fedora/RHEL distributions.

This design is more flexible than disabling BSP in kernel boot time
automatically in that in kernel boot time we have no choice but
referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning
that the method is not applicable to the systems without such BIOS
tables.

One assumption behind this design is that users get initial APIC ID of
the BSP in still healthy state and so BSP is uniquely kept in
CPU0. Thus, through the kernel parameter, only one initial APIC ID can
be specified.

In a comparison with disabled_cpu_apicid, we use read_apic_id(), not
boot_cpu_physical_apicid, because on some platforms, the variable is
modified to the apicid reported as BSP through MP table and this
function is executed with the temporarily modified
boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel
parameter doesn't work well for apicids of APs.

Fixing the wrong handling of boot_cpu_physical_apicid requires some
reviews and tests beyond some platforms and it could take some
time. The fix here is a kind of workaround to focus on the main topic
of this patch.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Link: 
http://lkml.kernel.org/r/20140115064458.1545.38775.stgit@localhost6.localdomain6
Signed-off-by: H. Peter Anvin h...@zytor.com
---
 Documentation/kernel-parameters.txt |  9 +++
 arch/x86/kernel/apic/apic.c | 49 +
 2 files changed, 58 insertions(+)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index b9e9bd8..6fdbf8c 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -774,6 +774,15 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
disable=[IPV6]
See Documentation/networking/ipv6.txt.
 
+   disable_cpu_apicid= [X86,APIC,SMP]
+   Format: int
+   The number of initial APIC ID for the
+   corresponding CPU to be disabled at boot,
+   mostly used for the kdump 2nd kernel to
+   disable BSP to wake up multiple CPUs without
+   causing system reset or hang due to sending
+   INIT from AP to BSP.
+
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 4ec1dd6..e78ab8c 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -75,6 +75,13 @@ unsigned int max_physical_apicid;
 physid_mask_t phys_cpu_present_map;
 
 /*
+ * Processor to be disabled specified by kernel parameter
+ * disable_cpu_apicid=int, mostly used for the kdump 2nd kernel to
+ * avoid undefined behaviour caused by sending INIT from AP to BSP.
+ */
+unsigned int disabled_cpu_apicid = BAD_APICID;
+
+/*
  * Map cpu index to physical APIC ID
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -2114,6 +2121,39 @@ int generic_processor_info(int apicid, int version)
phys_cpu_present_map);
 
/*
+* boot_cpu_physical_apicid is designed to have the apicid
+* returned by read_apic_id(), i.e, the apicid of the
+* currently booting-up processor. However, on some platforms,
+* it is temporarilly modified by the apicid reported as BSP
+* through MP table. Concretely:
+*
+* - arch/x86/kernel/mpparse.c: MP_processor_info()
+* - arch/x86/mm/amdtopology.c: amd_numa_init()
+* - arch/x86/platform/visws/visws_quirks.c: MP_processor_info()
+*
+* This function is executed with the modified
+* boot_cpu_physical_apicid. So, disabled_cpu_apicid kernel
+* parameter doesn't work to disable APs

Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos

2014-01-15 Thread HATAYAMA Daisuke

(2014/01/16 6:09), tip-bot for H. Peter Anvin wrote:

Commit-ID:  5b4d1dbc24bb6fd7179ada0f47be34e27e64decb
Gitweb: http://git.kernel.org/tip/5b4d1dbc24bb6fd7179ada0f47be34e27e64decb
Author: H. Peter Anvin h...@zytor.com
AuthorDate: Wed, 15 Jan 2014 13:02:08 -0800
Committer:  H. Peter Anvin h...@zytor.com
CommitDate: Wed, 15 Jan 2014 13:02:08 -0800

x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos

Make disabled_cpu_apicid static and read_mostly, and fix a couple of
typos.

Reported-by: Ingo Molnar mi...@kernel.org
Link: http://lkml.kernel.org/r/20140115182511.ga22...@gmail.com
Signed-off-by: H. Peter Anvin h...@linux.intel.com
Cc: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
---
  arch/x86/kernel/apic/apic.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e78ab8c..7f26c9a 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -79,7 +79,7 @@ physid_mask_t phys_cpu_present_map;
   * disable_cpu_apicid=int, mostly used for the kdump 2nd kernel to
   * avoid undefined behaviour caused by sending INIT from AP to BSP.
   */
-unsigned int disabled_cpu_apicid = BAD_APICID;
+static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;

  /*
   * Map cpu index to physical APIC ID
@@ -2124,7 +2124,7 @@ int generic_processor_info(int apicid, int version)
 * boot_cpu_physical_apicid is designed to have the apicid
 * returned by read_apic_id(), i.e, the apicid of the
 * currently booting-up processor. However, on some platforms,
-* it is temporarilly modified by the apicid reported as BSP
+* it is temporarily modified by the apicid reported as BSP
 * through MP table. Concretely:
 *
 * - arch/x86/kernel/mpparse.c: MP_processor_info()
@@ -2145,7 +2145,7 @@ int generic_processor_info(int apicid, int version)
disabled_cpu_apicid == apicid) {
int thiscpu = num_processors + disabled_cpus;

-   pr_warning(ACPI: Disabling requested cpu.
+   pr_warning(APIC: Disabling requested cpu.
Processor %d/0x%x ignored.\n,
   thiscpu, apicid);


This is not typo in my intention.

generic_processor_info() has two more cases where it ignores cpus.
In either cases, printed messages are tagged with ACPI because this
function is called when parsing ACPI MADT table in acpi_boot_init();
this function is also being used to parse other kind of tables but
the ACPI tag would mean that the function was first for ACPI only.

int generic_processor_info(int apicid, int version)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
phys_cpu_present_map);

/*
 * If boot cpu has not been detected yet, then only allow upto
 * nr_cpu_ids - 1 processors and keep one slot free for boot cpu
 */
if (!boot_cpu_detected  num_processors = nr_cpu_ids - 1 
apicid != boot_cpu_physical_apicid) {
int thiscpu = max + disabled_cpus - 1;

pr_warning(
ACPI: NR_CPUS/possible_cpus limit of %i almost
 reached. Keeping one slot for boot cpu.
  Processor %d/0x%x ignored.\n, max, thiscpu, apicid);

disabled_cpus++;
return -ENODEV;
}

if (num_processors = nr_cpu_ids) {
int thiscpu = max + disabled_cpus;

pr_warning(
ACPI: NR_CPUS/possible_cpus limit of %i reached.
  Processor %d/0x%x ignored.\n, max, thiscpu, apicid);

disabled_cpus++;
return -EINVAL;
}


--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:x86/apic] x86, apic: Make disabled_cpu_apicid static read_mostly, fix typos

2014-01-15 Thread HATAYAMA Daisuke

(2014/01/16 14:53), H. Peter Anvin wrote:

On 01/15/2014 08:44 PM, HATAYAMA Daisuke wrote:


This is not typo in my intention.

generic_processor_info() has two more cases where it ignores cpus.
In either cases, printed messages are tagged with ACPI because this
function is called when parsing ACPI MADT table in acpi_boot_init();
this function is also being used to parse other kind of tables but
the ACPI tag would mean that the function was first for ACPI only.



But it has nothing to do with ACPI -- it is an APIC ID from the command
line -- so that would be actively misleading.

-hpa



I never disagree to the fix itself. I wanted to explain why I wrote so.

I thought it was better to unify tags in the same function because they
should bleong to the same component, here I mean ACPI, but it's better
to avoid the confusion, which is bigger impact.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RESEND PATCH v10] x86, apic, kexec, Documentation: Add disable_cpu_apicid kernel parameter

2014-01-14 Thread HATAYAMA Daisuke
Add disable_cpu_apicid kernel parameter. To use this kernel parameter,
specify an initial APIC ID of the corresponding CPU you want to
disable.

This is mostly used for the kdump 2nd kernel to disable BSP to wake up
multiple CPUs without causing system reset or hang due to sending INIT
from AP to BSP.

Kdump users first figure out initial APIC ID of the BSP, CPU0 in the
1st kernel, for example from /proc/cpuinfo and then set up this kernel
parameter for the 2nd kernel using the obtained APIC ID.

However, doing this procedure at each boot time manually is awkward,
which should be automatically done by user-land service scripts, for
example, kexec-tools on fedora/RHEL distributions.

This design is more flexible than disabling BSP in kernel boot time
automatically in that in kernel boot time we have no choice but
referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning
that the method is not applicable to the systems without such BIOS
tables.

One assumption behind this design is that users get initial APIC ID of
the BSP in still healthy state and so BSP is uniquely kept in
CPU0. Thus, through the kernel parameter, only one initial APIC ID can
be specified.

In a comparison with disabled_cpu_apicid, we use read_apic_id(), not
boot_cpu_physical_apicid, because on some platforms, the variable is
modified to the apicid reported as BSP through MP table and this
function is executed with the temporarily modified
boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel
parameter doesn't work well for apicids of APs.

Fixing the wrong handling of boot_cpu_physical_apicid requires some
reviews and tests beyond some platforms and it could take some
time. The fix here is a kind of workaround to focus on the main topic
of this patch.

Signed-off-by: HATAYAMA Daisuke 
---
 Documentation/kernel-parameters.txt |9 ++
 arch/x86/kernel/apic/apic.c |   49 +++
 2 files changed, 58 insertions(+)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 50680a5..4e5528c 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -774,6 +774,15 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
disable=[IPV6]
See Documentation/networking/ipv6.txt.
 
+   disable_cpu_apicid= [X86,APIC,SMP]
+   Format: 
+   The number of initial APIC ID for the
+   corresponding CPU to be disabled at boot,
+   mostly used for the kdump 2nd kernel to
+   disable BSP to wake up multiple CPUs without
+   causing system reset or hang due to sending
+   INIT from AP to BSP.
+
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d278736..6c0b7d5 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -75,6 +75,13 @@ unsigned int max_physical_apicid;
 physid_mask_t phys_cpu_present_map;
 
 /*
+ * Processor to be disabled specified by kernel parameter
+ * disable_cpu_apicid=, mostly used for the kdump 2nd kernel to
+ * avoid undefined behaviour caused by sending INIT from AP to BSP.
+ */
+unsigned int disabled_cpu_apicid = BAD_APICID;
+
+/*
  * Map cpu index to physical APIC ID
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -2115,6 +2122,39 @@ int generic_processor_info(int apicid, int version)
phys_cpu_present_map);
 
/*
+* boot_cpu_physical_apicid is designed to have the apicid
+* returned by read_apic_id(), i.e, the apicid of the
+* currently booting-up processor. However, on some platforms,
+* it is temporarilly modified by the apicid reported as BSP
+* through MP table. Concretely:
+*
+* - arch/x86/kernel/mpparse.c: MP_processor_info()
+* - arch/x86/mm/amdtopology.c: amd_numa_init()
+* - arch/x86/platform/visws/visws_quirks.c: MP_processor_info()
+*
+* This function is executed with the modified
+* boot_cpu_physical_apicid. So, disabled_cpu_apicid kernel
+* parameter doesn't work to disable APs on kdump 2nd kernel.
+*
+* Since fixing handling of boot_cpu_physical_apicid requires
+* another discussion and tests on each platform, we leave it
+* for now and here we use read_apic_id() directly in this
+* function, generic_processor_info().
+*/
+   if (disabled_cpu_apicid != BAD_APICID &&
+   disabled_cpu_apicid != read_apic_id() &&
+   disabled_cpu_apicid == apicid) {
+   int thiscpu = num_processors + disabled_cpus;
+
+   

  1   2   3   4   5   6   7   8   >