Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-16 Thread Vivek Goyal
On Tue, Jul 16, 2013 at 05:37:09PM +0200, Michael Holzheu wrote:

[..]
> The problem is that with the mmap patches we now use copy_oldmem_page()
> to copy the notes from oldmem into the notes_buf which has been allocated
> with vmalloc. The s390 version of copy_oldmem_page() bypasses the page
> tables and uses real copy. And this does not work on vmalloc memory.

I got that. I am trying to remember why, in 3.10, we drop to real mode
on s390 while copying pages.

I thought dropping to real mode was needed only to access the HSA region.
The HSA region comes into the picture only for zfcpdump, and /proc/vmcore
support for zfcpdump is not there yet.

IOW, what if we start copying with page tables enabled on s390? Would that
not work for the regular kdump case, with zfcpdump taken care of in 3.12?

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-16 Thread Michael Holzheu
On Tue, 16 Jul 2013 10:04:18 -0400
Vivek Goyal  wrote:

> On Tue, Jul 16, 2013 at 11:25:27AM +0200, Michael Holzheu wrote:
> 
> [..]
> > > > Hello Vivek and Andrew,
> > > > 
> > > > We just realized that Hatayama's mmap patches went into v3.11-rc1.
> > > > This currently breaks s390 kdump because of the following two issues:
> > > > 
> > > > 1) copy_oldmem_page() is now used for copying to vmalloc memory
> > > > 2) The mmap() implementation is not compatible with the current
> > > >    s390 crashkernel swap:
> > > >    See: http://marc.info/?l=kexec&m=136940802511603&w=2
> > > > 
> > > > The "kdump: Introduce ELF header in new memory feature" patch series
> > > > will fix both issues for s390.
> > > > 
> > > > There is one small open discussion left:
> > > > 
> > > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html
> > > > 
> > > > But once we have finished that, would it be possible to get the
> > > > patches in 3.11?
> > > 
> > > How about taking the mmap() fault handler patches in 3.12? And in 3.11, deny
> > > mmap() on s390, forcing makedumpfile to fall back on the read() interface. That
> > > way there will be no regression, and the mmap() related speedup will show up
> > > in the next release on s390.
> > 
> > Hello Vivek and Hatayama,
> > 
> > But then we still would have to somehow fix the copy_oldmem_page() issue (1).
> > 
> > We would prefer to add the current patch series with "#ifndef CONFIG_S390"
> > in the fault handler.
> > 
> > @Vivek:
> > 
> > Since you are the kdump maintainer, could you tell us which of the two
> > variants you would like to have?
> > 
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > ...
> > 
> > #ifndef CONFIG_S390
> > return VM_FAULT_SIGBUS;
> > #endif
> 
> I think let's use this one: kill the process on non-s390 architectures if it
> goes into the mmap() fault handler.
> 
> copy_oldmem_page() using real mode on s390 is an issue for the vmalloc region.
> I think you should post the series again, and let's see whether Andrew is
> comfortable with it.

Ok great, I will send Andrew the current patch series.
Let's see what happens...

> I suspect it is late.
> IIUC, we do copying in real mode only to cope with HSA region in zfcpdump
> case.

The problem is that with the mmap patches we now use copy_oldmem_page()
to copy the notes from oldmem into the notes_buf which has been allocated
with vmalloc. The s390 version of copy_oldmem_page() bypasses the page
tables and uses real copy. And this does not work on vmalloc memory.

Best Regards,
Michael



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-16 Thread Vivek Goyal
On Tue, Jul 16, 2013 at 11:25:27AM +0200, Michael Holzheu wrote:

[..]
> > > Hello Vivek and Andrew,
> > > 
> > > We just realized that Hatayama's mmap patches went into v3.11-rc1.
> > > This currently breaks s390 kdump because of the following two issues:
> > > 
> > > 1) copy_oldmem_page() is now used for copying to vmalloc memory
> > > 2) The mmap() implementation is not compatible with the current
> > >    s390 crashkernel swap:
> > >    See: http://marc.info/?l=kexec&m=136940802511603&w=2
> > > 
> > > The "kdump: Introduce ELF header in new memory feature" patch series will
> > > fix both issues for s390.
> > > 
> > > There is one small open discussion left:
> > > 
> > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html
> > > 
> > > But once we have finished that, would it be possible to get the
> > > patches in 3.11?
> > 
> > How about taking the mmap() fault handler patches in 3.12? And in 3.11, deny
> > mmap() on s390, forcing makedumpfile to fall back on the read() interface. That
> > way there will be no regression, and the mmap() related speedup will show up
> > in the next release on s390.
> 
> Hello Vivek and Hatayama,
> 
> But then we still would have to somehow fix the copy_oldmem_page() issue (1).
> 
> We would prefer to add the current patch series with "#ifndef CONFIG_S390" in
> the fault handler.
> 
> @Vivek:
> 
> Since you are the kdump maintainer, could you tell us which of the two
> variants you would like to have?
> 
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> ...
> 
> #ifndef CONFIG_S390
>   return VM_FAULT_SIGBUS;
> #endif

I think let's use this one: kill the process on non-s390 architectures if it
goes into the mmap() fault handler.

copy_oldmem_page() using real mode on s390 is an issue for the vmalloc region.
I think you should post the series again, and let's see whether Andrew is
comfortable with it. I suspect it is late, though.

IIUC, we do the copying in real mode only to cope with the HSA region in the
zfcpdump case. Also, I think that in the case of zfcpdump /proc/vmcore is not
usable yet, and your patches achieve that. So, as a stopgap measure, can we
stop dropping to real mode so that regular kdump on s390 works? (I am not sure
how you currently retrieve the vmcore in the case of zfcpdump, as /dev/oldmem
is gone now.)

Thanks
Vivek


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-16 Thread HATAYAMA Daisuke

(2013/07/16 9:27), HATAYAMA Daisuke wrote:

(2013/07/15 23:20), Vivek Goyal wrote:

On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:

[..]

How about

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
 char *buf;
 int rc;

#ifndef CONFIG_S390
 return VM_FAULT_SIGBUS;
#endif
 page = find_or_create_page(mapping, index, GFP_KERNEL);

Considering it again, I don't think WARN_ONCE() is good now. The fact that a
fault occurs on an mmap()ed region indicates that some kind of buggy situation
has occurred in the process. The process should be killed as soon as possible.
If the user still wants to get a crash dump, he should try again in another
process.


I don't understand that. The process should be killed only if there was no
mapping created for the region the process is trying to access.

If there is a mapping but we are trying to fault in the actual contents,
then it is not a problem of the process. The process is accessing a region of
memory which it is supposed to access.

The potential problem here is that remap_pfn_range() did not map everything
it was expected to, so we have to resort to the page fault handler to read
that in. So it is more of a kernel issue than a process issue, and for that
WARN_ONCE() sounds better?



In the current design, there are no page faults on memory mapped by
remap_pfn_range(); it maps the whole range. If there are page faults, the page
table of the process is broken in some of its page entries. This indicates
that the process's behaviour is affected by some software or hardware bug. In
theory, the process could exhibit arbitrary behaviour. We cannot detect the
reason and recover the original sane state. The only thing we can do is kill
the process and remove the possibility of the process affecting other system
components and of the system ending up in a worse situation.



In summary, it seems that you two and I have different implementation
policies on how to deal with a process that is no longer in a healthy state.

Your idea is to try to continue the dump in a non-healthy state as long as
there is a possibility of continuing it, while my idea is to kill the process
promptly and retry the crash dump in a new process, since the process is no
longer in a healthy state and could behave arbitrarily.

The logic in non-healthy states depends on implementation policy, since there
is no obviously correct logic. I guess this discussion would not end soon.
I believe the maintainer's idea should basically have priority over others'.
So I don't object anymore, though I don't think it best at all.

--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-16 Thread Michael Holzheu
On Mon, 15 Jul 2013 10:27:08 -0400
Vivek Goyal  wrote:

> On Mon, Jul 15, 2013 at 03:44:51PM +0200, Michael Holzheu wrote:
> > On Tue, 2 Jul 2013 11:42:14 -0400
> > Vivek Goyal  wrote:
> > 
> > > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > > > For zfcpdump we can't map the HSA storage because it is only available
> > > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > > to introduce a new mechanism to create mappings on demand.
> > > > 
> > > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > > the HSA memory. For the areas that are not mapped by
> > > > remap_oldmem_pfn_range(), a new generic vmcore fault handler,
> > > > mmap_vmcore_fault(), is called.
> > > > 
> > > > This handler works as follows:
> > > > 
> > > > * Get an already available or new page from the page cache
> > > >   (find_or_create_page)
> > > > * Check if the /proc/vmcore page is filled with data (PageUptodate)
> > > > * If yes:
> > > >   Return that page
> > > > * If no:
> > > >   Fill the page using __vmcore_read(), set PageUptodate, and return the page
> > > > 
> > > > Signed-off-by: Michael Holzheu 
> > > 
> > > In general vmcore related changes look fine to me. I am not very familiar
> > > with the logic of finding pages in page cache and using page uptodate
> > > flag.
> > > 
> > > Hatayama, can you please review it.
> > > 
> > > Acked-by: Vivek Goyal 
> > 
> > Hello Vivek and Andrew,
> > 
> > We just realized that Hatayama's mmap patches went into v3.11-rc1.
> > This currently breaks s390 kdump because of the following two issues:
> > 
> > 1) copy_oldmem_page() is now used for copying to vmalloc memory
> > 2) The mmap() implementation is not compatible with the current
> >    s390 crashkernel swap:
> >    See: http://marc.info/?l=kexec&m=136940802511603&w=2
> > 
> > The "kdump: Introduce ELF header in new memory feature" patch series will
> > fix both issues for s390.
> > 
> > There is one small open discussion left:
> > 
> > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html
> > 
> > But once we have finished that, would it be possible to get the
> > patches in 3.11?
> 
> How about taking the mmap() fault handler patches in 3.12? And in 3.11, deny
> mmap() on s390, forcing makedumpfile to fall back on the read() interface. That
> way there will be no regression, and the mmap() related speedup will show up
> in the next release on s390.

Hello Vivek and Hatayama,

But then we still would have to somehow fix the copy_oldmem_page() issue (1).

We would prefer to add the current patch series with "#ifndef CONFIG_S390" in
the fault handler.

@Vivek:

Since you are the kdump maintainer, could you tell us which of the two
variants you would like to have?

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...

#ifndef CONFIG_S390
return VM_FAULT_SIGBUS;
#endif

or 

#ifndef CONFIG_S390
WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif

For all architectures besides s390 this would implement the requested behavior.

Regards,
Michael




Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-15 Thread HATAYAMA Daisuke

(2013/07/15 18:21), Martin Schwidefsky wrote:

On Sat, 13 Jul 2013 01:02:50 +0900
HATAYAMA Daisuke  wrote:


(2013/07/10 20:00), Michael Holzheu wrote:

On Wed, 10 Jul 2013 18:50:18 +0900
HATAYAMA Daisuke  wrote:

[snip]


(2013/07/10 17:42), Michael Holzheu wrote:

My suggestion is to add the WARN_ONCE() under #ifndef CONFIG_S390. This has
the same effect as your suggestion for all architectures besides s390. And for
s390 we take the risk that a programming error would result in poor
/proc/vmcore performance.



If you want to avoid looking up vmcore_list, which takes time linear in the
number of elements, you can still calculate the range of offsets in
/proc/vmcore corresponding to the HSA during /proc/vmcore initialization.

Also, could you tell me how often and how much of the HSA region is read
during crash dumping? I guess the reads from the HSA are done mainly during
the early part of the crash dumping process only. According to the code, it
is at most 64 MiB. So I feel performance is not a big issue.


Currently it is 32 MiB and normally it is read only once.



Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later
calls. I don't think it is too much overhead...


I was more concerned about in_valid_fault_range(). But I was most concerned
about the additional interface that introduces more complexity to the code,
and that just to implement a sanity check that in our opinion we don't really
need.

And what makes it even worse:



Your view that the sanity check is unnecessary is perfectly wrong. The design
is that page faults always happen on the HSA region. If a page fault happens
on the other parts, i.e. at some point of the mmap()ed region, it means that
somehow the page table for that address has not been created. This is a bug,
possibly caused by mmap() itself, page table creation, other components in the
kernel, a bit-flip due to broken hardware, etc. Anyway, the program cannot
detect what kind of bug has occurred. There is no guarantee that the program
runs safely, and of course that holds for page cache creation, too. We cannot
and must not expect such a buggy process to behave in invalid states just as
we designed. It results in undefined behaviour. The only thing we can do is
kill the buggy process as soon as possible.


I don't quite get this point, please bear with me. If you compare the
situation before and after the introduction of the fault handler, the possible
error scenarios are now almost identical:
1) If an access is made outside of the mapped memory region, the first-level
   fault handler (do_exception for s390, __do_page_fault for x86) won't find
   a vma and forces a SIGSEGV right away, independent of the existence of a
   hole and the vmcore fault handler.
2) If there is a hardware bug that corrupts a page table, the behaviour
   depends on how the entries are corrupted. If the outcome is a valid pte,
   an incorrect memory area will be accessed, the same with or without the
   vmcore fault handler. If the corrupted pte is an invalid pte, it can come
   out as a swap pte, a file pte, or an empty pte. The behaviour does not
   change for swap and file ptes; you will get funny results in both cases.
   For empty ptes, the new behaviour calls the vmcore fault handler for the
   address in question. If the read() function can satisfy the request, we
   get a page cache copy of the missing page; if the read function cannot
   satisfy the request, it returns an error which is translated to a SIGBUS.
   This new behaviour is IMHO better than the old one; it successfully
   recovers from at least one type of corruption. For x86 that would be the
   case if the page table is overwritten with zeroes; for s390 a specific bit
   pattern in the pte is required.


As you already noticed here, this scenario works well only if there is page
table corruption alone. The process could be corrupted further.


3) In the case of a programming error with regard to remap_pfn_range(), the
   new behaviour will provide page cache copies and the dump will technically
   be correct. The performance might suffer a little bit as the CPU will have
   to create the page cache copies, but compared to the I/O that is involved
   with writing a dump this is negligible, no?

It seems to me that the warning message you want to see in the fault handler
would be a debugging aid for the developer, to see if the mmap() and the
remap_pfn_range() calls match up. Something similar to a single VM_WARN_ON()
message would be appropriate, no?



A warning message is meaningless; it doesn't stop the process. I am no longer
considering debugging. It's valid to kill the process as soon as possible. If
there is a page fault at an address we don't intend, it's buggy, and so we
cannot guess how the buggy process will behave. Please consider the cases
where the system goes into a more catastrophic situation after we leave the
process running in order to get warning messages for debugging purposes.

--
Thanks.
HATAYAMA, Daisuke


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-15 Thread HATAYAMA Daisuke

(2013/07/15 23:20), Vivek Goyal wrote:

On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:

[..]

How about

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
 char *buf;
 int rc;

#ifndef CONFIG_S390
 return VM_FAULT_SIGBUS;
#endif
 page = find_or_create_page(mapping, index, GFP_KERNEL);

Considering it again, I don't think WARN_ONCE() is good now. The fact that a
fault occurs on an mmap()ed region indicates that some kind of buggy situation
has occurred in the process. The process should be killed as soon as possible.
If the user still wants to get a crash dump, he should try again in another
process.


I don't understand that. The process should be killed only if there was no
mapping created for the region the process is trying to access.

If there is a mapping but we are trying to fault in the actual contents,
then it is not a problem of the process. The process is accessing a region of
memory which it is supposed to access.

The potential problem here is that remap_pfn_range() did not map everything
it was expected to, so we have to resort to the page fault handler to read
that in. So it is more of a kernel issue than a process issue, and for that
WARN_ONCE() sounds better?



On the current design, there's no page faults on memory mapped by 
remap_pfn_range().
They map a whole range in the current design. If there are page faults, page 
table of the process
is broken in their some page entries. This indicates the process's bahaviour is 
affected by
some software/hardware bugs. In theory, process could result in arbitrary 
behaviour. We cannot
detect the reason and recover the original sane state. The only thing we can do 
is to kill
the process and drop the possibility of the process to affect other system 
components and of
system to result in worse situation.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-15 Thread Vivek Goyal
On Mon, Jul 15, 2013 at 03:44:51PM +0200, Michael Holzheu wrote:
> On Tue, 2 Jul 2013 11:42:14 -0400
> Vivek Goyal  wrote:
> 
> > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > > For zfcpdump we can't map the HSA storage because it is only available
> > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > introduced a new mechanism to create mappings on demand.
> > > 
> > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > of the HSA memory. For the areas that are not mapped by 
> > > remap_oldmem_pfn_range()
> > > a new generic vmcore fault handler mmap_vmcore_fault()
> > > is called.
> > > 
> > > This handler works as follows:
> > > 
> > > * Get already available or new page from page cache (find_or_create_page)
> > > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > > * If yes:
> > >   Return that page
> > > * If no:
> > >   Fill page using __vmcore_read(), set PageUptodate, and return page
> > > 
> > > Signed-off-by: Michael Holzheu 
> > 
> > In general vmcore related changes look fine to me. I am not very familiar
> > with the logic of finding pages in page cache and using page uptodate
> > flag.
> > 
> > Hatayama, can you please review it.
> > 
> > Acked-by: Vivek Goyal 
> 
> Hello Vivek and Andrew,
> 
> We just realized that Hatayama's mmap patches went into v3.11-rc1. This 
> currently
> breaks s390 kdump because of the following two issues:
> 
> 1) The copy_oldmem_page() is now used for copying to vmalloc memory
> 2) The mmap() implementation is not compatible with the current
>    s390 crashkernel swap:
>    See: http://marc.info/?l=kexec&m=136940802511603&w=2
> 
> The "kdump: Introduce ELF header in new memory feature" patch series will
> fix both issues for s390.
> 
> There is the one small open discussion left:
> 
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html
> 
> But once we have finished that, would it be possible to get the
> patches in 3.11?

How about taking mmap() fault handler patches in 3.12. And in 3.11, deny
mmap() on s390 forcing makedumpfile to fall back on read() interface. That
way there will be no regression and mmap() related speedup will show up
in next release on s390.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-15 Thread Vivek Goyal
On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:

[..]
> How about
> 
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> ...
> char *buf;
> int rc;
> 
> #ifndef CONFIG_S390
> return VM_FAULT_SIGBUS;
> #endif
> page = find_or_create_page(mapping, index, GFP_KERNEL);
> 
> Considering again, I don't think WARN_ONCE() is good now. The fact that fault 
> occurs on
> mmap() region indicates some kind of buggy situation occurs on the process. 
> The process
> should be killed as soon as possible. If user still wants to get crash dump, 
> he should
> try again in another process.

I don't understand that. Process should be killed only if there was no
mapping created for the region process is trying to access.

If there is a mapping but we are trying to fault in the actual contents,
then it is not a problem of process. Process is accessing a region of 
memory which it is supposed to access.

Potential problem here is that remap_pfn_range() did not map everything
it was expected to so we have to resort on page fault handler to read
that in. So it is more of a kernel issue and not process issue and for
that WARN_ONCE() sounds better?

Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-15 Thread Michael Holzheu
On Tue, 2 Jul 2013 11:42:14 -0400
Vivek Goyal  wrote:

> On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > For zfcpdump we can't map the HSA storage because it is only available
> > via a read interface. Therefore, for the new vmcore mmap feature we have
> > introduced a new mechanism to create mappings on demand.
> > 
> > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > that should be used to create mappings with remap_pfn_range() for oldmem
> > areas that can be directly mapped. For zfcpdump this is everything besides
> > of the HSA memory. For the areas that are not mapped by 
> > remap_oldmem_pfn_range()
> > a new generic vmcore fault handler mmap_vmcore_fault()
> > is called.
> > 
> > This handler works as follows:
> > 
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> >   Return that page
> > * If no:
> >   Fill page using __vmcore_read(), set PageUptodate, and return page
> > 
> > Signed-off-by: Michael Holzheu 
> 
> In general vmcore related changes look fine to me. I am not very familiar
> with the logic of finding pages in page cache and using page uptodate
> flag.
> 
> Hatayama, can you please review it.
> 
> Acked-by: Vivek Goyal 

Hello Vivek and Andrew,

We just realized that Hatayama's mmap patches went into v3.11-rc1. This 
currently
breaks s390 kdump because of the following two issues:

1) The copy_oldmem_page() is now used for copying to vmalloc memory
2) The mmap() implementation is not compatible with the current
   s390 crashkernel swap:
   See: http://marc.info/?l=kexec&m=136940802511603&w=2

The "kdump: Introduce ELF header in new memory feature" patch series will
fix both issues for s390.

There is the one small open discussion left:

http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html

But once we have finished that, would it be possible to get the
patches in 3.11?

Michael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-15 Thread Martin Schwidefsky
On Sat, 13 Jul 2013 01:02:50 +0900
HATAYAMA Daisuke  wrote:

> (2013/07/10 20:00), Michael Holzheu wrote:
> > On Wed, 10 Jul 2013 18:50:18 +0900
> > HATAYAMA Daisuke  wrote:
> >
> > [snip]
> >
> >> (2013/07/10 17:42), Michael Holzheu wrote:
> >>> My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has 
> >>> the same
> >>> effect as your suggestion for all architectures besides of s390. And for 
> >>> s390 we
> >>> take the risk that a programming error would result in poor /proc/vmcore
> >>> performance.
> >>>
> >>
> >> If you want to avoid looking up vmcore_list that takes linear time w.r.t. 
> >> the number
> >> of the elements, you can still calculate the range of offsets in 
> >> /proc/vmcore
> >> corresponding to HSA during /proc/vmcore initialization.
> >>
> >> Also, could you tell me how often and how much the HSA region is during 
> >> crash dumping?
> >> I guess the read to HSA is done mainly during early part of crash dumping 
> >> process only.
> >> According to the code, it appears at most 64MiB only. Then, I feel 
> >> performance is not
> >> a big issue.
> >
> > Currently it is 32 MiB and normally it is read only once.
> >
> >>
> >> Also, cost of WARN_ONCE() is one memory access only in the 2nd and later 
> >> calls. I don't
> >> think it too much overhead...
> >
> > I was more concerned about in_valid_fault_range(). But I was most concerned 
> > the additional
> > interface that introduces more complexity to the code. And that just to 
> > implement a
> > sanity check that in our opinion we don't really need.
> >
> > And what makes it even worse:
> >
> 
> What you think the sanity check is unnecessary is perfectly wrong. You design 
> page faults
> always happens on HSA region. If page fault happens on the other parts, i.e. 
> some point
> of mmap()ed region, it means somehow page table on the address has not been 
> created. This
> is bug, possibly caused by mmap() itself, page table creation, other 
> components in kernel,
> bit-flip due to broken hardware, etc. Anyway, program cannot detect what kind 
> of bug occurs
> now. There's no guarantee that program runs safely, of course for page cache 
> creation, too.
> We cannot and must expect such buggy process to behave in invalid states just 
> as our design.
> It results in undefined behaviour. The only thing we can do is to kill the 
> buggy process
> as soon as possible.

I don't quite get this point, please bear with me. If you compare the situation 
before and
after the introduction of the fault handler the possible error scenarios are now almost
identical:
1) If an access is made outside of the mapped memory region the first level 
fault handler
   (do_exception for s390, __do_page_fault for x86) won't find a vma and force 
a SIGSEGV
   right away, independent of the existence of a hole and the vmcore fault 
handler.
2) If there is a hardware bug that corrupts a page table the behaviour depends 
on how the
   entries are corrupted. If the outcome is a valid pte an incorrect memory 
area will be
   accessed, the same with or without the vmcore fault handler. If the 
corrupted pte is
   an invalid pte it can come out as swap pte, file pte, or as empty pte. The 
behaviour
   does not change for swap and file ptes, you will get funny results in both 
cases.
   For empty ptes the new behaviour will call the vmcore fault handler for the 
address
   in question. If the read() function can satisfy the request we will get a 
page cache
   copy of the missing page, if the read function can not satisfy the request 
it returns
   an error which is translated to a SIGBUS.
   This new behaviour is IMHO better than the old one, it successfully recovers 
from at
   least one type of corruption. For x86 that would be the case if the page 
table is
   overwritten with zeroes, for s390 a specific bit pattern in the pte is 
required.
3) In the case of a programming error in regard to remap_pfn_range the new 
behaviour will
   provide page cache copies and the dump will technically be correct. The 
performance
   might suffer a little bit as the CPU will have to create the page cache 
copies but
   compared to the I/O that is involved with writing a dump this is negligible, 
no?

It seems to me that the warning message you want to see in the fault handler 
would be
a debugging aid for the developer to see if the mmap() and the 
remap_pfn_range() calls
match up. Something similar to a single VM_WARN_ON() message would be 
appropriate, no?

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-15 Thread HATAYAMA Daisuke

(2013/07/15 18:21), Martin Schwidefsky wrote:
> On Sat, 13 Jul 2013 01:02:50 +0900
> HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
> 
[...]
> 
> I don't quite get this point, please bear with me. If you compare the situation before and
> after the introduction of the fault handler the possible error scenarios are now almost
> identical:
> 1) If an access is made outside of the mapped memory region the first level fault handler
>    (do_exception for s390, __do_page_fault for x86) won't find a vma and force a SIGSEGV
>    right away, independent of the existence of a hole and the vmcore fault handler.
> 2) If there is a hardware bug that corrupts a page table the behaviour depends on how the
>    entries are corrupted. If the outcome is a valid pte an incorrect memory area will be
>    accessed, the same with or without the vmcore fault handler. If the corrupted pte is
>    an invalid pte it can come out as swap pte, file pte, or as empty pte. The behaviour
>    does not change for swap and file ptes, you will get funny results in both cases.
>    For empty ptes the new behaviour will call the vmcore fault handler for the address
>    in question. If the read() function can satisfy the request we will get a page cache
>    copy of the missing page, if the read function can not satisfy the request it returns
>    an error which is translated to a SIGBUS.
>    This new behaviour is IMHO better than the old one, it successfully recovers from at
>    least one type of corruption. For x86 that would be the case if the page table is
>    overwritten with zeroes, for s390 a specific bit pattern in the pte is required.

As you already noticed here, this scenario works well only if there is page table
corruption and nothing else. The process could be corrupted further.

> 3) In the case of a programming error in regard to remap_pfn_range the new behaviour will
>    provide page cache copies and the dump will technically be correct. The performance
>    might suffer a little bit as the CPU will have to create the page cache copies but
>    compared to the I/O that is involved with writing a dump this is negligible, no?
> 
> It seems to me that the warning message you want to see in the fault handler would be
> a debugging aid for the developer to see if the mmap() and the remap_pfn_range() calls
> match up. Something similar to a single VM_WARN_ON() message would be appropriate, no?

A warning message is meaningless: it does not stop the process. I am no longer considering
debugging here. It's valid to kill the process as soon as possible. If there is a page
fault at an address we don't intend, it's buggy, and we cannot guess how the buggy process
will behave. Please consider the cases where the system goes into a more catastrophic
situation after we leave the process running just to get warning messages for debugging
purposes.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-12 Thread HATAYAMA Daisuke

(2013/07/10 20:00), Michael Holzheu wrote:

On Wed, 10 Jul 2013 18:50:18 +0900
HATAYAMA Daisuke  wrote:

[snip]


(2013/07/10 17:42), Michael Holzheu wrote:

My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the 
same
effect as your suggestion for all architectures besides of s390. And for s390 we
take the risk that a programming error would result in poor /proc/vmcore
performance.



If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the
number of elements, you can still calculate the range of offsets in /proc/vmcore
corresponding to the HSA during /proc/vmcore initialization.

Also, could you tell me how often and how much of the HSA region is read during
crash dumping? I guess the read from the HSA is done mainly during the early part
of the crash dumping process only. According to the code, it is at most 64MiB.
So I feel performance is not a big issue.


Currently it is 32 MiB and normally it is read only once.



Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later calls.
I don't think it is too much overhead...


I was more concerned about in_valid_fault_range(). But I was most concerned about
the additional interface, which introduces more complexity to the code, and that
just to implement a sanity check that in our opinion we don't really need.

And what makes it even worse:



What you think — that the sanity check is unnecessary — is perfectly wrong. The
design is that page faults always happen on the HSA region. If a page fault happens
on any other part, i.e. some point of the mmap()ed region, it means that somehow the
page table for that address has not been created. This is a bug, possibly caused by
mmap() itself, page table creation, other components in the kernel, a bit flip due
to broken hardware, etc. Anyway, the program cannot detect what kind of bug has
occurred. There's no guarantee that the program runs safely, and that of course
includes page cache creation. We cannot and must not expect such a buggy process to
behave in invalid states just as we designed. It results in undefined behaviour.
The only thing we can do is to kill the buggy process as soon as possible.


With the current patch series this check is only relevant for s390 :-)




So, at least for this patch series I would implement the fault handler as 
follows:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
  char *buf;
  int rc;

#ifndef CONFIG_S390
  WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
  page = find_or_create_page(mapping, index, GFP_KERNEL);

At this point I have to tell you that we plan another vmcore patch series where
the fault handler might be called also for other architectures. But I think we
should *then* discuss your issue again.



Could you explain the plan in more detail? Otherwise I cannot review correctly,
since I don't know whether there's really a use case for this generic fault handler
on other architectures. This is an issue for architectures other than s390, not
mine; right now we don't need it at all.


I would have preferred to do the things one after the other. Otherwise I fear
that this discussion will never come to an end. This patch series is needed
to get zfcpdump running with /proc/vmcore. And for that reason the patch series
makes sense for itself.

But FYI:

The other patch series deals with the problem that we have additional
information for s390 that we want to include in /proc/vmcore. We have a one
byte storage key per memory page. This storage keys are stored in the
s390 firmware and can be read using a s390 specific machine instruction.
We plan to put that information into the ELF notes section. For a 1 TiB
dump we will have 256 MiB storage keys.

Currently the notes section is preallocated in vmcore.c. Because we do
not want to preallocate so much memory we would like to read the notes
section on demand. Similar to the HSA memory for zfcpdump. To get this
work with your mmap code, we would then also use the fault handler to
get the notes sections on demand. Our current patch does this for all
notes and therefore also the other architectures would then use the
fault handler.

One advantage for the common code is that no additional memory has
to be allocated for the notes buffer.

The storage key patch series is currently not final because it depends
on the zfcpdump patch series.



Yes, on-demand memory allocation in the fault handler is definitely suitable
for note sections on all architectures. But the need for the sanity check is
unchanged here. The issue is orthogonal; the check is still needed, just as
explained above.

BTW, sanity check for note sections is easy.

if (!(elfnote->p_offset <= offset &&
      offset < elfnote->p_offset + elfnote->p_filesz))
	return VM_FAULT_SIGBUS;

I assume elfnote has a pointer to a unique PT_NOTE program header table
in the ELF header buffer. The code would be the same on every
architecture.

--
Thanks.
HATAYAMA, Daisuke


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-12 Thread HATAYAMA Daisuke

(2013/07/10 23:33), Vivek Goyal wrote:

On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:

[..]

If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the
number of elements, you can still calculate the range of offsets in /proc/vmcore
corresponding to the HSA during /proc/vmcore initialization.

Also, could you tell me how often and how much of the HSA region is read during
crash dumping? I guess the read from the HSA is done mainly during the early part
of the crash dumping process only. According to the code, it is at most 64MiB.
So I feel performance is not a big issue.

Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later calls.
I don't think it is too much overhead...


Hi Hatayama,

I think Michael's proposal of just putting in WARN_ONCE for non-s390 arches
sounds reasonable. It is simple and meets your need of being able to detect
that non-s390 arches don't make use of the mmap() path yet. Introducing
in_valid_fault_range() sounds like overkill to me for this issue.

Thanks
Vivek



How about

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	...
	char *buf;
	int rc;

#ifndef CONFIG_S390
	return VM_FAULT_SIGBUS;
#endif
	page = find_or_create_page(mapping, index, GFP_KERNEL);

Considering it again, I don't think WARN_ONCE() is good now. The fact that a fault
occurs on the mmap() region indicates that the process is in some kind of buggy
situation. The process should be killed as soon as possible. If the user still
wants to get the crash dump, he should try again in another process.

--
Thanks.
HATAYAMA, Daisuke




Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread Vivek Goyal
On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:

[..]
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the
> number of elements, you can still calculate the range of offsets in /proc/vmcore
> corresponding to the HSA during /proc/vmcore initialization.
> 
> Also, could you tell me how often and how much of the HSA region is read during
> crash dumping? I guess the read from the HSA is done mainly during the early part
> of the crash dumping process only. According to the code, it is at most 64MiB.
> So I feel performance is not a big issue.
> 
> Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later
> calls. I don't think it is too much overhead...

Hi Hatayama,

I think Michael's proposal of just putting in WARN_ONCE for non-s390 arches
sounds reasonable. It is simple and meets your need of being able to detect
that non-s390 arches don't make use of the mmap() path yet. Introducing
in_valid_fault_range() sounds like overkill to me for this issue.

Thanks
Vivek


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread Michael Holzheu
On Wed, 10 Jul 2013 18:50:18 +0900
HATAYAMA Daisuke  wrote:

[snip]

> (2013/07/10 17:42), Michael Holzheu wrote:
> > My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the
> > same effect as your suggestion for all architectures besides s390. And for
> > s390 we take the risk that a programming error would result in poor
> > /proc/vmcore performance.
> >
> 
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the
> number of elements, you can still calculate the range of offsets in /proc/vmcore
> corresponding to the HSA during /proc/vmcore initialization.
> 
> Also, could you tell me how often and how much of the HSA region is read during
> crash dumping? I guess the read from the HSA is done mainly during the early part
> of the crash dumping process only. According to the code, it is at most 64MiB.
> So I feel performance is not a big issue.

Currently it is 32 MiB and normally it is read only once.

> 
> Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later
> calls. I don't think it is too much overhead...

I was more concerned about in_valid_fault_range(). But I was most concerned about
the additional interface, which introduces more complexity to the code, and that
just to implement a sanity check that in our opinion we don't really need.

And what makes it even worse:

With the current patch series this check is only relevant for s390 :-)

> 
> > So, at least for this patch series I would implement the fault handler as 
> > follows:
> >
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault 
> > *vmf)
> > {
> > ...
> >  char *buf;
> >  int rc;
> >
> > #ifndef CONFIG_S390
> >  WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
> > #endif
> >  page = find_or_create_page(mapping, index, GFP_KERNEL);
> >
> > At this point I have to tell you that we plan another vmcore patch series 
> > where
> > the fault handler might be called also for other architectures. But I think 
> > we
> > should *then* discuss your issue again.
> >
> 
> Could you explain the plan in more detail? Otherwise I cannot review correctly,
> since I don't know whether there's really a use case for this generic fault
> handler on other architectures. This is an issue for architectures other than
> s390, not mine; right now we don't need it at all.

I would have preferred to do the things one after the other. Otherwise I fear
that this discussion will never come to an end. This patch series is needed
to get zfcpdump running with /proc/vmcore. And for that reason the patch series
makes sense for itself.

But FYI:

The other patch series deals with the problem that we have additional
information for s390 that we want to include in /proc/vmcore. We have a one
byte storage key per memory page. This storage keys are stored in the
s390 firmware and can be read using a s390 specific machine instruction.
We plan to put that information into the ELF notes section. For a 1 TiB
dump we will have 256 MiB storage keys.

Currently the notes section is preallocated in vmcore.c. Because we do
not want to preallocate so much memory we would like to read the notes
section on demand. Similar to the HSA memory for zfcpdump. To get this
work with your mmap code, we would then also use the fault handler to
get the notes sections on demand. Our current patch does this for all
notes and therefore also the other architectures would then use the
fault handler.

One advantage for the common code is that no additional memory has
to be allocated for the notes buffer.

The storage key patch series is currently not final because it depends
on the zfcpdump patch series. 

Michael



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread HATAYAMA Daisuke

(2013/07/10 17:42), Michael Holzheu wrote:

Hello Hatayama,

On Tue, 09 Jul 2013 14:49:48 +0900
HATAYAMA Daisuke  wrote:


(2013/07/08 23:28), Vivek Goyal wrote:

On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:

On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke  wrote:


[snip]


I personally prefer not to special case it for s390 only and let the
handler be generic.

If there is a bug in remap_old_pfn_range(), the only side effect is that
we will fault in the page when it is accessed, and that will be slow. BUG()
sounds excessive. At most it could be WARN_ONCE().

In regular cases for x86, this path should not even hit. So special casing
it to detect issues with remap_old_pfn_range() does not sound very good
to me. I would rather leave it as it is and if there are bugs and mmap()
slows down, then somebody needs to debug it.



I agree with WARN_ONCE(). Then we can at least notice the bug if it occurs.

Interface is like this?

[generic]

bool __weak in_valid_fault_range(pgoff_t pgoff)
{
  return false;
}

[s390]

bool in_valid_fault_range(pgoff_t pgoff)
{
  loff_t offset = pgoff << PAGE_CACHE_SHIFT;
  u64 paddr = vmcore_offset_to_paddr(offset);

  return paddr < ZFCPDUMP_HSA_SIZE;
}

assuming vmcore_offset_to_paddr() that looks up vmcore_list and returns physical
address corresponding to given offset of vmcore. I guess this could return error
value if there's no entry corresponding to given offset in vmcore_list.


I think this is too much code (and overhead) just for checking the correctness of
the kdump mmap implementation.

My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the same
effect as your suggestion for all architectures besides s390. And for s390 we take
the risk that a programming error would result in poor /proc/vmcore performance.



If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the
number of elements, you can still calculate the range of offsets in /proc/vmcore
corresponding to the HSA during /proc/vmcore initialization.

Also, could you tell me how often and how much of the HSA region is read during
crash dumping? I guess the read from the HSA is done mainly during the early part
of the crash dumping process only. According to the code, it is at most 64MiB.
So I feel performance is not a big issue.

Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later calls.
I don't think it is too much overhead...


So, at least for this patch series I would implement the fault handler as 
follows:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
 char *buf;
 int rc;

#ifndef CONFIG_S390
 WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
 page = find_or_create_page(mapping, index, GFP_KERNEL);

At this point I have to tell you that we plan another vmcore patch series where
the fault handler might be called also for other architectures. But I think we
should *then* discuss your issue again.



Could you explain the plan in more detail? Or I cannot review correctly since I 
don't
know whether there's really usecase of this generic fault handler for other
architectures. This is the issue for architectures other than s390, not mine; 
now we
don't need it at all.

--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread Michael Holzheu
Hello Hatayama,

On Tue, 09 Jul 2013 14:49:48 +0900
HATAYAMA Daisuke  wrote:

> (2013/07/08 23:28), Vivek Goyal wrote:
> > On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> >> On Mon, 08 Jul 2013 14:32:09 +0900
> >> HATAYAMA Daisuke  wrote:

[snip]

> > I personally prefer not to special case it for s390 only and let the
> > handler be generic.
> >
> > If there is a bug in remap_old_pfn_range(), the only side effect is that
> > we will fault in the page when it is accessed, and that will be slow. BUG()
> > sounds excessive. At most it could be WARN_ONCE().
> >
> > In regular cases for x86, this path should not even hit. So special casing
> > it to detect issues with remap_old_pfn_range() does not sound very good
> > to me. I would rather leave it as it is and if there are bugs and mmap()
> > slows down, then somebody needs to debug it.
> >
> 
> I agree with WARN_ONCE(). Then we can at least notice the bug if it occurs.
> 
> Interface is like this?
> 
> [generic]
> 
> bool __weak in_valid_fault_range(pgoff_t pgoff)
> {
>  return false;
> }
> 
> [s390]
> 
> bool in_valid_fault_range(pgoff_t pgoff)
> {
>  loff_t offset = pgoff << PAGE_CACHE_SHIFT;
>  u64 paddr = vmcore_offset_to_paddr(offset);
> 
>  return paddr < ZFCPDUMP_HSA_SIZE;
> }
> 
> assuming vmcore_offset_to_paddr() that looks up vmcore_list and returns 
> physical
> address corresponding to given offset of vmcore. I guess this could return 
> error
> value if there's no entry corresponding to given offset in vmcore_list.

I think this is too much code (and overhead) just for checking the correctness of
the kdump mmap implementation.

My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the same
effect as your suggestion for all architectures besides s390. And for s390 we take
the risk that a programming error would result in poor /proc/vmcore performance.

So, at least for this patch series I would implement the fault handler as 
follows:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
char *buf;
int rc;

#ifndef CONFIG_S390
WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
page = find_or_create_page(mapping, index, GFP_KERNEL);

At this point I have to tell you that we plan another vmcore patch series where
the fault handler might be called also for other architectures. But I think we
should *then* discuss your issue again.

So in order to make progress with this patch series (which is also needed to
make your mmap patches work for s390) I would suggest to use the following 
patch:
---
 fs/proc/vmcore.c   |   90 +
 include/linux/crash_dump.h |3 +
 2 files changed, 85 insertions(+), 8 deletions(-)

--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include <linux/crash_dump.h>
 #include <linux/list.h>
 #include <linux/vmalloc.h>
+#include <linux/pagemap.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(cha
return read_from_oldmem(buf, count, ppos, 0);
 }
 
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+   return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+   if (userbuf) {
+   if (copy_to_user(target, src, size))
+   return -EFAULT;
+   } else {
+   memcpy(target, src, size);
+   }
+   return 0;
+}
+
 /* Read from the ELF header and then the crash dump. On error, negative value 
is
  * returned otherwise number of bytes read are returned.
  */
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
-   size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+int userbuf)
 {
ssize_t acc = 0, tmp;
size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *
/* Read ELF core header */
if (*fpos < elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
-   if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+   if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *
 
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
-   if (copy_to_user(buffer, kaddr, tsz))
+   if (copy_to(buffer, kaddr, tsz, userbuf))
return -EFAULT;

Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread Michael Holzheu
Hello Hatayama,

On Tue, 09 Jul 2013 14:49:48 +0900
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:

 (2013/07/08 23:28), Vivek Goyal wrote:
  On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
  On Mon, 08 Jul 2013 14:32:09 +0900
  HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:

[snip]

  I personally perfer not to special case it for s390 only and let the
  handler be generic.
 
  If there is a bug in remap_old_pfn_range(), only side affect is that
  we will fault in the page when it is accessed and that will be slow. BUG()
  sounds excessive. At max it could be WARN_ONCE().
 
  In regular cases for x86, this path should not even hit. So special casing
  it to detect issues with remap_old_pfn_range() does not sound very good
  to me. I would rather leave it as it is and if there are bugs and mmap()
  slows down, then somebody needs to debug it.
 
 
 I agree to WARN_ONCE(). Then, we can notice bug at least if it occurs.
 
 Interface is like this?
 
 [generic]
 
 bool __weak in_valid_fault_range(pgoff_t pgoff)
 {
  return false;
 }
 
 [s390]
 
 bool in_valid_fault_range(pgoff_t pgoff)
 {
  loff_t offset = pgoff  PAGE_CACHE_SHIFT;
  u64 paddr = vmcore_offset_to_paddr(offset);
 
  return paddr  ZFCPDUMP_HSA_SIZE;
 }
 
 assuming vmcore_offset_to_paddr() that looks up vmcore_list and returns 
 physical
 address corresponding to given offset of vmcore. I guess this could return 
 error
 value if there's no entry corresponding to given offset in vmcore_list.

I think this is too much code (and overhead) just for checking the correctness 
the
kdump mmap implementation.

My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the 
same
effect as your suggestion for all architectures besides of s390. And for s390 we
take the risk that a programming error would result in poor /proc/vmcore
performance.

So, at least for this patch series I would implement the fault handler as 
follows:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
char *buf;
int rc;

#ifndef CONFIG_S390
WARN_ONCE(1, vmcore: Unexpected call of mmap_vmcore_fault());
#endif
page = find_or_create_page(mapping, index, GFP_KERNEL);

At this point I have to tell you that we plan another vmcore patch series where
the fault handler might be called also for other architectures. But I think we
should *then* discuss your issue again.

So in order to make progress with this patch series (which is also needed to
make your mmap patches work for s390) I would suggest to use the following 
patch:
---
 fs/proc/vmcore.c   |   90 +
 include/linux/crash_dump.h |3 +
 2 files changed, 85 insertions(+), 8 deletions(-)

--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include linux/crash_dump.h
 #include linux/list.h
 #include linux/vmalloc.h
+#include linux/pagemap.h
 #include asm/uaccess.h
 #include asm/io.h
 #include internal.h
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(cha
return read_from_oldmem(buf, count, ppos, 0);
 }
 
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+   return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+   if (userbuf) {
+   if (copy_to_user(target, src, size))
+   return -EFAULT;
+   } else {
+   memcpy(target, src, size);
+   }
+   return 0;
+}
+
 /* Read from the ELF header and then the crash dump. On error, negative value 
is
  * returned otherwise number of bytes read are returned.
  */
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
-   size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+int userbuf)
 {
ssize_t acc = 0, tmp;
size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *
/* Read ELF core header */
if (*fpos  elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
-   if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+   if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *
 
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
-   if (copy_to_user(buffer, kaddr, tsz))
+   if (copy_to(buffer, kaddr, tsz, 

Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread HATAYAMA Daisuke

(2013/07/10 17:42), Michael Holzheu wrote:

Hello Hatayama,

On Tue, 09 Jul 2013 14:49:48 +0900
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:


(2013/07/08 23:28), Vivek Goyal wrote:

On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:

On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:


[snip]


I personally prefer not to special case it for s390 only and let the
handler be generic.

If there is a bug in remap_oldmem_pfn_range(), only side effect is that
we will fault in the page when it is accessed and that will be slow. BUG()
sounds excessive. At max it could be WARN_ONCE().

In regular cases for x86, this path should not even hit. So special casing
it to detect issues with remap_oldmem_pfn_range() does not sound very good
to me. I would rather leave it as it is and if there are bugs and mmap()
slows down, then somebody needs to debug it.



I agree to WARN_ONCE(). Then, we can at least notice the bug if it occurs.

Is the interface like this?

[generic]

bool __weak in_valid_fault_range(pgoff_t pgoff)
{
  return false;
}

[s390]

bool in_valid_fault_range(pgoff_t pgoff)
{
  loff_t offset = pgoff << PAGE_CACHE_SHIFT;
  u64 paddr = vmcore_offset_to_paddr(offset);

  return paddr < ZFCPDUMP_HSA_SIZE;
}

assuming vmcore_offset_to_paddr() that looks up vmcore_list and returns physical
address corresponding to given offset of vmcore. I guess this could return error
value if there's no entry corresponding to given offset in vmcore_list.


I think this is too much code (and overhead) just for checking the correctness
of the kdump mmap implementation.

My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the
same effect as your suggestion for all architectures besides s390. And for s390
we take the risk that a programming error would result in poor /proc/vmcore
performance.



If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the
number of elements, you can still calculate the range of offsets in
/proc/vmcore corresponding to the HSA during /proc/vmcore initialization.

Also, could you tell me how often and how much of the HSA region is read
during crash dumping? I guess the read to HSA is done mainly during the early
part of the crash dumping process only. According to the code, it appears at
most 64MiB only. Then, I feel performance is not a big issue.

Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later
calls. I don't think it is too much overhead...


So, at least for this patch series I would implement the fault handler as
follows:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
 char *buf;
 int rc;

#ifndef CONFIG_S390
	WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
 page = find_or_create_page(mapping, index, GFP_KERNEL);

At this point I have to tell you that we plan another vmcore patch series where
the fault handler might be called also for other architectures. But I think we
should *then* discuss your issue again.



Could you explain the plan in more detail? Or I cannot review correctly since
I don't know whether there's really a use case of this generic fault handler
for other architectures. This is the issue for architectures other than s390,
not mine; now we don't need it at all.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread Michael Holzheu
On Wed, 10 Jul 2013 18:50:18 +0900
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:

[snip]

> (2013/07/10 17:42), Michael Holzheu wrote:
> > My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has
> > the same effect as your suggestion for all architectures besides s390.
> > And for s390 we take the risk that a programming error would result in
> > poor /proc/vmcore performance.
>
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t.
> the number of elements, you can still calculate the range of offsets in
> /proc/vmcore corresponding to the HSA during /proc/vmcore initialization.
>
> Also, could you tell me how often and how much of the HSA region is read
> during crash dumping? I guess the read to HSA is done mainly during the
> early part of the crash dumping process only. According to the code, it
> appears at most 64MiB only. Then, I feel performance is not a big issue.

Currently it is 32 MiB and normally it is read only once.

 
> Also, the cost of WARN_ONCE() is one memory access only in the 2nd and
> later calls. I don't think it is too much overhead...

I was more concerned about in_valid_fault_range(). But I was most concerned
about the additional interface that introduces more complexity to the code.
And that just to implement a sanity check that in our opinion we don't really
need.

And what makes it even worse:

With the current patch series this check is only relevant for s390 :-)

 
> > So, at least for this patch series I would implement the fault handler
> > as follows:
> >
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > ...
> > 	char *buf;
> > 	int rc;
> >
> > #ifndef CONFIG_S390
> > 	WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
> > #endif
> > 	page = find_or_create_page(mapping, index, GFP_KERNEL);
> >
> > At this point I have to tell you that we plan another vmcore patch series
> > where the fault handler might be called also for other architectures. But
> > I think we should *then* discuss your issue again.
>
> Could you explain the plan in more detail? Or I cannot review correctly
> since I don't know whether there's really a use case of this generic fault
> handler for other architectures. This is the issue for architectures other
> than s390, not mine; now we don't need it at all.

I would have preferred to do the things one after the other. Otherwise I fear
that this discussion will never come to an end. This patch series is needed
to get zfcpdump running with /proc/vmcore. And for that reason the patch series
makes sense for itself.

But FYI:

The other patch series deals with the problem that we have additional
information for s390 that we want to include in /proc/vmcore. We have a one
byte storage key per memory page. These storage keys are stored in the
s390 firmware and can be read using a s390 specific machine instruction.
We plan to put that information into the ELF notes section. For a 1 TiB
dump we will have 256 MiB storage keys.

Currently the notes section is preallocated in vmcore.c. Because we do
not want to preallocate so much memory we would like to read the notes
section on demand. Similar to the HSA memory for zfcpdump. To get this
work with your mmap code, we would then also use the fault handler to
get the notes sections on demand. Our current patch does this for all
notes and therefore also the other architectures would then use the
fault handler.

One advantage for the common code is that no additional memory has
to be allocated for the notes buffer.

The storage key patch series is currently not final because it depends
on the zfcpdump patch series. 

Michael



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-10 Thread Vivek Goyal
On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:

[..]
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t.
> the number of elements, you can still calculate the range of offsets in
> /proc/vmcore corresponding to the HSA during /proc/vmcore initialization.
>
> Also, could you tell me how often and how much of the HSA region is read
> during crash dumping? I guess the read to HSA is done mainly during the
> early part of the crash dumping process only. According to the code, it
> appears at most 64MiB only. Then, I feel performance is not a big issue.
>
> Also, the cost of WARN_ONCE() is one memory access only in the 2nd and
> later calls. I don't think it is too much overhead...

Hi Hatayama,

I think Michael's proposal of just putting in WARN_ONCE for non-s390 arch
sounds reasonable. It is simple and meets your need of being able to 
detect that non s390 arch don't make use of mmap() path yet. Introducing
in_valid_fault_range() kind of sounds overkill to me for this issue.

Thanks
Vivek


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread HATAYAMA Daisuke

(2013/07/08 23:28), Vivek Goyal wrote:

On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:

On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke  wrote:


(2013/07/02 4:32), Michael Holzheu wrote:

For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
of the HSA memory. For the areas that are not mapped by
remap_oldmem_pfn_range() a new generic vmcore fault handler
mmap_vmcore_fault() is called.



This fault handler is only for an s390 specific issue. Other architectures
don't need this for the time being.

Also, for the same reason, I'm doing this review based on source code only.
I cannot run the fault handler on a meaningful system, which is currently
s390 only.


You can test the code on other architectures if you do not map anything in
advance. For example you could just "return 0" in remap_oldmem_pfn_range():

/*
  * Architectures may override this function to map oldmem
  */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
   unsigned long from, unsigned long pfn,
   unsigned long size, pgprot_t prot)
{
 return 0;
}

In that case for all pages the new mechanism would be used.



I'm also concerned that the fault handler covers the full range of vmcore,
which could hide some kind of mmap() bug that results in a page fault.

So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time
being.


I personally do not like that, but if Vivek and you prefer this, of course we
can do that.

What about something like:

#ifdef CONFIG_S390
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
}
#else
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
BUG();
}
#endif


I personally prefer not to special case it for s390 only and let the
handler be generic.

If there is a bug in remap_oldmem_pfn_range(), only side effect is that
we will fault in the page when it is accessed and that will be slow. BUG()
sounds excessive. At max it could be WARN_ONCE().

In regular cases for x86, this path should not even hit. So special casing
it to detect issues with remap_oldmem_pfn_range() does not sound very good
to me. I would rather leave it as it is and if there are bugs and mmap()
slows down, then somebody needs to debug it.



I agree to WARN_ONCE(). Then, we can at least notice the bug if it occurs.

Is the interface like this?

[generic]

bool __weak in_valid_fault_range(pgoff_t pgoff)
{
return false;
}

[s390]

bool in_valid_fault_range(pgoff_t pgoff)
{
loff_t offset = pgoff << PAGE_CACHE_SHIFT;
u64 paddr = vmcore_offset_to_paddr(offset);

return paddr < ZFCPDUMP_HSA_SIZE;
}

assuming vmcore_offset_to_paddr() that looks up vmcore_list and returns physical
address corresponding to given offset of vmcore. I guess this could return error
value if there's no entry corresponding to given offset in vmcore_list.

--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread HATAYAMA Daisuke

(2013/07/08 18:28), Michael Holzheu wrote:

On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke  wrote:


(2013/07/02 4:32), Michael Holzheu wrote:

For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
of the HSA memory. For the areas that are not mapped by
remap_oldmem_pfn_range() a new generic vmcore fault handler
mmap_vmcore_fault() is called.



This fault handler is only for an s390 specific issue. Other architectures
don't need this for the time being.

Also, for the same reason, I'm doing this review based on source code only.
I cannot run the fault handler on a meaningful system, which is currently
s390 only.


You can test the code on other architectures if you do not map anything in
advance. For example you could just "return 0" in remap_oldmem_pfn_range():

/*
  * Architectures may override this function to map oldmem
  */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
   unsigned long from, unsigned long pfn,
   unsigned long size, pgprot_t prot)
{
 return 0;
}

In that case for all pages the new mechanism would be used.



I meant without modifying source code at all. You say I need to define some 
function.





+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct address_space *mapping = vma->vm_file->f_mapping;
+   pgoff_t index = vmf->pgoff;
+   struct page *page;
+   loff_t src;
+   char *buf;
+   int rc;
+


You should check whether the faulting address points to a valid range.
If the fault happens on an invalid range, return VM_FAULT_SIGBUS.

In the s390 case, I think the range except for the HSA should be thought of
as invalid.



Hmmm, this would require another architecture backend interface. If we get a
fault for a page outside of the HSA this would be a programming error in our
code. Not sure if we should introduce new architecture interfaces just for
sanity checks.



I think you need to introduce the backend interface since it's a bug if it
happens. The current implementation hides such an erroneous path due to the
generic implementation.

I also don't think it's a big change from this version since you have already
been about to introduce several backend interfaces.


+   page = find_or_create_page(mapping, index, GFP_KERNEL);
+   if (!page)
+   return VM_FAULT_OOM;
+   if (!PageUptodate(page)) {
+   src = index << PAGE_CACHE_SHIFT;


src = (loff_t)index << PAGE_CACHE_SHIFT;

loff_t has long long while index has unsigned long.


Ok.


On s390 both might have the same byte length.


On s390x both are 64 bit. On our 32 bit s390 architecture long long is 64 bit
and unsigned long 32 bit.


Also I prefer offset to src, but this is minor suggestion.


Yes, I agree.




+   buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);


I found page_to_virt() macro.


Looks like page_to_virt() is not usable on most architectures and probably
pfn_to_kaddr(pfn) would be the correct thing here. Unfortunately it is also
not defined on s390.

But when I discussed your comment with Martin, we found out that the current

buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);

is not correct on x86. It should be:

buf = __va((page_to_pfn(page) << PAGE_SHIFT));



It seems OK for this.

--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread Vivek Goyal
On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> On Mon, 08 Jul 2013 14:32:09 +0900
> HATAYAMA Daisuke  wrote:
> 
> > (2013/07/02 4:32), Michael Holzheu wrote:
> > > For zfcpdump we can't map the HSA storage because it is only available
> > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > introduce a new mechanism to create mappings on demand.
> > >
> > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > of the HSA memory. For the areas that are not mapped by
> > > remap_oldmem_pfn_range() a new generic vmcore fault handler
> > > mmap_vmcore_fault() is called.
> > >
> > 
> > This fault handler is only for an s390 specific issue. Other architectures
> > don't need this for the time being.
> > 
> > Also, for the same reason, I'm doing this review based on source code only.
> > I cannot run the fault handler on a meaningful system, which is currently
> > s390 only.
> 
> You can test the code on other architectures if you do not map anything in 
> advance.
> For example you could just "return 0" in remap_oldmem_pfn_range():
> 
> /*
>  * Architectures may override this function to map oldmem
>  */
> int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
>   unsigned long from, unsigned long pfn,
>   unsigned long size, pgprot_t prot)
> {
> return 0;
> }
> 
> In that case for all pages the new mechanism would be used.
> 
> > 
> > I'm also concerned about the fault handler covers a full range of vmcore, 
> > which
> > could hide some kind of mmap() bug that results in page fault.
> > 
> > So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time 
> > being.
> 
> I personally do not like that, but if Vivek and you prefer this, of course we
> can do that.
> 
> What about something like:
> 
> #ifdef CONFIG_S390
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> ...
> }
> #else
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
>   BUG();
> }
> #endif

I personally prefer not to special case it for s390 only and let the
handler be generic.

If there is a bug in remap_oldmem_pfn_range(), only side effect is that
we will fault in the page when it is accessed and that will be slow. BUG()
sounds excessive. At max it could be WARN_ONCE().

In regular cases for x86, this path should not even hit. So special casing
it to detect issues with remap_oldmem_pfn_range() does not sound very good
to me. I would rather leave it as it is and if there are bugs and mmap()
slows down, then somebody needs to debug it.

Thanks
Vivek


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread Michael Holzheu
On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke  wrote:

> (2013/07/02 4:32), Michael Holzheu wrote:
> > For zfcpdump we can't map the HSA storage because it is only available
> > via a read interface. Therefore, for the new vmcore mmap feature we have
> > introduce a new mechanism to create mappings on demand.
> >
> > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > that should be used to create mappings with remap_pfn_range() for oldmem
> > areas that can be directly mapped. For zfcpdump this is everything besides
> > of the HSA memory. For the areas that are not mapped by
> > remap_oldmem_pfn_range() a new generic vmcore fault handler
> > mmap_vmcore_fault() is called.
> >
> 
> This fault handler is only for an s390 specific issue. Other architectures
> don't need this for the time being.
> 
> Also, for the same reason, I'm doing this review based on source code only.
> I cannot run the fault handler on a meaningful system, which is currently
> s390 only.

You can test the code on other architectures if you do not map anything in
advance. For example you could just "return 0" in remap_oldmem_pfn_range():

/*
 * Architectures may override this function to map oldmem
 */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
  unsigned long from, unsigned long pfn,
  unsigned long size, pgprot_t prot)
{
return 0;
}

In that case for all pages the new mechanism would be used.

> 
> I'm also concerned about the fault handler covers a full range of vmcore, 
> which
> could hide some kind of mmap() bug that results in page fault.
> 
> So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time 
> being.

I personally do not like that, but if Vivek and you prefer this, of course we
can do that.

What about something like:

#ifdef CONFIG_S390
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
}
#else
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
BUG();
}
#endif

> 
> > This handler works as follows:
> >
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> >Return that page
> > * If no:
> >Fill page using __vmcore_read(), set PageUptodate, and return page
> >
> 
> It seems good to me on this page-in logic.
> 
> 
> > @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char 
> > __user *buffer,
> > return acc;
> >   }
> >
> > +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> > +  size_t buflen, loff_t *fpos)
> > +{
> > +   return __read_vmcore(buffer, buflen, fpos, 1);
> > +}
> > +
> > +/*
> > + * The vmcore fault handler uses the page cache and fills data using the
> > + * standard __vmcore_read() function.
> > + */
> 
> Could you describe usecase of this fault handler on s390?

ok.

> 
> > +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault 
> > *vmf)
> > +{
> > +   struct address_space *mapping = vma->vm_file->f_mapping;
> > +   pgoff_t index = vmf->pgoff;
> > +   struct page *page;
> > +   loff_t src;
> > +   char *buf;
> > +   int rc;
> > +
> 
> You should check whether the faulting address points to a valid range.
> If the fault happens on an invalid range, return VM_FAULT_SIGBUS.
> 
> In the s390 case, I think the range except for the HSA should be thought of
> as invalid.
> 

Hmmm, this would require another architecture backend interface. If we get a 
fault
for a page outside of the HSA this would be a programming error in our code. Not
sure if we should introduce new architecture interfaces just for sanity checks.

> > +   page = find_or_create_page(mapping, index, GFP_KERNEL);
> > +   if (!page)
> > +   return VM_FAULT_OOM;
> > +   if (!PageUptodate(page)) {
> > +   src = index << PAGE_CACHE_SHIFT;
> 
> src = (loff_t)index << PAGE_CACHE_SHIFT;
> 
> loff_t has long long while index has unsigned long.

Ok.

> On s390 both might have the same byte length.

On s390x both are 64 bit. On our 32 bit s390 architecture long long is 64 bit
and unsigned long 32 bit.

> Also I prefer offset to src, but this is minor suggestion.

Yes, I agree.

> 
> > +   buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
> 
> I found page_to_virt() macro.

Looks like page_to_virt() is not usable on most architectures and probably
pfn_to_kaddr(pfn) would be the correct thing here. Unfortunately is also not
defined on s390.

But when I discussed your comment with Martin, we found out that the current

buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);

is not correct on x86. It should be:

buf = __va((page_to_pfn(page) << PAGE_SHIFT));

So would be the following patch ok for you?
---
 fs/proc/vmcore.c   |   94 +
 include/linux/crash_dump.h |3 +
 2 files 

Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread Michael Holzheu
On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:

 (2013/07/02 4:32), Michael Holzheu wrote:
  For zfcpdump we can't map the HSA storage because it is only available
  via a read interface. Therefore, for the new vmcore mmap feature we have
  introduce a new mechanism to create mappings on demand.
 
  This patch introduces a new architecture function remap_oldmem_pfn_range()
  that should be used to create mappings with remap_pfn_range() for oldmem
  areas that can be directly mapped. For zfcpdump this is everything besides
  of the HSA memory. For the areas that are not mapped by 
  remap_oldmem_pfn_range()
  a generic vmcore a new generic vmcore fault handler mmap_vmcore_fault()
  is called.
 
 
 This fault handler is only for s390 specific issue. Other architectures don't 
 need
 this for the time being.
 
 Also, from the same reason, I'm doing this review based on source code only.
 I cannot run the fault handler on meaningful system, which is currently s390 
 only.

You can test the code on other architectures if you do not map anything in 
advance.
For example you could just return 0 in remap_oldmem_pfn_range():

/*
 * Architectures may override this function to map oldmem
 */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
  unsigned long from, unsigned long pfn,
  unsigned long size, pgprot_t prot)
{
return 0;
}

In that case for all pages the new mechanism would be used.

 
 I'm also concerned about the fault handler covers a full range of vmcore, 
 which
 could hide some kind of mmap() bug that results in page fault.
 
 So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time 
 being.

I personally do not like that, but if Vivek and you prefer this, of course we
can do that.

What about something like:

#ifdef CONFIG_S390
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
}
#else
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
BUG();
}
#endif

 
  This handler works as follows:
 
  * Get already available or new page from page cache (find_or_create_page)
  * Check if /proc/vmcore page is filled with data (PageUptodate)
  * If yes:
 Return that page
  * If no:
 Fill page using __vmcore_read(), set PageUptodate, and return page
 
 
 It seems good to me on this page-in logic.
 
 cut
  @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char 
  __user *buffer,
  return acc;
}
 
  +static ssize_t read_vmcore(struct file *file, char __user *buffer,
  +  size_t buflen, loff_t *fpos)
  +{
  +   return __read_vmcore(buffer, buflen, fpos, 1);
  +}
  +
  +/*
  + * The vmcore fault handler uses the page cache and fills data using the
  + * standard __vmcore_read() function.
  + */
 
 Could you describe usecase of this fault handler on s390?

ok.

 
  +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault 
  *vmf)
  +{
  +   struct address_space *mapping = vma-vm_file-f_mapping;
  +   pgoff_t index = vmf-pgoff;
  +   struct page *page;
  +   loff_t src;
  +   char *buf;
  +   int rc;
  +
 
 You should check where faulting address points to valid range.
 If the fault happens on invalid range, return VM_FAULT_SIGBUS.
 
 On s390 case, I think the range except for HSA should be thought of as 
 invalid.
 

Hmmm, this would require another architecture backend interface. If we get a 
fault
for a page outside of the HSA this would be a programming error in our code. Not
sure if we should introduce new architecture interfaces just for sanity checks.

  +   page = find_or_create_page(mapping, index, GFP_KERNEL);
  +   if (!page)
  +   return VM_FAULT_OOM;
  +   if (!PageUptodate(page)) {
  +   src = index  PAGE_CACHE_SHIFT;
 
 src = (loff_t)index  PAGE_CACHE_SHIFT;
 
 loff_t has long long while index has unsigned long.

Ok.

 On s390 both might have the same byte length.

On s390x both are 64 bit. On our 32 bit s390 archtecture long long is 64 bit
and unsigned long 32 bit.

 Also I prefer offset to src, but this is minor suggestion.

Yes, I agree.

 
  +   buf = (void *) (page_to_pfn(page)  PAGE_SHIFT);
 
 I found page_to_virt() macro.

Looks like page_to_virt() is not usable on most architectures and probably
pfn_to_kaddr(pfn) would be the correct thing here. Unfortunately is also not
defined on s390.

But when I discussed your comment with Martin, we found out that the current

buf = (void *) (page_to_pfn(page)  PAGE_SHIFT);

is not correct on x86. It should be:

buf = __va((page_to_pfn(page)  PAGE_SHIFT));

So would be the following patch ok for you?
---
 fs/proc/vmcore.c   |   94 +
 include/linux/crash_dump.h |3 +
 2 files changed, 89 insertions(+), 8 deletions(-)

--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include 

Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread Vivek Goyal
On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
 On Mon, 08 Jul 2013 14:32:09 +0900
 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:
 
  (2013/07/02 4:32), Michael Holzheu wrote:
   For zfcpdump we can't map the HSA storage because it is only available
   via a read interface. Therefore, for the new vmcore mmap feature we have
   introduce a new mechanism to create mappings on demand.
  
   This patch introduces a new architecture function remap_oldmem_pfn_range()
   that should be used to create mappings with remap_pfn_range() for oldmem
   areas that can be directly mapped. For zfcpdump this is everything besides
   of the HSA memory. For the areas that are not mapped by 
   remap_oldmem_pfn_range()
   a generic vmcore a new generic vmcore fault handler mmap_vmcore_fault()
   is called.
  
  
  This fault handler is only for s390 specific issue. Other architectures 
  don't need
  this for the time being.
  
  Also, from the same reason, I'm doing this review based on source code only.
  I cannot run the fault handler on meaningful system, which is currently 
  s390 only.
 
 You can test the code on other architectures if you do not map anything in 
 advance.
 For example you could just return 0 in remap_oldmem_pfn_range():
 
 /*
  * Architectures may override this function to map oldmem
  */
 int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
   unsigned long from, unsigned long pfn,
   unsigned long size, pgprot_t prot)
 {
 return 0;
 }
 
 In that case for all pages the new mechanism would be used.
 
  
  I'm also concerned about the fault handler covers a full range of vmcore, 
  which
  could hide some kind of mmap() bug that results in page fault.
  
  So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time 
  being.
 
 I personally do not like that, but if Vivek and you prefer this, of course we
 can do that.
 
 What about something like:
 
 #ifdef CONFIG_S390
 static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 ...
 }
 #else
 static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
   BUG();
 }
 #endif

I personally perfer not to special case it for s390 only and let the
handler be generic. 

If there is a bug in remap_old_pfn_range(), only side affect is that
we will fault in the page when it is accessed and that will be slow. BUG()
sounds excessive. At max it could be WARN_ONCE().

In regular cases for x86, this path should not even hit. So special casing
it to detect issues with remap_old_pfn_range() does not sound very good
to me. I would rather leave it as it is and if there are bugs and mmap()
slows down, then somebody needs to debug it.

Thanks
Vivek
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread HATAYAMA Daisuke

(2013/07/08 18:28), Michael Holzheu wrote:

On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:


(2013/07/02 4:32), Michael Holzheu wrote:

For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
introduced a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range()
a new generic vmcore fault handler mmap_vmcore_fault() is called.



This fault handler is only for an s390-specific issue. Other architectures
don't need this for the time being.

Also, for the same reason, I'm doing this review based on source code only.
I cannot run the fault handler on a meaningful system, which is currently
s390 only.


You can test the code on other architectures if you do not map anything in 
advance.
For example you could just return 0 in remap_oldmem_pfn_range():

/*
  * Architectures may override this function to map oldmem
  */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
   unsigned long from, unsigned long pfn,
   unsigned long size, pgprot_t prot)
{
 return 0;
}

In that case for all pages the new mechanism would be used.



I meant without modifying source code at all. You say I need to define some 
function.

cut



+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct address_space *mapping = vma->vm_file->f_mapping;
+   pgoff_t index = vmf->pgoff;
+   struct page *page;
+   loff_t src;
+   char *buf;
+   int rc;
+


You should check whether the faulting address points to a valid range.
If the fault happens on an invalid range, return VM_FAULT_SIGBUS.

In the s390 case, I think the range except for the HSA should be thought of as invalid.



Hmmm, this would require another architecture backend interface. If we get a 
fault
for a page outside of the HSA this would be a programming error in our code. Not
sure if we should introduce new architecture interfaces just for sanity checks.



I think you need to introduce the backend interface since it's a bug if it
happens.
The current implementation hides such an erroneous path due to the generic
implementation.

I also don't think it's a big change from this version since you are already
introducing several backend interfaces.


+   page = find_or_create_page(mapping, index, GFP_KERNEL);
+   if (!page)
+   return VM_FAULT_OOM;
+   if (!PageUptodate(page)) {
+   src = index << PAGE_CACHE_SHIFT;


src = (loff_t)index << PAGE_CACHE_SHIFT;

loff_t has long long while index has unsigned long.


Ok.


On s390 both might have the same byte length.


On s390x both are 64 bit. On our 32 bit s390 architecture long long is 64 bit
and unsigned long 32 bit.
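The overflow being discussed can be demonstrated with fixed-width types (uint32_t standing in for a 32-bit unsigned long, int64_t for loff_t, and 12 for PAGE_CACHE_SHIFT; all names here are illustrative):

```c
#include <stdint.h>

/* Shift performed in 32 bits: the high bits are lost before widening. */
static int64_t offset_wrong(uint32_t index)
{
    uint32_t narrow = (uint32_t)((uint64_t)index << 12);  /* truncated result */
    return (int64_t)narrow;
}

/* Widen first, then shift: no truncation (the reviewer's suggested cast). */
static int64_t offset_right(uint32_t index)
{
    return (int64_t)index << 12;
}
```

For a dump larger than 4 GiB, index 0x100000 (offset 2^32) already differs between the two variants, which is exactly why the (loff_t) cast matters on 32-bit builds.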


Also I prefer offset to src, but this is minor suggestion.


Yes, I agree.




+   buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);


I found page_to_virt() macro.


Looks like page_to_virt() is not usable on most architectures and probably
pfn_to_kaddr(pfn) would be the correct thing here. Unfortunately is also not
defined on s390.

But when I discussed your comment with Martin, we found out that the current

buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);

is not correct on x86. It should be:

buf = __va((page_to_pfn(page) << PAGE_SHIFT));



It seems OK for this.

--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-08 Thread HATAYAMA Daisuke

(2013/07/08 23:28), Vivek Goyal wrote:

On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:

On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com wrote:


(2013/07/02 4:32), Michael Holzheu wrote:

For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
introduced a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range()
a new generic vmcore fault handler mmap_vmcore_fault() is called.



This fault handler is only for an s390-specific issue. Other architectures
don't need this for the time being.

Also, for the same reason, I'm doing this review based on source code only.
I cannot run the fault handler on a meaningful system, which is currently
s390 only.


You can test the code on other architectures if you do not map anything in 
advance.
For example you could just return 0 in remap_oldmem_pfn_range():

/*
  * Architectures may override this function to map oldmem
  */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
   unsigned long from, unsigned long pfn,
   unsigned long size, pgprot_t prot)
{
 return 0;
}

In that case for all pages the new mechanism would be used.



I'm also concerned about the fault handler covers a full range of vmcore, which
could hide some kind of mmap() bug that results in page fault.

So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time 
being.


I personally do not like that, but if Vivek and you prefer this, of course we
can do that.

What about something like:

#ifdef CONFIG_S390
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
}
#else
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
BUG();
}
#endif


I personally prefer not to special case it for s390 only and let the
handler be generic.

If there is a bug in remap_oldmem_pfn_range(), the only side effect is that
we will fault in the page when it is accessed and that will be slow. BUG()
sounds excessive. At most it could be WARN_ONCE().

In regular cases for x86, this path should not even be hit. So special casing
it to detect issues with remap_oldmem_pfn_range() does not sound very good
to me. I would rather leave it as it is and if there are bugs and mmap()
slows down, then somebody needs to debug it.



I agree with WARN_ONCE(). Then we can at least notice the bug if it occurs.

Interface is like this?

[generic]

bool __weak in_valid_fault_range(pgoff_t pgoff)
{
return false;
}

[s390]

bool in_valid_fault_range(pgoff_t pgoff)
{
loff_t offset = pgoff << PAGE_CACHE_SHIFT;
u64 paddr = vmcore_offset_to_paddr(offset);

return paddr < ZFCPDUMP_HSA_SIZE;
}

assuming vmcore_offset_to_paddr() that looks up vmcore_list and returns the physical
address corresponding to a given offset of vmcore. I guess this could return an error
value if there's no entry corresponding to the given offset in vmcore_list.
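A userspace sketch of that interface, under the simplifying assumption of an identity pgoff-to-paddr mapping (the real code would translate through vmcore_list; the `_demo` names and the 32 MiB HSA size are made up for illustration), could look like:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_DEMO 12
#define HSA_SIZE_DEMO   (32ULL * 1024 * 1024)   /* assumed 32 MiB HSA */

/* Faults are valid only while the offset still falls inside the HSA. */
static bool in_valid_fault_range_demo(uint64_t pgoff)
{
    /* Identity mapping assumed for the sketch; real code walks vmcore_list. */
    uint64_t paddr = pgoff << PAGE_SHIFT_DEMO;

    return paddr < HSA_SIZE_DEMO;
}
```

The fault handler would return VM_FAULT_SIGBUS when this predicate is false, instead of silently paging the data in.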

--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-07 Thread HATAYAMA Daisuke

(2013/07/02 4:32), Michael Holzheu wrote:

For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
introduced a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range()
a new generic vmcore fault handler mmap_vmcore_fault() is called.



This fault handler is only for an s390-specific issue. Other architectures
don't need this for the time being.

Also, for the same reason, I'm doing this review based on source code only.
I cannot run the fault handler on a meaningful system, which is currently
s390 only.

I'm also concerned about the fault handler covers a full range of vmcore, which
could hide some kind of mmap() bug that results in page fault.

So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time 
being.


This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
   Return that page
* If no:
   Fill page using __vmcore_read(), set PageUptodate, and return page



It seems good to me on this page-in logic.
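The page-in behaviour described above can be illustrated from userspace, with a temporary file standing in for /proc/vmcore (`demo_fault_in` and the payload string are made-up names; this exercises only generic mmap semantics, not the patch itself):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int demo_fault_in(void)
{
    char path[] = "/tmp/vmcore-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                     /* file lives on until fd is closed */

    static const char payload[] = "ELF-note-data";
    if (write(fd, payload, sizeof(payload)) != (ssize_t)sizeof(payload)) {
        close(fd);
        return -1;
    }

    /* mmap() itself reads no data; it only sets up the mapping. */
    char *map = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The first access faults the page in, the moment a .fault handler runs. */
    int ok = (strcmp(map, payload) == 0);

    munmap(map, 4096);
    close(fd);
    return ok ? 0 : -1;
}
```

This is exactly the point where mmap_vmcore_fault() would be invoked for a /proc/vmcore mapping whose pages were not pre-populated by remap_oldmem_pfn_range().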



@@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user 
*buffer,
return acc;
  }

+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+  size_t buflen, loff_t *fpos)
+{
+   return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ */


Could you describe the use case of this fault handler on s390?


+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct address_space *mapping = vma->vm_file->f_mapping;
+   pgoff_t index = vmf->pgoff;
+   struct page *page;
+   loff_t src;
+   char *buf;
+   int rc;
+


You should check whether the faulting address points to a valid range.
If the fault happens on an invalid range, return VM_FAULT_SIGBUS.

In the s390 case, I think the range except for the HSA should be thought of as invalid.


+   page = find_or_create_page(mapping, index, GFP_KERNEL);
+   if (!page)
+   return VM_FAULT_OOM;
+   if (!PageUptodate(page)) {
+   src = index << PAGE_CACHE_SHIFT;


src = (loff_t)index << PAGE_CACHE_SHIFT;

loff_t has long long while index has unsigned long.
On s390 both might have the same byte length.

Also I prefer offset to src, but this is minor suggestion.


+   buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);


I found page_to_virt() macro.


+   rc = __read_vmcore(buf, PAGE_SIZE, &src, 0);
+   if (rc < 0) {
+   unlock_page(page);
+   page_cache_release(page);
+   return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+   }
+   SetPageUptodate(page);
+   }
+   unlock_page(page);
+   vmf->page = page;
+   return 0;
+}
+
+static const struct vm_operations_struct vmcore_mmap_ops = {
+   .fault = mmap_vmcore_fault,
+};
+
  static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
  {
size_t size = vma->vm_end - vma->vm_start;
@@ -242,6 +309,7 @@ static int mmap_vmcore(struct file *file, struct 
vm_area_struct *vma)

vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
vma->vm_flags |= VM_MIXEDMAP;
+   vma->vm_ops = &vmcore_mmap_ops;

len = 0;



--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-03 Thread Vivek Goyal
On Wed, Jul 03, 2013 at 03:59:33PM +0200, Michael Holzheu wrote:
> On Tue, 2 Jul 2013 11:42:14 -0400
> Vivek Goyal  wrote:
> 
> > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> 
> [snip]
> 
> > > This handler works as follows:
> > > 
> > > * Get already available or new page from page cache (find_or_create_page)
> > > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > > * If yes:
> > >   Return that page
> > > * If no:
> > >   Fill page using __vmcore_read(), set PageUptodate, and return page
> > > 
> > > Signed-off-by: Michael Holzheu holz...@linux.vnet.ibm.com
> > 
> > In general vmcore related changes look fine to me. I am not very familiar
> > with the logic of finding pages in page cache and using page uptodate
> > flag.
> > 
> > Hatayama, can you please review it.
> > 
> > Acked-by: Vivek Goyal vgo...@redhat.com
> 
> Hello Vivek,
> 
> If Hatayama accepts the patch, should we then send the patch series to
> Andrew Morton?

Yes, once Hatayama reviews the patch, please send it to Andrew Morton for
inclusion of the patches.

Thanks
Vivek


Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-03 Thread Michael Holzheu
On Tue, 2 Jul 2013 11:42:14 -0400
Vivek Goyal  wrote:

> On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:

[snip]

> > This handler works as follows:
> > 
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> >   Return that page
> > * If no:
> >   Fill page using __vmcore_read(), set PageUptodate, and return page
> > 
> > Signed-off-by: Michael Holzheu holz...@linux.vnet.ibm.com
> 
> In general vmcore related changes look fine to me. I am not very familiar
> with the logic of finding pages in page cache and using page uptodate
> flag.
> 
> Hatayama, can you please review it.
> 
> Acked-by: Vivek Goyal vgo...@redhat.com

Hello Vivek,

If Hatayama accepts the patch, should we then send the patch series to
Andrew Morton?

Michael



Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-02 Thread Vivek Goyal
On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> For zfcpdump we can't map the HSA storage because it is only available
> via a read interface. Therefore, for the new vmcore mmap feature we have
> introduced a new mechanism to create mappings on demand.
> 
> This patch introduces a new architecture function remap_oldmem_pfn_range()
> that should be used to create mappings with remap_pfn_range() for oldmem
> areas that can be directly mapped. For zfcpdump this is everything besides
> the HSA memory. For the areas that are not mapped by
> remap_oldmem_pfn_range()
> a new generic vmcore fault handler mmap_vmcore_fault()
> is called.
> 
> This handler works as follows:
> 
> * Get already available or new page from page cache (find_or_create_page)
> * Check if /proc/vmcore page is filled with data (PageUptodate)
> * If yes:
>   Return that page
> * If no:
>   Fill page using __vmcore_read(), set PageUptodate, and return page
> 
> Signed-off-by: Michael Holzheu holz...@linux.vnet.ibm.com

In general vmcore related changes look fine to me. I am not very familiar
with the logic of finding pages in page cache and using page uptodate
flag.

Hatayama, can you please review it.

Acked-by: Vivek Goyal vgo...@redhat.com

Thanks
Vivek

> ---
>  fs/proc/vmcore.c   | 84 
> +-
>  include/linux/crash_dump.h |  3 ++
>  2 files changed, 79 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index c28189c..77312c7 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -21,6 +21,7 @@
>  #include <linux/crash_dump.h>
>  #include <linux/list.h>
>  #include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
>  #include <asm/uaccess.h>
>  #include <asm/io.h>
>  #include "internal.h"
> @@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(char *buf, size_t 
> count, u64 *ppos)
>   return read_from_oldmem(buf, count, ppos, 0);
>  }
>  
> +/*
> + * Architectures may override this function to map oldmem
> + */
> +int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> +   unsigned long from, unsigned long pfn,
> +   unsigned long size, pgprot_t prot)
> +{
> + return remap_pfn_range(vma, from, pfn, size, prot);
> +}
> +
> +/*
> + * Copy to either kernel or user space
> + */
> +static int copy_to(void *target, void *src, size_t size, int userbuf)
> +{
> + if (userbuf) {
> + if (copy_to_user(target, src, size))
> + return -EFAULT;
> + } else {
> + memcpy(target, src, size);
> + }
> + return 0;
> +}
> +
>  /* Read from the ELF header and then the crash dump. On error, negative 
> value is
>   * returned otherwise number of bytes read are returned.
>   */
> -static ssize_t read_vmcore(struct file *file, char __user *buffer,
> - size_t buflen, loff_t *fpos)
> +static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
> +  int userbuf)
>  {
>   ssize_t acc = 0, tmp;
>   size_t tsz;
> @@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *file, char __user 
> *buffer,
>   /* Read ELF core header */
>   if (*fpos < elfcorebuf_sz) {
>   tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
> - if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
> + if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
>   return -EFAULT;
>   buflen -= tsz;
>   *fpos += tsz;
> @@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *file, char __user 
> *buffer,
>  
>   tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
>   kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
> - if (copy_to_user(buffer, kaddr, tsz))
> + if (copy_to(buffer, kaddr, tsz, userbuf))
>   return -EFAULT;
>   buflen -= tsz;
>   *fpos += tsz;
> @@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *file, char __user 
> *buffer,
>   if (*fpos < m->offset + m->size) {
>   tsz = min_t(size_t, m->offset + m->size - *fpos, 
> buflen);
>   start = m->paddr + *fpos - m->offset;
> - tmp = read_from_oldmem(buffer, tsz, &start, 1);
> + tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
>   if (tmp < 0)
>   return tmp;
>   buflen -= tsz;
> @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char 
> __user *buffer,
>   return acc;
>  }
>  
> +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> +size_t buflen, loff_t *fpos)
> +{
> + return __read_vmcore(buffer, buflen, fpos, 1);
> +}
> +
> +/*
> + * The vmcore fault handler uses the page cache and fills data using the
> + * standard __vmcore_read() function.
> + */
> +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault 
> *vmf)


[PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-01 Thread Michael Holzheu
For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
introduced a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range()
a new generic vmcore fault handler mmap_vmcore_fault()
is called.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
  Return that page
* If no:
  Fill page using __vmcore_read(), set PageUptodate, and return page

Signed-off-by: Michael Holzheu holz...@linux.vnet.ibm.com
---
 fs/proc/vmcore.c   | 84 +-
 include/linux/crash_dump.h |  3 ++
 2 files changed, 79 insertions(+), 8 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index c28189c..77312c7 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include <linux/crash_dump.h>
 #include <linux/list.h>
 #include <linux/vmalloc.h>
+#include <linux/pagemap.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(char *buf, size_t 
count, u64 *ppos)
return read_from_oldmem(buf, count, ppos, 0);
 }
 
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+   return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+   if (userbuf) {
+   if (copy_to_user(target, src, size))
+   return -EFAULT;
+   } else {
+   memcpy(target, src, size);
+   }
+   return 0;
+}
+
 /* Read from the ELF header and then the crash dump. On error, negative value 
is
  * returned otherwise number of bytes read are returned.
  */
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
-   size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+int userbuf)
 {
ssize_t acc = 0, tmp;
size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *file, char __user 
*buffer,
/* Read ELF core header */
if (*fpos < elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
-   if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+   if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
 
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
-   if (copy_to_user(buffer, kaddr, tsz))
+   if (copy_to(buffer, kaddr, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
if (*fpos < m->offset + m->size) {
tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
start = m->paddr + *fpos - m->offset;
-   tmp = read_from_oldmem(buffer, tsz, &start, 1);
+   tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
if (tmp < 0)
return tmp;
buflen -= tsz;
@@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
 }
 
+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+  size_t buflen, loff_t *fpos)
+{
+   return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __read_vmcore() function.
+ */
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct address_space *mapping = vma->vm_file->f_mapping;
+   pgoff_t index = vmf->pgoff;
+   struct page *page;
+   loff_t src;
+   char *buf;
+   int rc;
+
+   page = find_or_create_page(mapping, index, GFP_KERNEL);
+   if (!page)
+   return VM_FAULT_OOM;
+   if (!PageUptodate(page)) {
+   src = index << PAGE_CACHE_SHIFT;
+   buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
+   rc = __read_vmcore(buf, PAGE_SIZE, &src, 0);
+   if (rc < 0) {
+   unlock_page(page);
+   page_cache_release(page);
+   return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+   }
+   SetPageUptodate(page);
+   }
+   unlock_page(page);
+   vmf->page = page;
+   return 0;
+}