Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Tue, Jul 16, 2013 at 05:37:09PM +0200, Michael Holzheu wrote:

[..]

> The problem is that with the mmap patches we now use copy_oldmem_page()
> to copy the notes from oldmem into the notes_buf which has been allocated
> with vmalloc. The s390 version of copy_oldmem_page() bypasses the page
> tables and uses real copy. And this does not work on vmalloc memory.

I got that. I am trying to remember why, as of 3.10, we drop to real mode on
s390 while copying pages. I thought dropping to real mode was needed only to
access the HSA region. The HSA region comes into the picture only for
zfcpdump, and /proc/vmcore support for zfcpdump is not there yet.

IOW, what if we start copying with page tables enabled on s390? Will it not
work for the regular kdump case and take care of zfcpdump in 3.12?

Thanks
Vivek

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
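[Editor's note: Vivek's question turns on the difference between a real-mode (DAT-off) copy, which can only reach destinations in the kernel's identity mapping, and a paging-enabled copy, which can also reach vmalloc addresses. A minimal sketch of that distinction — not the actual s390 implementation; `copy_with_paging()` and `copy_real()` are hypothetical helper names — could look like this:]

```c
/*
 * Hypothetical sketch only: a copy_oldmem_page() that picks the copy
 * method based on the destination buffer. A real-mode copy bypasses
 * the page tables, so it is only safe when the destination lies in
 * the kernel's identity (linear) mapping; vmalloc and user-space
 * destinations need a copy with DAT/paging enabled.
 */
ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
			 unsigned long offset, int userbuf)
{
	unsigned long src = PFN_PHYS(pfn) + offset;

	if (userbuf || is_vmalloc_addr(buf))
		/* Destination is not identity-mapped: must translate
		 * through the page tables (hypothetical helper). */
		return copy_with_paging(src, buf, csize);

	/* Destination is in the linear mapping: a real copy is safe. */
	return copy_real(src, buf, csize);
}
```

[This is the shape of the fix the thread is circling: keep the real-mode path only where it is actually required (the HSA region for zfcpdump) and use a page-table-aware copy everywhere else.]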
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Tue, 16 Jul 2013 10:04:18 -0400 Vivek Goyal wrote:

> On Tue, Jul 16, 2013 at 11:25:27AM +0200, Michael Holzheu wrote:
> [..]
> > > > Hello Vivek and Andrew,
> > > >
> > > > We just realized that Hatayama's mmap patches went into v3.11-rc1.
> > > > This currently breaks s390 kdump because of the following two issues:
> > > >
> > > > 1) copy_oldmem_page() is now used for copying to vmalloc memory
> > > > 2) The mmap() implementation is not compatible with the current
> > > >    s390 crashkernel swap:
> > > >    See: http://marc.info/?l=kexec&m=136940802511603&w=2
> > > >
> > > > The "kdump: Introduce ELF header in new memory feature" patch series
> > > > will fix both issues for s390.
> > > >
> > > > There is one small open discussion left:
> > > >
> > > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html
> > > >
> > > > But once we have finished that, would it be possible to get the
> > > > patches into 3.11?
> > >
> > > How about taking the mmap() fault handler patches in 3.12, and in 3.11
> > > denying mmap() on s390, forcing makedumpfile to fall back on the read()
> > > interface? That way there will be no regression, and the mmap() related
> > > speedup will show up in the next release on s390.
> >
> > Hello Vivek and Hatayama,
> >
> > But then we would still have to somehow fix the copy_oldmem_page()
> > issue (1).
> >
> > We would prefer to add the current patch series with "#ifndef CONFIG_S390"
> > in the fault handler.
> >
> > @Vivek:
> >
> > Since you are the kdump maintainer, could you tell us which of the two
> > variants you would like to have?
> >
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> >         ...
> >
> > #ifndef CONFIG_S390
> >         return VM_FAULT_SIGBUS;
> > #endif
>
> I think let us use this one: kill the process on non-s390 arches if it
> goes into the mmap() fault handler.
>
> copy_oldmem_page() using real mode on s390 is an issue for the vmalloc
> region.
>
> I think post the series again and let us see if Andrew is comfortable
> with it.

Ok great, I will send Andrew the current patch series. Let's see what
happens...

> I suspect it is late.
>
> IIUC, we do copying in real mode only to cope with the HSA region in the
> zfcpdump case.

The problem is that with the mmap patches we now use copy_oldmem_page() to
copy the notes from oldmem into the notes_buf, which has been allocated with
vmalloc. The s390 version of copy_oldmem_page() bypasses the page tables and
uses a real copy. And this does not work on vmalloc memory.

Best Regards,
Michael
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Tue, Jul 16, 2013 at 11:25:27AM +0200, Michael Holzheu wrote:

[..]

> > > Hello Vivek and Andrew,
> > >
> > > We just realized that Hatayama's mmap patches went into v3.11-rc1. This
> > > currently breaks s390 kdump because of the following two issues:
> > >
> > > 1) copy_oldmem_page() is now used for copying to vmalloc memory
> > > 2) The mmap() implementation is not compatible with the current
> > >    s390 crashkernel swap:
> > >    See: http://marc.info/?l=kexec&m=136940802511603&w=2
> > >
> > > The "kdump: Introduce ELF header in new memory feature" patch series
> > > will fix both issues for s390.
> > >
> > > There is one small open discussion left:
> > >
> > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html
> > >
> > > But once we have finished that, would it be possible to get the
> > > patches into 3.11?
> >
> > How about taking the mmap() fault handler patches in 3.12, and in 3.11
> > denying mmap() on s390, forcing makedumpfile to fall back on the read()
> > interface? That way there will be no regression, and the mmap() related
> > speedup will show up in the next release on s390.
>
> Hello Vivek and Hatayama,
>
> But then we would still have to somehow fix the copy_oldmem_page()
> issue (1).
>
> We would prefer to add the current patch series with "#ifndef CONFIG_S390"
> in the fault handler.
>
> @Vivek:
>
> Since you are the kdump maintainer, could you tell us which of the two
> variants you would like to have?
>
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
>         ...
>
> #ifndef CONFIG_S390
>         return VM_FAULT_SIGBUS;
> #endif

I think let us use this one: kill the process on non-s390 arches if it goes
into the mmap() fault handler.

copy_oldmem_page() using real mode on s390 is an issue for the vmalloc
region.

I think post the series again and let us see if Andrew is comfortable with
it. I suspect it is late.

IIUC, we do copying in real mode only to cope with the HSA region in the
zfcpdump case. Also, I think that in the case of zfcpdump /proc/vmcore is not
usable yet, and your patches achieve that. So as a stop-gap measure, can we
stop dropping to real mode so that regular kdump on s390 works?

(I am not sure, in the case of zfcpdump, how you currently retrieve the
vmcore, as /dev/oldmem is gone now.)

Thanks
Vivek
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/16 9:27), HATAYAMA Daisuke wrote:
> (2013/07/15 23:20), Vivek Goyal wrote:
> > On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:
> > [..]
> > > How about
> > >
> > > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > {
> > >         ...
> > >         char *buf;
> > >         int rc;
> > >
> > > #ifndef CONFIG_S390
> > >         return VM_FAULT_SIGBUS;
> > > #endif
> > >
> > >         page = find_or_create_page(mapping, index, GFP_KERNEL);
> > >
> > > Considering again, I don't think WARN_ONCE() is good now. The fact that
> > > a fault occurs on the mmap() region indicates that some kind of buggy
> > > situation has occurred in the process. The process should be killed as
> > > soon as possible. If the user still wants to get a crash dump, he
> > > should try again in another process.
> >
> > I don't understand that. The process should be killed only if there was
> > no mapping created for the region the process is trying to access. If
> > there is a mapping but we are trying to fault in the actual contents,
> > then it is not a problem of the process. The process is accessing a
> > region of memory which it is supposed to access.
> >
> > The potential problem here is that remap_pfn_range() did not map
> > everything it was expected to, so we have to resort to the page fault
> > handler to read that in. So it is more of a kernel issue than a process
> > issue, and for that WARN_ONCE() sounds better?
>
> On the current design, there are no page faults on memory mapped by
> remap_pfn_range(); it maps the whole range in the current design. If there
> are page faults, the page table of the process is broken in some of its
> page entries. This indicates that the process's behaviour is affected by
> some software/hardware bugs. In theory, the process could result in
> arbitrary behaviour. We cannot detect the reason and recover the original
> sane state. The only thing we can do is to kill the process and drop the
> possibility of the process affecting other system components and of the
> system ending up in a worse situation.

In summary, it seems that you two and I have different implementation
policies on how to deal with a process that is no longer in a healthy state.
Your idea is to try to continue the dump in the non-healthy state as much as
possible, as long as there is a possibility of continuing it, while my idea
is to kill the process promptly and retry the crash dump in another new
process, since the process is no longer in a healthy state and could behave
arbitrarily. The logic in non-healthy states depends on implementation
policy, since there is no obviously correct logic.

I guess this discussion would not end soon. I believe the maintainer's idea
should basically have priority over others'. So I don't object anymore,
though I don't think it best at all.

--
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Mon, 15 Jul 2013 10:27:08 -0400 Vivek Goyal wrote:

> On Mon, Jul 15, 2013 at 03:44:51PM +0200, Michael Holzheu wrote:
> > On Tue, 2 Jul 2013 11:42:14 -0400
> > Vivek Goyal wrote:
> >
> > > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > > > For zfcpdump we can't map the HSA storage because it is only
> > > > available via a read interface. Therefore, for the new vmcore mmap
> > > > feature we have to introduce a new mechanism to create mappings on
> > > > demand.
> > > >
> > > > This patch introduces a new architecture function
> > > > remap_oldmem_pfn_range() that should be used to create mappings with
> > > > remap_pfn_range() for oldmem areas that can be directly mapped. For
> > > > zfcpdump this is everything besides the HSA memory. For the areas
> > > > that are not mapped by remap_oldmem_pfn_range(), a new generic
> > > > vmcore fault handler mmap_vmcore_fault() is called.
> > > >
> > > > This handler works as follows:
> > > >
> > > > * Get an already available or a new page from the page cache
> > > >   (find_or_create_page)
> > > > * Check if the /proc/vmcore page is filled with data (PageUptodate)
> > > > * If yes:
> > > >   Return that page
> > > > * If no:
> > > >   Fill the page using __vmcore_read(), set PageUptodate, and return
> > > >   the page
> > > >
> > > > Signed-off-by: Michael Holzheu
> > >
> > > In general the vmcore related changes look fine to me. I am not very
> > > familiar with the logic of finding pages in the page cache and using
> > > the page uptodate flag.
> > >
> > > Hatayama, can you please review it?
> > >
> > > Acked-by: Vivek Goyal
> >
> > Hello Vivek and Andrew,
> >
> > We just realized that Hatayama's mmap patches went into v3.11-rc1. This
> > currently breaks s390 kdump because of the following two issues:
> >
> > 1) copy_oldmem_page() is now used for copying to vmalloc memory
> > 2) The mmap() implementation is not compatible with the current
> >    s390 crashkernel swap:
> >    See: http://marc.info/?l=kexec&m=136940802511603&w=2
> >
> > The "kdump: Introduce ELF header in new memory feature" patch series
> > will fix both issues for s390.
> >
> > There is one small open discussion left:
> >
> > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html
> >
> > But once we have finished that, would it be possible to get the
> > patches into 3.11?
>
> How about taking the mmap() fault handler patches in 3.12, and in 3.11
> denying mmap() on s390, forcing makedumpfile to fall back on the read()
> interface? That way there will be no regression, and the mmap() related
> speedup will show up in the next release on s390.

Hello Vivek and Hatayama,

But then we would still have to somehow fix the copy_oldmem_page() issue (1).

We would prefer to add the current patch series with "#ifndef CONFIG_S390"
in the fault handler.

@Vivek:

Since you are the kdump maintainer, could you tell us which of the two
variants you would like to have?

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        ...

#ifndef CONFIG_S390
        return VM_FAULT_SIGBUS;
#endif

or

#ifndef CONFIG_S390
        WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif

For all architectures besides s390 this would implement the requested
behavior.

Regards,
Michael
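[Editor's note: the handler steps quoted in this mail (find_or_create_page, the PageUptodate check, the __vmcore_read() fill) combined with Michael's first variant can be sketched as follows. This is a hedged reconstruction from the description in the thread, not necessarily the code as merged; error paths are abbreviated:]

```c
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
#ifndef CONFIG_S390
	/* Faults are only expected where remap_oldmem_pfn_range() could
	 * not map directly (the HSA region on s390). On all other
	 * architectures a fault here indicates a bug: kill the task. */
	return VM_FAULT_SIGBUS;
#else
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t index = vmf->pgoff;
	struct page *page;
	loff_t offset;
	char *buf;
	int rc;

	/* Step 1: get an already available or a new page cache page. */
	page = find_or_create_page(mapping, index, GFP_KERNEL);
	if (!page)
		return VM_FAULT_OOM;

	/* Step 2: if it is not yet filled with /proc/vmcore data,
	 * read the corresponding dump contents into it. */
	if (!PageUptodate(page)) {
		offset = (loff_t)index << PAGE_SHIFT;
		buf = __va((u64)page_to_pfn(page) << PAGE_SHIFT);
		rc = __vmcore_read(buf, PAGE_SIZE, &offset, 0);
		if (rc < 0) {
			unlock_page(page);
			page_cache_release(page);
			return VM_FAULT_SIGBUS;
		}
		SetPageUptodate(page);
	}

	/* Step 3: hand the page back to the fault machinery. */
	unlock_page(page);
	vmf->page = page;
	return 0;
#endif
}
```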
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/15 18:21), Martin Schwidefsky wrote:
> On Sat, 13 Jul 2013 01:02:50 +0900 HATAYAMA Daisuke wrote:
> > (2013/07/10 20:00), Michael Holzheu wrote:
> > > On Wed, 10 Jul 2013 18:50:18 +0900 HATAYAMA Daisuke wrote:
> > >
> > > [snip]
> > >
> > > > (2013/07/10 17:42), Michael Holzheu wrote:
> > > > > My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390.
> > > > > This has the same effect as your suggestion for all architectures
> > > > > besides s390. And for s390 we take the risk that a programming
> > > > > error would result in poor /proc/vmcore performance.
> > > >
> > > > If you want to avoid looking up vmcore_list, which takes linear time
> > > > w.r.t. the number of its elements, you can still calculate the range
> > > > of offsets in /proc/vmcore corresponding to the HSA during
> > > > /proc/vmcore initialization.
> > > >
> > > > Also, could you tell me how often and how much the HSA region is
> > > > read during crash dumping? I guess the reads from the HSA are done
> > > > mainly during the early part of the crash dumping process only.
> > > > According to the code, it appears to be at most 64 MiB. So I feel
> > > > performance is not a big issue.
> > >
> > > Currently it is 32 MiB and normally it is read only once.
> > >
> > > > Also, the cost of WARN_ONCE() is one memory access only in the 2nd
> > > > and later calls. I don't think it is too much overhead...
> > >
> > > I was more concerned about in_valid_fault_range(). But I was most
> > > concerned about the additional interface that introduces more
> > > complexity to the code, and that just to implement a sanity check
> > > that in our opinion we don't really need.
> >
> > And what makes it even worse: your view that the sanity check is
> > unnecessary is perfectly wrong. By your design, page faults always
> > happen on the HSA region. If a page fault happens on any other part,
> > i.e. some point of the mmap()ed region, it means that somehow the page
> > table for that address has not been created. This is a bug, possibly
> > caused by mmap() itself, page table creation, other components in the
> > kernel, a bit flip due to broken hardware, etc. Anyway, the program
> > cannot detect what kind of bug has occurred. There is no guarantee that
> > the program runs safely; of course this holds for page cache creation,
> > too. We cannot, and must not, expect such a buggy process in an invalid
> > state to behave just as we designed. It results in undefined behaviour.
> > The only thing we can do is to kill the buggy process as soon as
> > possible.
>
> I don't quite get this point, please bear with me. If you compare the
> situation before and after the introduction of the fault handler, the
> possible error scenarios are almost identical:
>
> 1) If an access is made outside of the mapped memory region, the
>    first-level fault handler (do_exception for s390, __do_page_fault for
>    x86) won't find a vma and forces a SIGSEGV right away, independent of
>    the existence of a hole and the vmcore fault handler.
>
> 2) If there is a hardware bug that corrupts a page table, the behaviour
>    depends on how the entries are corrupted. If the outcome is a valid
>    pte, an incorrect memory area will be accessed, the same with or
>    without the vmcore fault handler. If the corrupted pte is an invalid
>    pte, it can come out as a swap pte, a file pte, or an empty pte. The
>    behaviour does not change for swap and file ptes; you will get funny
>    results in both cases. For empty ptes the new behaviour will call the
>    vmcore fault handler for the address in question. If the read()
>    function can satisfy the request, we will get a page cache copy of the
>    missing page; if the read function cannot satisfy the request, it
>    returns an error which is translated to a SIGBUS. This new behaviour
>    is IMHO better than the old one: it successfully recovers from at
>    least one type of corruption. For x86 that would be the case if the
>    page table is overwritten with zeroes; for s390 a specific bit pattern
>    in the pte is required.

As you already noticed here, this scenario works well only if there is page
table corruption and nothing else. The process could be corrupted further.

> 3) In the case of a programming error in regard to remap_pfn_range, the
>    new behaviour will provide page cache copies and the dump will
>    technically be correct. The performance might suffer a little bit, as
>    the CPU will have to create the page cache copies, but compared to the
>    I/O that is involved with writing a dump this is negligible, no?
>
> It seems to me that the warning message you want to see in the fault
> handler would be a debugging aid for the developer, to see if the mmap()
> and the remap_pfn_range() calls match up. Something similar to a single
> VM_WARN_ON() message would be appropriate, no?

A warning message is meaningless if it doesn't stop the process. I am no
longer considering debugging; it is valid to kill the process as soon as
possible. If there is a page fault at an address we don't intend, it is a
bug, and so we cannot guess how the buggy process will behave. Please
consider the cases where the system goes into a more catastrophic situation
after we leave the process running just to get warning messages for
debugging purposes.

--
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/15 23:20), Vivek Goyal wrote:
> On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:
> [..]
> > How about
> >
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> >         ...
> >         char *buf;
> >         int rc;
> >
> > #ifndef CONFIG_S390
> >         return VM_FAULT_SIGBUS;
> > #endif
> >
> >         page = find_or_create_page(mapping, index, GFP_KERNEL);
> >
> > Considering again, I don't think WARN_ONCE() is good now. The fact that
> > a fault occurs on the mmap() region indicates that some kind of buggy
> > situation has occurred in the process. The process should be killed as
> > soon as possible. If the user still wants to get a crash dump, he should
> > try again in another process.
>
> I don't understand that. The process should be killed only if there was no
> mapping created for the region the process is trying to access. If there
> is a mapping but we are trying to fault in the actual contents, then it is
> not a problem of the process. The process is accessing a region of memory
> which it is supposed to access.
>
> The potential problem here is that remap_pfn_range() did not map
> everything it was expected to, so we have to resort to the page fault
> handler to read that in. So it is more of a kernel issue than a process
> issue, and for that WARN_ONCE() sounds better?

On the current design, there are no page faults on memory mapped by
remap_pfn_range(); it maps the whole range in the current design. If there
are page faults, the page table of the process is broken in some of its page
entries. This indicates that the process's behaviour is affected by some
software/hardware bugs. In theory, the process could result in arbitrary
behaviour. We cannot detect the reason and recover the original sane state.
The only thing we can do is to kill the process and drop the possibility of
the process affecting other system components and of the system ending up in
a worse situation.

--
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Mon, Jul 15, 2013 at 03:44:51PM +0200, Michael Holzheu wrote: > On Tue, 2 Jul 2013 11:42:14 -0400 > Vivek Goyal wrote: > > > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote: > > > For zfcpdump we can't map the HSA storage because it is only available > > > via a read interface. Therefore, for the new vmcore mmap feature we have > > > introduced a new mechanism to create mappings on demand. > > > > > > This patch introduces a new architecture function remap_oldmem_pfn_range() > > > that should be used to create mappings with remap_pfn_range() for oldmem > > > areas that can be directly mapped. For zfcpdump this is everything besides > > > the HSA memory. For the areas that are not mapped by > > > remap_oldmem_pfn_range() > > > a new generic vmcore fault handler mmap_vmcore_fault() > > > is called. > > > > > > This handler works as follows: > > > > > > * Get already available or new page from page cache (find_or_create_page) > > > * Check if /proc/vmcore page is filled with data (PageUptodate) > > > * If yes: > > > Return that page > > > * If no: > > > Fill page using __vmcore_read(), set PageUptodate, and return page > > > > > > Signed-off-by: Michael Holzheu > > > > In general vmcore related changes look fine to me. I am not very familiar > > with the logic of finding pages in page cache and using page uptodate > > flag. > > > > Hatayama, can you please review it. > > > > Acked-by: Vivek Goyal > > Hello Vivek and Andrew, > > We just realized that Hatayama's mmap patches went into v3.11-rc1. This > currently > breaks s390 kdump because of the following two issues: > > 1) The copy_oldmem_page() is now used for copying to vmalloc memory > 2) The mmap() implementation is not compatible with the current >s390 crashkernel swap: >See: http://marc.info/?l=kexec&m=136940802511603&w=2 > > The "kdump: Introduce ELF header in new memory feature" patch series will > fix both issues for s390. 
> > There is the one small open discussion left: > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html > > But once we have finished that, would it be possible to get the > patches in 3.11? How about taking mmap() fault handler patches in 3.12. And in 3.11, deny mmap() on s390 forcing makedumpfile to fall back on read() interface. That way there will be no regression and mmap() related speedup will show up in next release on s390. Thanks Vivek 
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote: [..] > How about > > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > { > ... > char *buf; > int rc; > > #ifndef CONFIG_S390 > return VM_FAULT_SIGBUS; > #endif > page = find_or_create_page(mapping, index, GFP_KERNEL); > > Considering again, I don't think WARN_ONCE() is good now. The fact that fault > occurs on > mmap() region indicates some kind of buggy situation occurs on the process. > The process > should be killed as soon as possible. If user still wants to get crash dump, > he should > try again in another process. I don't understand that. Process should be killed only if there was no mapping created for the region process is trying to access. If there is a mapping but we are trying to fault in the actual contents, then it is not a problem of process. Process is accessing a region of memory which it is supposed to access. Potential problem here is that remap_pfn_range() did not map everything it was expected to so we have to resort on page fault handler to read that in. So it is more of a kernel issue and not process issue and for that WARN_ONCE() sounds better? Vivek 
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Tue, 2 Jul 2013 11:42:14 -0400 Vivek Goyal wrote: > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote: > > For zfcpdump we can't map the HSA storage because it is only available > > via a read interface. Therefore, for the new vmcore mmap feature we have > > introduced a new mechanism to create mappings on demand. > > > > This patch introduces a new architecture function remap_oldmem_pfn_range() > > that should be used to create mappings with remap_pfn_range() for oldmem > > areas that can be directly mapped. For zfcpdump this is everything besides > > the HSA memory. For the areas that are not mapped by > > remap_oldmem_pfn_range() > > a new generic vmcore fault handler mmap_vmcore_fault() > > is called. > > > > This handler works as follows: > > > > * Get already available or new page from page cache (find_or_create_page) > > * Check if /proc/vmcore page is filled with data (PageUptodate) > > * If yes: > > Return that page > > * If no: > > Fill page using __vmcore_read(), set PageUptodate, and return page > > > > Signed-off-by: Michael Holzheu > > In general vmcore related changes look fine to me. I am not very familiar > with the logic of finding pages in page cache and using page uptodate > flag. > > Hatayama, can you please review it. > > Acked-by: Vivek Goyal Hello Vivek and Andrew, We just realized that Hatayama's mmap patches went into v3.11-rc1. This currently breaks s390 kdump because of the following two issues: 1) The copy_oldmem_page() is now used for copying to vmalloc memory 2) The mmap() implementation is not compatible with the current s390 crashkernel swap: See: http://marc.info/?l=kexec&m=136940802511603&w=2 The "kdump: Introduce ELF header in new memory feature" patch series will fix both issues for s390. There is the one small open discussion left: http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg464856.html But once we have finished that, would it be possible to get the patches in 3.11? 
Michael
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Sat, 13 Jul 2013 01:02:50 +0900 HATAYAMA Daisuke wrote: > (2013/07/10 20:00), Michael Holzheu wrote: > > On Wed, 10 Jul 2013 18:50:18 +0900 > > HATAYAMA Daisuke wrote: > > > > [snip] > > > >> (2013/07/10 17:42), Michael Holzheu wrote: > >>> My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has > >>> the same > >>> effect as your suggestion for all architectures besides s390. And for > >>> s390 we > >>> take the risk that a programming error would result in poor /proc/vmcore > >>> performance. > >>> > >> > >> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. > >> the number > >> of the elements, you can still calculate the range of offsets in > >> /proc/vmcore > >> corresponding to HSA during /proc/vmcore initialization. > >> > >> Also, could you tell me how often and how much of the HSA region is read during > >> crash dumping? > >> I guess the read to HSA is done mainly during the early part of the crash dumping > >> process only. > >> According to the code, it appears at most 64MiB only. Then, I feel > >> performance is not > >> a big issue. > > > > Currently it is 32 MiB and normally it is read only once. > > > >> > >> Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later > >> calls. I don't > >> think it too much overhead... > > > > I was more concerned about in_valid_fault_range(). But I was most concerned about > > the additional > > interface that introduces more complexity to the code. And that just to > > implement a > > sanity check that in our opinion we don't really need. > > > > And what makes it even worse: > > > > What you think, that the sanity check is unnecessary, is perfectly wrong. You design > page faults > to always happen on the HSA region. If a page fault happens on the other parts, i.e. > some point > of the mmap()ed region, it means somehow the page table on the address has not been > created. 
This > is a bug, possibly caused by mmap() itself, page table creation, other > components in the kernel, > a bit-flip due to broken hardware, etc. Anyway, the program cannot detect what kind > of bug occurs > now. There's no guarantee that the program runs safely, of course for page cache > creation, too. > We cannot and must not expect such a buggy process to behave just > as our design assumes. > It results in undefined behaviour. The only thing we can do is to kill the > buggy process > as soon as possible. I don't quite get this point, please bear with me. If you compare the situation before and after the introduction of the fault handler the possible error scenarios are now almost identical: 1) If an access is made outside of the mapped memory region the first level fault handler (do_exception for s390, __do_page_fault for x86) won't find a vma and forces a SIGSEGV right away, independent of the existence of a hole and the vmcore fault handler. 2) If there is a hardware bug that corrupts a page table the behaviour depends on how the entries are corrupted. If the outcome is a valid pte an incorrect memory area will be accessed, the same with or without the vmcore fault handler. If the corrupted pte is an invalid pte it can come out as swap pte, file pte, or as empty pte. The behaviour does not change for swap and file ptes, you will get funny results in both cases. For empty ptes the new behaviour will call the vmcore fault handler for the address in question. If the read() function can satisfy the request we will get a page cache copy of the missing page, if the read function can not satisfy the request it returns an error which is translated to a SIGBUS. This new behaviour is IMHO better than the old one, it successfully recovers from at least one type of corruption. For x86 that would be the case if the page table is overwritten with zeroes, for s390 a specific bit pattern in the pte is required. 
3) In the case of a programming error in regard to remap_pfn_range the new behaviour will provide page cache copies and the dump will technically be correct. The performance might suffer a little bit as the CPU will have to create the page cache copies but compared to the I/O that is involved with writing a dump this is negligible, no? It seems to me that the warning message you want to see in the fault handler would be a debugging aid for the developer to see if the mmap() and the remap_pfn_range() calls match up. Something similar to a single VM_WARN_ON() message would be appropriate, no? -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. 
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/15 18:21), Martin Schwidefsky wrote:
[...]
> 2) If there is a hardware bug that corrupts a page table the behaviour depends on how the entries are corrupted. [...] This new behaviour is IMHO better than the old one, it successfully recovers from at least one type of corruption.

As you already noticed here, this scenario works well only if there is page table corruption and nothing else. The process could be corrupted further.

> It seems to me that the warning message you want to see in the fault handler would be a debugging aid for the developer to see if the mmap() and the remap_pfn_range() calls match up. Something similar to a single VM_WARN_ON() message would be appropriate, no?

A warning message is meaningless; it does not stop the process. I no longer consider this a debugging aid. It is valid to kill the process as soon as possible. If there is a page fault at an address we do not intend, the process is buggy, and so we cannot guess how the buggy process will behave. Please consider the cases where the system goes into a more catastrophic situation after we leave the process running just to get warning messages for debugging purposes. -- Thanks. HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/10 20:00), Michael Holzheu wrote:
[...]
> > It results in undefined behaviour. The only thing we can do is to kill the buggy process as soon as possible.
> With the current patch series this check is only relevant for s390 :-) So, at least for this patch series I would implement the fault handler as follows: static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { ... char *buf; int rc; #ifndef CONFIG_S390 WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()"); #endif page = find_or_create_page(mapping, index, GFP_KERNEL); At this point I have to tell you that we plan another vmcore patch series where the fault handler might be called also for other architectures. But I think we should *then* discuss your issue again.

Could you explain the plan in more detail? Otherwise I cannot review correctly, since I don't know whether there is really a use case of this generic fault handler for other architectures. This is an issue for architectures other than s390, not mine; right now we don't need it at all.

> I would have preferred to do the things one after the other. Otherwise I fear that this discussion will never come to an end. This patch series is needed to get zfcpdump running with /proc/vmcore, and for that reason the patch series makes sense by itself. But FYI: the other patch series deals with the problem that we have additional information for s390 that we want to include in /proc/vmcore. We have a one byte storage key per memory page. These storage keys are stored in the s390 firmware and can be read using an s390 specific machine instruction. We plan to put that information into the ELF notes section. For a 1 TiB dump we will have 256 MiB of storage keys. Currently the notes section is preallocated in vmcore.c. Because we do not want to preallocate so much memory we would like to read the notes section on demand, similar to the HSA memory for zfcpdump. To get this to work with your mmap code, we would then also use the fault handler to get the notes sections on demand. Our current patch does this for all notes and therefore the other architectures would then also use the fault handler. One advantage for the common code is that no additional memory has to be allocated for the notes buffer. The storage key patch series is currently not final because it depends on the zfcpdump patch series.

Yes, on-demand memory allocation by the fault handler is definitely suitable for note sections throughout all architectures. But the need for the sanity check is unchanged here; the issue is orthogonal. It is still needed, just as explained above. BTW, the sanity check for note sections is easy: if (!(elfnote->p_offset <= offset && offset < elfnote->p_offset + elfnote->p_filesz)) return VM_FAULT_SIGBUS; I assume elfnote is a pointer to the unique PT_NOTE program header in the ELF header buffer. The code would be the same on every architecture. -- Thanks. HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/10 23:33), Vivek Goyal wrote:
> On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:
[..]
> > If you want to avoid looking up vmcore_list, which takes linear time
> > w.r.t. the number of elements, you can still calculate the range of
> > offsets in /proc/vmcore corresponding to HSA during /proc/vmcore
> > initialization.
> >
> > Also, could you tell me how often and how much the HSA region is read
> > during crash dumping? I guess the read to HSA is done mainly during the
> > early part of the crash dumping process only. According to the code, it
> > appears to be at most 64 MiB only. Then, I feel performance is not a big
> > issue.
> >
> > Also, the cost of WARN_ONCE() is one memory access only in the 2nd and
> > later calls. I don't think it is too much overhead...
>
> Hi Hatayama,
>
> I think Michael's proposal of just putting in WARN_ONCE for non-s390 arches
> sounds reasonable. It is simple and meets your need of being able to detect
> that non-s390 arches don't make use of the mmap() path yet.
>
> Introducing in_valid_fault_range() kind of sounds like overkill to me for
> this issue.
>
> Thanks
> Vivek

How about

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        ...
        char *buf;
        int rc;

#ifndef CONFIG_S390
        return VM_FAULT_SIGBUS;
#endif
        page = find_or_create_page(mapping, index, GFP_KERNEL);

Considering again, I don't think WARN_ONCE() is good now. The fact that a
fault occurs on the mmap()ed region indicates that some kind of buggy
situation has occurred in the process. The process should be killed as soon
as possible. If the user still wants to get a crash dump, he should try again
in another process.

-- 
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:

[..]
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t.
> the number of elements, you can still calculate the range of offsets in
> /proc/vmcore corresponding to HSA during /proc/vmcore initialization.
>
> Also, could you tell me how often and how much the HSA region is read during
> crash dumping? I guess the read to HSA is done mainly during the early part
> of the crash dumping process only. According to the code, it appears to be
> at most 64 MiB only. Then, I feel performance is not a big issue.
>
> Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later
> calls. I don't think it is too much overhead...

Hi Hatayama,

I think Michael's proposal of just putting in WARN_ONCE for non-s390 arches
sounds reasonable. It is simple and meets your need of being able to detect
that non-s390 arches don't make use of the mmap() path yet.

Introducing in_valid_fault_range() kind of sounds like overkill to me for
this issue.

Thanks
Vivek
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Wed, 10 Jul 2013 18:50:18 +0900 HATAYAMA Daisuke wrote:

[snip]
> (2013/07/10 17:42), Michael Holzheu wrote:
> > My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has
> > the same effect as your suggestion for all architectures besides s390. And
> > for s390 we take the risk that a programming error would result in poor
> > /proc/vmcore performance.
>
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t.
> the number of elements, you can still calculate the range of offsets in
> /proc/vmcore corresponding to HSA during /proc/vmcore initialization.
>
> Also, could you tell me how often and how much the HSA region is read during
> crash dumping? I guess the read to HSA is done mainly during the early part
> of the crash dumping process only. According to the code, it appears to be
> at most 64 MiB only. Then, I feel performance is not a big issue.

Currently it is 32 MiB and normally it is read only once.

> Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later
> calls. I don't think it is too much overhead...

I was more concerned about in_valid_fault_range(). But I was most concerned
about the additional interface that introduces more complexity to the code.
And that just to implement a sanity check that in our opinion we don't really
need. And what makes it even worse: With the current patch series this check
is only relevant for s390 :-)

> > So, at least for this patch series I would implement the fault handler as
> > follows:
> >
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> >         ...
> >         char *buf;
> >         int rc;
> >
> > #ifndef CONFIG_S390
> >         WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
> > #endif
> >         page = find_or_create_page(mapping, index, GFP_KERNEL);
> >
> > At this point I have to tell you that we plan another vmcore patch series
> > where the fault handler might be called also for other architectures. But
> > I think we should *then* discuss your issue again.
>
> Could you explain the plan in more detail? Otherwise I cannot review
> correctly, since I don't know whether there's really a use case for this
> generic fault handler on other architectures. This is an issue for
> architectures other than s390, not mine; right now we don't need it at all.

I would have preferred to do things one after the other. Otherwise I fear
that this discussion will never come to an end. This patch series is needed
to get zfcpdump running with /proc/vmcore, and for that reason the patch
series makes sense by itself.

But FYI: The other patch series deals with the problem that we have
additional information for s390 that we want to include in /proc/vmcore. We
have a one-byte storage key per memory page. These storage keys are stored in
the s390 firmware and can be read using an s390-specific machine instruction.
We plan to put that information into the ELF notes section. For a 1 TiB dump
we will have 256 MiB of storage keys. Currently the notes section is
preallocated in vmcore.c. Because we do not want to preallocate so much
memory, we would like to read the notes section on demand, similar to the HSA
memory for zfcpdump. To get this to work with your mmap code, we would then
also use the fault handler to get the notes sections on demand. Our current
patch does this for all notes, and therefore the other architectures would
then use the fault handler as well. One advantage for the common code is that
no additional memory has to be allocated for the notes buffer. The storage
key patch series is currently not final because it depends on the zfcpdump
patch series.

Michael
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/10 17:42), Michael Holzheu wrote:
> Hello Hatayama,
>
> On Tue, 09 Jul 2013 14:49:48 +0900 HATAYAMA Daisuke wrote:
> > (2013/07/08 23:28), Vivek Goyal wrote:
> > > On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> > > > On Mon, 08 Jul 2013 14:32:09 +0900 HATAYAMA Daisuke wrote:
[snip]
> > > I personally prefer not to special-case it for s390 only and to let the
> > > handler be generic.
> > >
> > > If there is a bug in remap_old_pfn_range(), the only side effect is that
> > > we will fault in the page when it is accessed, and that will be slow.
> > > BUG() sounds excessive. At most it could be WARN_ONCE().
> > >
> > > In regular cases for x86, this path should not even be hit. So
> > > special-casing it to detect issues with remap_old_pfn_range() does not
> > > sound very good to me. I would rather leave it as it is, and if there
> > > are bugs and mmap() slows down, then somebody needs to debug it.
> >
> > I agree to WARN_ONCE(). Then we can notice the bug at least if it occurs.
> >
> > Is the interface like this?
> >
> > [generic]
> >
> > bool __weak in_valid_fault_range(pgoff_t pgoff)
> > {
> >         return false;
> > }
> >
> > [s390]
> >
> > bool in_valid_fault_range(pgoff_t pgoff)
> > {
> >         loff_t offset = pgoff << PAGE_CACHE_SHIFT;
> >         u64 paddr = vmcore_offset_to_paddr(offset);
> >
> >         return paddr < ZFCPDUMP_HSA_SIZE;
> > }
> >
> > assuming vmcore_offset_to_paddr() looks up vmcore_list and returns the
> > physical address corresponding to a given offset of vmcore. I guess this
> > could return an error value if there's no entry corresponding to the given
> > offset in vmcore_list.
>
> I think this is too much code (and overhead) just for checking the
> correctness of the kdump mmap implementation.
>
> My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has
> the same effect as your suggestion for all architectures besides s390. And
> for s390 we take the risk that a programming error would result in poor
> /proc/vmcore performance.

If you want to avoid looking up vmcore_list, which takes linear time w.r.t.
the number of elements, you can still calculate the range of offsets in
/proc/vmcore corresponding to HSA during /proc/vmcore initialization.

Also, could you tell me how often and how much the HSA region is read during
crash dumping? I guess the read to HSA is done mainly during the early part
of the crash dumping process only. According to the code, it appears to be at
most 64 MiB only. Then, I feel performance is not a big issue.

Also, the cost of WARN_ONCE() is one memory access only in the 2nd and later
calls. I don't think it is too much overhead...

> So, at least for this patch series I would implement the fault handler as
> follows:
>
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
>         ...
>         char *buf;
>         int rc;
>
> #ifndef CONFIG_S390
>         WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
> #endif
>         page = find_or_create_page(mapping, index, GFP_KERNEL);
>
> At this point I have to tell you that we plan another vmcore patch series
> where the fault handler might be called also for other architectures. But I
> think we should *then* discuss your issue again.

Could you explain the plan in more detail? Otherwise I cannot review
correctly, since I don't know whether there's really a use case for this
generic fault handler on other architectures. This is an issue for
architectures other than s390, not mine; right now we don't need it at all.

-- 
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
Hello Hatayama,

On Tue, 09 Jul 2013 14:49:48 +0900 HATAYAMA Daisuke wrote:
> (2013/07/08 23:28), Vivek Goyal wrote:
> > On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> > > On Mon, 08 Jul 2013 14:32:09 +0900 HATAYAMA Daisuke wrote:
[snip]
> > I personally prefer not to special-case it for s390 only and to let the
> > handler be generic.
> >
> > If there is a bug in remap_old_pfn_range(), the only side effect is that
> > we will fault in the page when it is accessed, and that will be slow.
> > BUG() sounds excessive. At most it could be WARN_ONCE().
> >
> > In regular cases for x86, this path should not even be hit. So
> > special-casing it to detect issues with remap_old_pfn_range() does not
> > sound very good to me. I would rather leave it as it is, and if there are
> > bugs and mmap() slows down, then somebody needs to debug it.
>
> I agree to WARN_ONCE(). Then we can notice the bug at least if it occurs.
>
> Is the interface like this?
>
> [generic]
>
> bool __weak in_valid_fault_range(pgoff_t pgoff)
> {
>         return false;
> }
>
> [s390]
>
> bool in_valid_fault_range(pgoff_t pgoff)
> {
>         loff_t offset = pgoff << PAGE_CACHE_SHIFT;
>         u64 paddr = vmcore_offset_to_paddr(offset);
>
>         return paddr < ZFCPDUMP_HSA_SIZE;
> }
>
> assuming vmcore_offset_to_paddr() looks up vmcore_list and returns the
> physical address corresponding to a given offset of vmcore. I guess this
> could return an error value if there's no entry corresponding to the given
> offset in vmcore_list.

I think this is too much code (and overhead) just for checking the
correctness of the kdump mmap implementation.

My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the
same effect as your suggestion for all architectures besides s390. And for
s390 we take the risk that a programming error would result in poor
/proc/vmcore performance.
So, at least for this patch series I would implement the fault handler as
follows:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        ...
        char *buf;
        int rc;

#ifndef CONFIG_S390
        WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
        page = find_or_create_page(mapping, index, GFP_KERNEL);

At this point I have to tell you that we plan another vmcore patch series
where the fault handler might be called also for other architectures. But I
think we should *then* discuss your issue again.

So in order to make progress with this patch series (which is also needed to
make your mmap patches work for s390) I would suggest to use the following
patch:
---
 fs/proc/vmcore.c           |   90 +++++++++++++++++++++++++++++++++++++-----
 include/linux/crash_dump.h |    3 +
 2 files changed, 85 insertions(+), 8 deletions(-)

--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include <linux/crash_dump.h>
 #include <linux/list.h>
 #include <linux/vmalloc.h>
+#include <linux/pagemap.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(cha
 	return read_from_oldmem(buf, count, ppos, 0);
 }
 
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+				  unsigned long from, unsigned long pfn,
+				  unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+	if (userbuf) {
+		if (copy_to_user(target, src, size))
+			return -EFAULT;
+	} else {
+		memcpy(target, src, size);
+	}
+	return 0;
+}
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
-				size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+			     int userbuf)
 {
 	ssize_t acc = 0, tmp;
 	size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *
 	/* Read ELF core header */
 	if (*fpos < elfcorebuf_sz) {
 		tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
-		if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+		if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
 			return -EFAULT;
 		buflen -= tsz;
 		*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
 		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
-		if (copy_to_user(buffer, kaddr, tsz))
+		if (copy_to(buffer, kaddr, tsz, userbuf))
 			return -EFAULT;
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:

[..]

> If you want to avoid looking up vmcore_list that takes linear time
> w.r.t. the number of the elements, you can still calculate the range of
> offsets in /proc/vmcore corresponding to HSA during /proc/vmcore
> initialization.
>
> Also, could you tell me how often and how much the HSA region is during
> crash dumping? I guess the read to HSA is done mainly during early part
> of crash dumping process only. According to the code, it appears at most
> 64MiB only. Then, I feel performance is not a big issue.
>
> Also, cost of WARN_ONCE() is one memory access only in the 2nd and later
> calls. I don't think it too much overhead...

Hi Hatayama,

I think Michael's proposal of just putting in WARN_ONCE() for non-s390
architectures sounds reasonable. It is simple and meets your need of
being able to detect that non-s390 architectures don't make use of the
mmap() path yet.

Introducing in_valid_fault_range() sounds like overkill to me for this
issue.

Thanks
Vivek
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/08 23:28), Vivek Goyal wrote:
> On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> > On Mon, 08 Jul 2013 14:32:09 +0900
> > HATAYAMA Daisuke wrote:
> >
> > > (2013/07/02 4:32), Michael Holzheu wrote:
> > > > For zfcpdump we can't map the HSA storage because it is only available
> > > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > > introduce a new mechanism to create mappings on demand.
> > > >
> > > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > > of the HSA memory. For the areas that are not mapped by
> > > > remap_oldmem_pfn_range() a new generic vmcore fault handler
> > > > mmap_vmcore_fault() is called.
> > >
> > > This fault handler is only for s390 specific issue. Other architectures
> > > don't need this for the time being.
> > >
> > > Also, from the same reason, I'm doing this review based on source code
> > > only. I cannot run the fault handler on meaningful system, which is
> > > currently s390 only.
> >
> > You can test the code on other architectures if you do not map anything
> > in advance. For example you could just "return 0" in
> > remap_oldmem_pfn_range():
> >
> > /*
> >  * Architectures may override this function to map oldmem
> >  */
> > int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> > 				  unsigned long from, unsigned long pfn,
> > 				  unsigned long size, pgprot_t prot)
> > {
> > 	return 0;
> > }
> >
> > In that case for all pages the new mechanism would be used.
> >
> > > I'm also concerned about the fault handler covers a full range of
> > > vmcore, which could hide some kind of mmap() bug that results in page
> > > fault.
> > >
> > > So, the fault handler should be enclosed by ifdef CONFIG_S390 for the
> > > time being.
> >
> > I personally do not like that, but if Vivek and you prefer this, of
> > course we can do that.
> >
> > What about something like:
> >
> > #ifdef CONFIG_S390
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > 	...
> > }
> > #else
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > 	BUG();
> > }
> > #endif
>
> I personally prefer not to special case it for s390 only and let the
> handler be generic.
>
> If there is a bug in remap_oldmem_pfn_range(), the only side effect is
> that we will fault in the page when it is accessed and that will be
> slow. BUG() sounds excessive. At max it could be WARN_ONCE().
>
> In regular cases for x86, this path should not even hit. So special
> casing it to detect issues with remap_oldmem_pfn_range() does not sound
> very good to me. I would rather leave it as it is and if there are bugs
> and mmap() slows down, then somebody needs to debug it.

I agree to WARN_ONCE(). Then, we can notice the bug at least if it
occurs.

Interface is like this?

[generic]

bool __weak in_valid_fault_range(pgoff_t pgoff)
{
	return false;
}

[s390]

bool in_valid_fault_range(pgoff_t pgoff)
{
	loff_t offset = pgoff << PAGE_CACHE_SHIFT;
	u64 paddr = vmcore_offset_to_paddr(offset);

	return paddr < ZFCPDUMP_HSA_SIZE;
}

assuming vmcore_offset_to_paddr() that looks up vmcore_list and returns
the physical address corresponding to a given offset of vmcore. I guess
this could return an error value if there's no entry corresponding to
the given offset in vmcore_list.

--
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/08 18:28), Michael Holzheu wrote:
> On Mon, 08 Jul 2013 14:32:09 +0900
> HATAYAMA Daisuke wrote:
>
> > (2013/07/02 4:32), Michael Holzheu wrote:
> > > For zfcpdump we can't map the HSA storage because it is only available
> > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > introduce a new mechanism to create mappings on demand.
> > >
> > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > of the HSA memory. For the areas that are not mapped by
> > > remap_oldmem_pfn_range() a new generic vmcore fault handler
> > > mmap_vmcore_fault() is called.
> >
> > This fault handler is only for s390 specific issue. Other architectures
> > don't need this for the time being.
> >
> > Also, from the same reason, I'm doing this review based on source code
> > only. I cannot run the fault handler on meaningful system, which is
> > currently s390 only.
>
> You can test the code on other architectures if you do not map anything
> in advance. For example you could just "return 0" in
> remap_oldmem_pfn_range():
>
> /*
>  * Architectures may override this function to map oldmem
>  */
> int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> 				  unsigned long from, unsigned long pfn,
> 				  unsigned long size, pgprot_t prot)
> {
> 	return 0;
> }
>
> In that case for all pages the new mechanism would be used.

I meant without modifying source code at all. You say I need to define
some function.

[cut]

> > > +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > +{
> > > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > > +	pgoff_t index = vmf->pgoff;
> > > +	struct page *page;
> > > +	loff_t src;
> > > +	char *buf;
> > > +	int rc;
> > > +
> >
> > You should check where faulting address points to valid range.
> > If the fault happens on invalid range, return VM_FAULT_SIGBUS.
> >
> > On s390 case, I think the range except for HSA should be thought of as
> > invalid.
>
> Hmmm, this would require another architecture backend interface. If we
> get a fault for a page outside of the HSA this would be a programming
> error in our code. Not sure if we should introduce new architecture
> interfaces just for sanity checks.

I think you need to introduce the backend interface since it's a bug if
it happens. The current implementation hides such erroneous paths due to
the generic implementation. I also don't think it's a big change from
this version since you have already been about to introduce several
backend interfaces.

> > > +	page = find_or_create_page(mapping, index, GFP_KERNEL);
> > > +	if (!page)
> > > +		return VM_FAULT_OOM;
> > > +	if (!PageUptodate(page)) {
> > > +		src = index << PAGE_CACHE_SHIFT;
> >
> > src = (loff_t)index << PAGE_CACHE_SHIFT;
> >
> > loff_t has long long while index has unsigned long.
>
> Ok.
>
> > On s390 both might have the same byte length.
>
> On s390x both are 64 bit. On our 32 bit s390 architecture long long is
> 64 bit and unsigned long 32 bit.
>
> > Also I prefer offset to src, but this is minor suggestion.
>
> Yes, I agree.
>
> > > +		buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
> >
> > I found page_to_virt() macro.
>
> Looks like page_to_virt() is not usable on most architectures and
> probably pfn_to_kaddr(pfn) would be the correct thing here.
> Unfortunately it is also not defined on s390.
>
> But when I discussed your comment with Martin, we found out that the
> current
>
> buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
>
> is not correct on x86. It should be:
>
> buf = __va((page_to_pfn(page) << PAGE_SHIFT));

It seems OK for this.

--
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> On Mon, 08 Jul 2013 14:32:09 +0900
> HATAYAMA Daisuke wrote:
>
> > (2013/07/02 4:32), Michael Holzheu wrote:
> > > For zfcpdump we can't map the HSA storage because it is only available
> > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > introduce a new mechanism to create mappings on demand.
> > >
> > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > of the HSA memory. For the areas that are not mapped by
> > > remap_oldmem_pfn_range() a new generic vmcore fault handler
> > > mmap_vmcore_fault() is called.
> >
> > This fault handler is only for s390 specific issue. Other architectures
> > don't need this for the time being.
> >
> > Also, from the same reason, I'm doing this review based on source code
> > only. I cannot run the fault handler on meaningful system, which is
> > currently s390 only.
>
> You can test the code on other architectures if you do not map anything
> in advance. For example you could just "return 0" in
> remap_oldmem_pfn_range():
>
> /*
>  * Architectures may override this function to map oldmem
>  */
> int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> 				  unsigned long from, unsigned long pfn,
> 				  unsigned long size, pgprot_t prot)
> {
> 	return 0;
> }
>
> In that case for all pages the new mechanism would be used.
>
> > I'm also concerned about the fault handler covers a full range of
> > vmcore, which could hide some kind of mmap() bug that results in page
> > fault.
> >
> > So, the fault handler should be enclosed by ifdef CONFIG_S390 for the
> > time being.
>
> I personally do not like that, but if Vivek and you prefer this, of
> course we can do that.
>
> What about something like:
>
> #ifdef CONFIG_S390
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> 	...
> }
> #else
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> 	BUG();
> }
> #endif

I personally prefer not to special case it for s390 only and let the
handler be generic.

If there is a bug in remap_oldmem_pfn_range(), the only side effect is
that we will fault in the page when it is accessed and that will be
slow. BUG() sounds excessive. At max it could be WARN_ONCE().

In regular cases for x86, this path should not even hit. So special
casing it to detect issues with remap_oldmem_pfn_range() does not sound
very good to me. I would rather leave it as it is and if there are bugs
and mmap() slows down, then somebody needs to debug it.

Thanks
Vivek
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke wrote:

> (2013/07/02 4:32), Michael Holzheu wrote:
> > For zfcpdump we can't map the HSA storage because it is only available
> > via a read interface. Therefore, for the new vmcore mmap feature we have
> > introduce a new mechanism to create mappings on demand.
> >
> > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > that should be used to create mappings with remap_pfn_range() for oldmem
> > areas that can be directly mapped. For zfcpdump this is everything besides
> > of the HSA memory. For the areas that are not mapped by
> > remap_oldmem_pfn_range() a new generic vmcore fault handler
> > mmap_vmcore_fault() is called.
>
> This fault handler is only for s390 specific issue. Other architectures
> don't need this for the time being.
>
> Also, from the same reason, I'm doing this review based on source code
> only. I cannot run the fault handler on meaningful system, which is
> currently s390 only.

You can test the code on other architectures if you do not map anything
in advance. For example you could just "return 0" in
remap_oldmem_pfn_range():

/*
 * Architectures may override this function to map oldmem
 */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
				  unsigned long from, unsigned long pfn,
				  unsigned long size, pgprot_t prot)
{
	return 0;
}

In that case for all pages the new mechanism would be used.

> I'm also concerned about the fault handler covers a full range of
> vmcore, which could hide some kind of mmap() bug that results in page
> fault.
>
> So, the fault handler should be enclosed by ifdef CONFIG_S390 for the
> time being.

I personally do not like that, but if Vivek and you prefer this, of
course we can do that.

What about something like:

#ifdef CONFIG_S390
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	...
}
#else
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	BUG();
}
#endif

> > This handler works as follows:
> >
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> >   Return that page
> > * If no:
> >   Fill page using __vmcore_read(), set PageUptodate, and return page
>
> It seems good to me on this page-in logic.
>
> > @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> >  	return acc;
> >  }
> >
> > +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> > +			   size_t buflen, loff_t *fpos)
> > +{
> > +	return __read_vmcore(buffer, buflen, fpos, 1);
> > +}
> > +
> > +/*
> > + * The vmcore fault handler uses the page cache and fills data using the
> > + * standard __vmcore_read() function.
> > + */
>
> Could you describe usecase of this fault handler on s390?

ok.

> > +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	pgoff_t index = vmf->pgoff;
> > +	struct page *page;
> > +	loff_t src;
> > +	char *buf;
> > +	int rc;
> > +
>
> You should check where faulting address points to valid range.
> If the fault happens on invalid range, return VM_FAULT_SIGBUS.
>
> On s390 case, I think the range except for HSA should be thought of as
> invalid.

Hmmm, this would require another architecture backend interface. If we
get a fault for a page outside of the HSA this would be a programming
error in our code. Not sure if we should introduce new architecture
interfaces just for sanity checks.

> > +	page = find_or_create_page(mapping, index, GFP_KERNEL);
> > +	if (!page)
> > +		return VM_FAULT_OOM;
> > +	if (!PageUptodate(page)) {
> > +		src = index << PAGE_CACHE_SHIFT;
>
> src = (loff_t)index << PAGE_CACHE_SHIFT;
>
> loff_t has long long while index has unsigned long.

Ok.

> On s390 both might have the same byte length.

On s390x both are 64 bit. On our 32 bit s390 architecture long long is
64 bit and unsigned long 32 bit.

> Also I prefer offset to src, but this is minor suggestion.

Yes, I agree.

> > +		buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
>
> I found page_to_virt() macro.

Looks like page_to_virt() is not usable on most architectures and
probably pfn_to_kaddr(pfn) would be the correct thing here.
Unfortunately it is also not defined on s390.

But when I discussed your comment with Martin, we found out that the
current

buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);

is not correct on x86. It should be:

buf = __va((page_to_pfn(page) << PAGE_SHIFT));

So would be the following patch ok for you?
---
 fs/proc/vmcore.c           | 94 +
 include/linux/crash_dump.h |  3 +
 2 files changed, 89 insertions(+), 8 deletions(-)
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
(2013/07/02 4:32), Michael Holzheu wrote:

For zfcpdump we can't map the HSA storage because it is only available via a read interface. Therefore, for the new vmcore mmap feature we have to introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range() that should be used to create mappings with remap_pfn_range() for oldmem areas that can be directly mapped. For zfcpdump this is everything besides the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(), a new generic vmcore fault handler mmap_vmcore_fault() is called.

This fault handler is only for an s390-specific issue. Other architectures don't need this for the time being. Also, for the same reason, I'm doing this review based on the source code only. I cannot run the fault handler on a meaningful system, which is currently s390 only.

I'm also concerned that the fault handler covers the full range of vmcore, which could hide some kind of mmap() bug that results in a page fault. So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time being.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes: Return that page
* If no: Fill page using __vmcore_read(), set PageUptodate, and return page

This page-in logic seems good to me.

@@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
 	return acc;
 }

+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+			   size_t buflen, loff_t *fpos)
+{
+	return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ */

Could you describe the use case of this fault handler on s390?
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	pgoff_t index = vmf->pgoff;
+	struct page *page;
+	loff_t src;
+	char *buf;
+	int rc;
+

You should check whether the faulting address points to a valid range. If the fault happens on an invalid range, return VM_FAULT_SIGBUS. In the s390 case, I think the range except for the HSA should be thought of as invalid.

+	page = find_or_create_page(mapping, index, GFP_KERNEL);
+	if (!page)
+		return VM_FAULT_OOM;
+	if (!PageUptodate(page)) {
+		src = index << PAGE_CACHE_SHIFT;

src = (loff_t)index << PAGE_CACHE_SHIFT;

loff_t is long long while index is unsigned long. On s390 both might have the same byte length. Also, I prefer "offset" to "src", but this is a minor suggestion.

+		buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);

I found the page_to_virt() macro.

+		rc = __read_vmcore(buf, PAGE_SIZE, &src, 0);
+		if (rc < 0) {
+			unlock_page(page);
+			page_cache_release(page);
+			return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+		}
+		SetPageUptodate(page);
+	}
+	unlock_page(page);
+	vmf->page = page;
+	return 0;
+}
+
+static const struct vm_operations_struct vmcore_mmap_ops = {
+	.fault = mmap_vmcore_fault,
+};
+
 static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 {
 	size_t size = vma->vm_end - vma->vm_start;
@@ -242,6 +309,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
 	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &vmcore_mmap_ops;

 	len = 0;

--
Thanks.
HATAYAMA, Daisuke
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Wed, Jul 03, 2013 at 03:59:33PM +0200, Michael Holzheu wrote:
> On Tue, 2 Jul 2013 11:42:14 -0400
> Vivek Goyal wrote:
>
> > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > [snip]
> > > > This handler works as follows:
> > >
> > > * Get already available or new page from page cache (find_or_create_page)
> > > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > > * If yes:
> > >	Return that page
> > > * If no:
> > >	Fill page using __vmcore_read(), set PageUptodate, and return page
> > >
> > > Signed-off-by: Michael Holzheu
> >
> > In general vmcore related changes look fine to me. I am not very familiar
> > with the logic of finding pages in page cache and using page uptodate
> > flag.
> >
> > Hatayama, can you please review it.
> >
> > Acked-by: Vivek Goyal
>
> Hello Vivek,
>
> If Hatayama accepts the patch, should we then send the patch series to
> Andrew Morton?

Yes, once Hatayama reviews the patch, please send it to Andrew Morton for inclusion of the patches.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Tue, 2 Jul 2013 11:42:14 -0400 Vivek Goyal wrote:

> On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
[snip]
> > This handler works as follows:
> >
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> >	Return that page
> > * If no:
> >	Fill page using __vmcore_read(), set PageUptodate, and return page
> >
> > Signed-off-by: Michael Holzheu
>
> In general vmcore related changes look fine to me. I am not very familiar
> with the logic of finding pages in page cache and using page uptodate
> flag.
>
> Hatayama, can you please review it.
>
> Acked-by: Vivek Goyal

Hello Vivek,

If Hatayama accepts the patch, should we then send the patch series to Andrew Morton?

Michael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> For zfcpdump we can't map the HSA storage because it is only available
> via a read interface. Therefore, for the new vmcore mmap feature we have
> to introduce a new mechanism to create mappings on demand.
>
> This patch introduces a new architecture function remap_oldmem_pfn_range()
> that should be used to create mappings with remap_pfn_range() for oldmem
> areas that can be directly mapped. For zfcpdump this is everything besides
> the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
> a new generic vmcore fault handler mmap_vmcore_fault() is called.
>
> This handler works as follows:
>
> * Get already available or new page from page cache (find_or_create_page)
> * Check if /proc/vmcore page is filled with data (PageUptodate)
> * If yes:
>	Return that page
> * If no:
>	Fill page using __vmcore_read(), set PageUptodate, and return page
>
> Signed-off-by: Michael Holzheu

In general vmcore related changes look fine to me. I am not very familiar with the logic of finding pages in page cache and using the page uptodate flag.

Hatayama, can you please review it.
Acked-by: Vivek Goyal

Thanks
Vivek

> ---
>  fs/proc/vmcore.c           | 84 +-
>  include/linux/crash_dump.h |  3 ++
>  2 files changed, 79 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index c28189c..77312c7 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -21,6 +21,7 @@
>  #include <linux/crash_dump.h>
>  #include <linux/list.h>
>  #include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
>  #include <asm/uaccess.h>
>  #include <asm/io.h>
>  #include "internal.h"
> @@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
>  	return read_from_oldmem(buf, count, ppos, 0);
>  }
>
> +/*
> + * Architectures may override this function to map oldmem
> + */
> +int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> +				  unsigned long from, unsigned long pfn,
> +				  unsigned long size, pgprot_t prot)
> +{
> +	return remap_pfn_range(vma, from, pfn, size, prot);
> +}
> +
> +/*
> + * Copy to either kernel or user space
> + */
> +static int copy_to(void *target, void *src, size_t size, int userbuf)
> +{
> +	if (userbuf) {
> +		if (copy_to_user(target, src, size))
> +			return -EFAULT;
> +	} else {
> +		memcpy(target, src, size);
> +	}
> +	return 0;
> +}
> +
>  /* Read from the ELF header and then the crash dump. On error, negative value is
>   * returned otherwise number of bytes read are returned.
>   */
> -static ssize_t read_vmcore(struct file *file, char __user *buffer,
> -				size_t buflen, loff_t *fpos)
> +static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
> +			     int userbuf)
>  {
>  	ssize_t acc = 0, tmp;
>  	size_t tsz;
> @@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
>  	/* Read ELF core header */
>  	if (*fpos < elfcorebuf_sz) {
>  		tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
> -		if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
> +		if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
>  			return -EFAULT;
>  		buflen -= tsz;
>  		*fpos += tsz;
> @@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
>  		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
>  		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
> -		if (copy_to_user(buffer, kaddr, tsz))
> +		if (copy_to(buffer, kaddr, tsz, userbuf))
>  			return -EFAULT;
>  		buflen -= tsz;
>  		*fpos += tsz;
> @@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
>  		if (*fpos < m->offset + m->size) {
>  			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
>  			start = m->paddr + *fpos - m->offset;
> -			tmp = read_from_oldmem(buffer, tsz, &start, 1);
> +			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
>  			if (tmp < 0)
>  				return tmp;
>  			buflen -= tsz;
> @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
>  	return acc;
>  }
>
> +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> +			   size_t buflen, loff_t *fpos)
> +{
> +	return __read_vmcore(buffer, buflen, fpos, 1);
> +}
> +
> +/*
> + * The vmcore fault handler uses the page cache and fills data using the
> + * standard __vmcore_read() function.
> + */
> +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
[PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
For zfcpdump we can't map the HSA storage because it is only available via a read interface. Therefore, for the new vmcore mmap feature we have to introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range() that should be used to create mappings with remap_pfn_range() for oldmem areas that can be directly mapped. For zfcpdump this is everything besides the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(), a new generic vmcore fault handler mmap_vmcore_fault() is called.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes: Return that page
* If no: Fill page using __vmcore_read(), set PageUptodate, and return page

Signed-off-by: Michael Holzheu
---
 fs/proc/vmcore.c           | 84 +-
 include/linux/crash_dump.h |  3 ++
 2 files changed, 79 insertions(+), 8 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index c28189c..77312c7 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include <linux/crash_dump.h>
 #include <linux/list.h>
 #include <linux/vmalloc.h>
+#include <linux/pagemap.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
 	return read_from_oldmem(buf, count, ppos, 0);
 }

+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+				  unsigned long from, unsigned long pfn,
+				  unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+	if (userbuf) {
+		if (copy_to_user(target, src, size))
+			return -EFAULT;
+	} else {
+		memcpy(target, src, size);
+	}
+	return 0;
+}
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
-				size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+			     int userbuf)
 {
 	ssize_t acc = 0, tmp;
 	size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
 	/* Read ELF core header */
 	if (*fpos < elfcorebuf_sz) {
 		tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
-		if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+		if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
 			return -EFAULT;
 		buflen -= tsz;
 		*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
 		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
-		if (copy_to_user(buffer, kaddr, tsz))
+		if (copy_to(buffer, kaddr, tsz, userbuf))
 			return -EFAULT;
 		buflen -= tsz;
 		*fpos += tsz;
@@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
 		if (*fpos < m->offset + m->size) {
 			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
 			start = m->paddr + *fpos - m->offset;
-			tmp = read_from_oldmem(buffer, tsz, &start, 1);
+			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
 			if (tmp < 0)
 				return tmp;
 			buflen -= tsz;
@@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
 	return acc;
 }

+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+			   size_t buflen, loff_t *fpos)
+{
+	return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ */
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	pgoff_t index = vmf->pgoff;
+	struct page *page;
+	loff_t src;
+	char *buf;
+	int rc;
+
+	page = find_or_create_page(mapping, index, GFP_KERNEL);
+	if (!page)
+		return VM_FAULT_OOM;
+	if (!PageUptodate(page)) {
+		src = index << PAGE_CACHE_SHIFT;
+		buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
+