Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
From: Vivek Goyal
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
Date: Thu, 21 Mar 2013 10:49:29 -0400

> On Thu, Mar 21, 2013 at 12:22:59AM -0700, Eric W. Biederman wrote:
>> HATAYAMA Daisuke writes:
>>
>>> OK, rigorously, success or failure of the requested free pages
>>> allocation depends on the actual memory layout at the 2nd kernel boot. To
>>> increase the possibility of allocating memory, we have no method but to
>>> reserve more memory for the 2nd kernel now.
>>
>> Good enough.  If there are fragmentation issues that cause allocation
>> problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
>> we certainly don't need to start there.
>>
>> Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
>> an 8KiB allocation.  Aka order 0 or order 1.
>
> Actually we are already handling the large SGI machines, so we need
> to plan for 4096 cpus now while we write these patches.
>
> vmalloc() and remap_vmalloc_range() sound reasonable. So that's what
> we should probably use.
>
> Alternatively, why not allocate everything in 4K pages, use vmcore_list
> to map offsets into the right addresses, and call remap_pfn_range() on
> those addresses.

I have an introductory question about the design of vmalloc. My
understanding is that vmalloc allocates enough *pages* to cover a
requested size and returns the first corresponding virtual address, so
the address returned is inherently always page-size aligned. It looks
like vmalloc does so in the current implementation, but I don't know
older implementations, and I cannot be sure this is guaranteed by
vmalloc's interface. There's a comment explaining the interface of
vmalloc as below, but it seems to me a little vague in that it doesn't
say clearly what is returned as an address.
/**
 * vmalloc - allocate virtually contiguous memory
 * @size: allocation size
 * Allocate enough pages to cover @size from the page level
 * allocator and map them into contiguous kernel virtual space.
 *
 * For tight control over page level allocator and protection flags
 * use __vmalloc() instead.
 */
void *vmalloc(unsigned long size)
{
        return __vmalloc_node_flags(size, NUMA_NO_NODE,
                                    GFP_KERNEL | __GFP_HIGHMEM);
}
EXPORT_SYMBOL(vmalloc);

BTW, a simple test module also shows that it returns page-size aligned
objects; here 1-byte objects are allocated 12 times.

$ dmesg | tail -n 12
[3552817.290982] test: objects[0] = c960c000
[3552817.291197] test: objects[1] = c960e000
[3552817.291379] test: objects[2] = c967d000
[3552817.291566] test: objects[3] = c90010f99000
[3552817.291833] test: objects[4] = c90010f9b000
[3552817.292015] test: objects[5] = c90010f9d000
[3552817.292207] test: objects[6] = c90010f9f000
[3552817.292386] test: objects[7] = c90010fa1000
[3552817.292574] test: objects[8] = c90010fa3000
[3552817.292785] test: objects[9] = c90010fa5000
[3552817.292964] test: objects[10] = c90010fa7000
[3552817.293143] test: objects[11] = c90010fa9000

Thanks.
HATAYAMA, Daisuke
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
From: "Eric W. Biederman"
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
Date: Thu, 21 Mar 2013 17:54:22 -0700

> Vivek Goyal writes:
>
>> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>>
>> [..]
>>> So if the starting or end address of a PT_LOAD header is not aligned, why
>>> not simply allocate a page, copy the relevant data from old memory, and
>>> fill the rest with zero. That way the mmap and read views will be the same.
>>> There will be no surprises w.r.t. reading old kernel memory beyond what's
>>> specified by the headers.
>>
>> Copying from old memory might spring surprises w.r.t. hw-poisoned
>> pages. I guess we will have to disable MCE, read the page, and enable it
>> back, or something like that, to take care of these issues.
>>
>> In the past we have recommended that makedumpfile be careful, look
>> at struct pages, and make sure we are not reading poisoned pages.
>> But vmcore itself is reading old memory and can run into this
>> issue too.
>
> Vivek, you are overthinking this.
>
> If there are issues with reading partially exported pages, we should
> fix them in kexec-tools or in the kernel where the data is exported.
>
> In the examples given in the patch, what we were looking at were cases
> where the BIOS, rightly or wrongly, was saying "kernel, this is my
> memory, stay off". But it was all perfectly healthy memory.
>
> /proc/vmcore is a simple data dumper and prettifier. Let's keep it that
> way so that we can predict how it will act when we feed it information.
> /proc/vmcore should not be worrying about or covering up sins elsewhere
> in the system.
>
> At the level of /proc/vmcore we may want to do something about ensuring
> MCEs don't kill us. But that is an orthogonal problem.

This is the part of old memory /proc/vmcore must read at its
initialization to generate its metadata, i.e. the ELF header, the
program header table and the ELF note segments.

Other memory chunks are the part makedumpfile should decide whether to
read or avoid.

Thanks.
HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
Vivek Goyal writes:

> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>
> [..]
>> So if the starting or end address of a PT_LOAD header is not aligned, why
>> not simply allocate a page, copy the relevant data from old memory, and
>> fill the rest with zero. That way the mmap and read views will be the same.
>> There will be no surprises w.r.t. reading old kernel memory beyond what's
>> specified by the headers.
>
> Copying from old memory might spring surprises w.r.t. hw-poisoned
> pages. I guess we will have to disable MCE, read the page, and enable it
> back, or something like that, to take care of these issues.
>
> In the past we have recommended that makedumpfile be careful, look
> at struct pages, and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.

Vivek, you are overthinking this.

If there are issues with reading partially exported pages, we should
fix them in kexec-tools or in the kernel where the data is exported.

In the examples given in the patch, what we were looking at were cases
where the BIOS, rightly or wrongly, was saying "kernel, this is my
memory, stay off". But it was all perfectly healthy memory.

/proc/vmcore is a simple data dumper and prettifier. Let's keep it that
way so that we can predict how it will act when we feed it information.
/proc/vmcore should not be worrying about or covering up sins elsewhere
in the system.

At the level of /proc/vmcore we may want to do something about ensuring
MCEs don't kill us. But that is an orthogonal problem.

Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
From: Vivek Goyal
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
Date: Thu, 21 Mar 2013 11:27:51 -0400

> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>
> [..]
>> So if the starting or end address of a PT_LOAD header is not aligned, why
>> not simply allocate a page, copy the relevant data from old memory, and
>> fill the rest with zero. That way the mmap and read views will be the same.
>> There will be no surprises w.r.t. reading old kernel memory beyond what's
>> specified by the headers.
>
> Copying from old memory might spring surprises w.r.t. hw-poisoned
> pages. I guess we will have to disable MCE, read the page, and enable it
> back, or something like that, to take care of these issues.
>
> In the past we have recommended that makedumpfile be careful, look
> at struct pages, and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.

Yes, that has already been implemented in makedumpfile. Not only
copying, but also mmapping poisoned pages might be problematic, due to
the hardware cache prefetch triggered by creating page table entries to
the poisoned pages. Or does MCE disable the prefetch? I'm not sure, but
I'll investigate this. makedumpfile might also need to take care when
calling mmap.

Thanks.
HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:

[..]
> So if the starting or end address of a PT_LOAD header is not aligned, why
> not simply allocate a page, copy the relevant data from old memory, and
> fill the rest with zero. That way the mmap and read views will be the same.
> There will be no surprises w.r.t. reading old kernel memory beyond what's
> specified by the headers.

Copying from old memory might spring surprises w.r.t. hw-poisoned
pages. I guess we will have to disable MCE, read the page, and enable it
back, or something like that, to take care of these issues.

In the past we have recommended that makedumpfile be careful, look
at struct pages, and make sure we are not reading poisoned pages.
But vmcore itself is reading old memory and can run into this
issue too.

Thanks
Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
On Thu, Mar 21, 2013 at 12:07:12AM -0700, Eric W. Biederman wrote:

[..]
> I think the two having different contents violates the principle of
> least surprise.
>
> I think exporting the old memory as the ``extra data'' is the least
> surprising and the easiest way to go.
>
> I don't mind filling the extra data with zeros but I don't see the
> point.

I think the only question is whether there is a problem in reading
memory areas which the BIOS has kept reserved or possibly not exported.
Are there any surprises to be expected (machine reboots while trying to
read a particular memory location, etc.)? So zeroing the extra data can
theoretically make it somewhat safer.

So if the starting or end address of a PT_LOAD header is not aligned,
why not simply allocate a page, copy the relevant data from old memory,
and fill the rest with zero. That way the mmap and read views will be
the same. There will be no surprises w.r.t. reading old kernel memory
beyond what's specified by the headers.

And in practice I am not expecting many PT_LOAD ranges to be unaligned.
Just a few. And allocating a few 4K pages should not be a big deal. And
vmcore_list will again help us map whether a pfn lies in old memory or
new memory.

Thanks
Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
On Wed, Mar 20, 2013 at 11:29:05PM -0700, Eric W. Biederman wrote:

[..]
> Preserving the actual PT_LOAD segments' p_paddr and p_memsz values is
> important. p_offset we can change as much as we want. Which means there
> can be logical holes in the file between PT_LOAD segments, where we put
> the extra data needed to keep everything page aligned.

Agreed. If one modifies p_paddr then one will have to modify p_vaddr
too, and user space tools look at p_vaddr to find the corresponding
physical address. Keeping p_vaddr and p_paddr intact makes sense.

Thanks
Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
On Thu, Mar 21, 2013 at 12:22:59AM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke writes:
>
>> OK, rigorously, success or failure of the requested free pages
>> allocation depends on the actual memory layout at the 2nd kernel boot. To
>> increase the possibility of allocating memory, we have no method but to
>> reserve more memory for the 2nd kernel now.
>
> Good enough.  If there are fragmentation issues that cause allocation
> problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
> we certainly don't need to start there.
>
> Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
> an 8KiB allocation.  Aka order 0 or order 1.

Actually we are already handling the large SGI machines, so we need
to plan for 4096 cpus now while we write these patches.

vmalloc() and remap_vmalloc_range() sound reasonable. So that's what
we should probably use.

Alternatively, why not allocate everything in 4K pages, use vmcore_list
to map offsets into the right addresses, and call remap_pfn_range() on
those addresses.

Thanks
Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
On Wed, Mar 20, 2013 at 01:55:55PM -0700, Eric W. Biederman wrote:

[..]
> If core counts on the high end do more than double every 2 years we
> might have a problem.  Otherwise making everything mmapable seems easy
> and sound.

We already have a mechanism to translate a file offset into the actual
physical address where the data is. So if we can't allocate one
contiguous chunk of memory for the notes, we should be able to break it
down into multiple page-aligned areas and map offsets into the
respective discontiguous areas using vmcore_list.

Thanks
Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
HATAYAMA Daisuke writes:

> OK, rigorously, success or failure of the requested free pages
> allocation depends on the actual memory layout at the 2nd kernel boot. To
> increase the possibility of allocating memory, we have no method but to
> reserve more memory for the 2nd kernel now.

Good enough.  If there are fragmentation issues that cause allocation
problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
we certainly don't need to start there.

Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
an 8KiB allocation.  Aka order 0 or order 1.

Adding more memory is also useful. It is important in general to keep
the amount of memory needed for the kdump kernel low.

Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
HATAYAMA Daisuke writes:

> From: "Eric W. Biederman"
> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
> Date: Wed, 20 Mar 2013 23:29:05 -0700
>
>> HATAYAMA Daisuke writes:
>>>
>>> Do you mean for each range represented by each PT_LOAD entry, say:
>>>
>>> [p_paddr, p_paddr + p_memsz]
>>>
>>> extend it as:
>>>
>>> [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)],
>>>
>>> not only for objects in vmcore_list, but also updating the p_paddr and
>>> p_memsz members themselves of each PT_LOAD entry? In other words,
>>> there are no new holes not referenced by any PT_LOAD entry, since the
>>> regions referenced by some PT_LOAD entry are themselves extended.
>>
>> No.  p_paddr and p_memsz as exported should remain the same.
>> I am suggesting that we change p_offset.
>>
>> I am suggesting to include the data in the file as if we had changed
>> p_paddr and p_memsz.
>>
>>> Then, the vmcores seen from the read and mmap methods coincide in the
>>> sense that both ranges
>>>
>>> [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>>>
>>> and
>>>
>>> [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>>>
>>> are included in both vmcores seen from the read and mmap methods,
>>> although they are originally not dump target memory, which you do not
>>> consider problematic for ease of implementation.
>>>
>>> Is there a difference here from your understanding?
>>
>> Preserving the actual PT_LOAD segments' p_paddr and p_memsz values is
>> important.  p_offset we can change as much as we want.  Which means there
>> can be logical holes in the file between PT_LOAD segments, where we put
>> the extra data needed to keep everything page aligned.
>
> So, I have to ask the same question again. Is it OK if the two vmcores
> are different? How do you intend the ``extra data'' to be dealt with? I
> mean mmap() has to export part of old memory as the ``extra data''.
>
> If you think that's OK, I'll fill the ``extra data'' with 0 in the case
> of the read method. If not, I'll fill it with the corresponding part of
> old memory.

I think the two having different contents violates the principle of
least surprise.

I think exporting the old memory as the ``extra data'' is the least
surprising and the easiest way to go.

I don't mind filling the extra data with zeros but I don't see the
point.

Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
From: "Eric W. Biederman"
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
Date: Wed, 20 Mar 2013 23:29:05 -0700

> HATAYAMA Daisuke writes:
>>
>> Do you mean for each range represented by each PT_LOAD entry, say:
>>
>> [p_paddr, p_paddr + p_memsz]
>>
>> extend it as:
>>
>> [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)],
>>
>> not only for objects in vmcore_list, but also updating the p_paddr and
>> p_memsz members themselves of each PT_LOAD entry? In other words, there
>> are no new holes not referenced by any PT_LOAD entry, since the regions
>> referenced by some PT_LOAD entry are themselves extended.
>
> No.  p_paddr and p_memsz as exported should remain the same.
> I am suggesting that we change p_offset.
>
> I am suggesting to include the data in the file as if we had changed
> p_paddr and p_memsz.
>
>> Then, the vmcores seen from the read and mmap methods coincide in the
>> sense that both ranges
>>
>> [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>>
>> and
>>
>> [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>>
>> are included in both vmcores seen from the read and mmap methods,
>> although they are originally not dump target memory, which you do not
>> consider problematic for ease of implementation.
>>
>> Is there a difference here from your understanding?
>
> Preserving the actual PT_LOAD segments' p_paddr and p_memsz values is
> important.  p_offset we can change as much as we want.  Which means there
> can be logical holes in the file between PT_LOAD segments, where we put
> the extra data needed to keep everything page aligned.

So, I have to ask the same question again. Is it OK if the two vmcores
are different? How do you intend the ``extra data'' to be dealt with? I
mean mmap() has to export part of old memory as the ``extra data''.

If you think that's OK, I'll fill the ``extra data'' with 0 in the case
of the read method. If not, I'll fill it with the corresponding part of
old memory.

Thanks.
HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
HATAYAMA Daisuke writes:
>
> Do you mean for each range represented by each PT_LOAD entry, say:
>
> [p_paddr, p_paddr + p_memsz]
>
> extend it as:
>
> [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)],
>
> not only for objects in vmcore_list, but also updating the p_paddr and
> p_memsz members themselves of each PT_LOAD entry? In other words, there
> are no new holes not referenced by any PT_LOAD entry, since the regions
> referenced by some PT_LOAD entry are themselves extended.

No.  p_paddr and p_memsz as exported should remain the same.
I am suggesting that we change p_offset.

I am suggesting to include the data in the file as if we had changed
p_paddr and p_memsz.

> Then, the vmcores seen from the read and mmap methods coincide in the
> sense that both ranges
>
> [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>
> and
>
> [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>
> are included in both vmcores seen from the read and mmap methods,
> although they are originally not dump target memory, which you do not
> consider problematic for ease of implementation.
>
> Is there a difference here from your understanding?

Preserving the actual PT_LOAD segments' p_paddr and p_memsz values is
important.  p_offset we can change as much as we want.  Which means there
can be logical holes in the file between PT_LOAD segments, where we put
the extra data needed to keep everything page aligned.

Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
From: "Eric W. Biederman"
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
Date: Wed, 20 Mar 2013 21:18:37 -0700

> HATAYAMA Daisuke writes:
>
>> From: "Eric W. Biederman"
>> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
>> Date: Wed, 20 Mar 2013 13:55:55 -0700
>>
>>> Vivek Goyal writes:
>>>
>>>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>>>> HATAYAMA Daisuke writes:
>>>>>
>>>>>> If there's some vmcore object that doesn't satisfy the page-size
>>>>>> boundary requirement, remap_pfn_range() fails to remap it to
>>>>>> user-space.
>>>>>>
>>>>>> Objects that possibly don't satisfy the requirement are ELF note
>>>>>> segments only. The memory chunks corresponding to PT_LOAD entries are
>>>>>> guaranteed to satisfy the page-size boundary requirement by the copy
>>>>>> from old memory to a buffer in the 2nd kernel done in a later patch.
>>>>>>
>>>>>> This patch doesn't copy each note segment into the 2nd kernel since
>>>>>> they amount to so much in total if there are multiple CPUs. For
>>>>>> example, the current maximum number of CPUs in x86_64 is 5120, where
>>>>>> note segments exceed 1MB with NT_PRSTATUS only.
>>>>>
>>>>> So you require the first kernel to reserve an additional 20MB, instead
>>>>> of just 1.6MB.  336 bytes versus 4096 bytes.
>>>>>
>>>>> That seems like completely the wrong tradeoff in memory consumption,
>>>>> filesize, and backwards compatibility.
>>>>
>>>> Agreed.
>>>>
>>>> So we already copy ELF headers into the second kernel's memory. If we
>>>> start copying notes too, then both headers and notes will support mmap().
>>>
>>> The only real issue is that it could be a bit tricky to allocate all of
>>> the memory for the notes section on high cpu count systems in a single
>>> allocation.
>>
>> Do you mean it's getting difficult on many-cpu machines to get free
>> pages consecutive enough to cover all the notes?
>>
>> If so, is it necessary to give this any special care in the next
>> patch? Or should it be pending for now?
>
> I meant that in general allocations > PAGE_SIZE get increasingly
> unreliable the larger they are.  And on large cpu count machines we are
> having larger allocations.  Of course large cpu count machines typically
> have more memory so the odds go up.
>
> Right now MAX_ORDER seems to be set to 11 which is 8MiB, and my x86_64
> machine certainly succeeded in an order 11 allocation during boot, so I
> don't expect any real problems with a 2MiB allocation, but it is
> something to keep an eye on with kernel memory.

OK, rigorously, success or failure of the requested free pages
allocation depends on the actual memory layout at the 2nd kernel boot.
To increase the possibility of allocating memory, we have no method but
to reserve more memory for the 2nd kernel now.

>>>> For mmap() of memory regions which are not page aligned, we can map
>>>> extra bytes (as you suggested in one of the mails). Given the fact
>>>> that we have one ELF header for every memory range, we can always
>>>> modify the file offset where the phdr data starts to make space for
>>>> mapping of extra bytes.
>>>
>>> Agreed.  ELF file offset % PAGE_SIZE should == physical address %
>>> PAGE_SIZE to make mmap work.
>>
>> OK, your conclusion is that the 1st version is better than the 2nd.
>>
>> The purpose of this design was not to export anything but dump target
>> memory to user-space from /proc/vmcore. I think it better to do so if
>> possible. It's possible for the read interface to fill the corresponding
>> part with 0, but it's impossible for the mmap interface to modify data
>> on old memory.
>
> In practice someone lied.  You can't have a chunk of memory that is
> smaller than page size.  So I don't see it doing any harm to export
> the memory that is there but some silly system lied to us about.
>
>> Do you agree that the two vmcores seen from the read and mmap interfaces
>> no longer coincide?
>
> That is an interesting point.  I don't think there is any point in
> having read and mmap disagree, that just seems to lead to complications,
> especially since the data we are talking about adding is actually
> memory contents.
I do think it makes sense to have logical chunks of the file that are not covered by PT_LOAD segments. Logical chunks like the leading edge of a page inside of which a PT_LOAD segment starts, and the trailing edge of a page in which a PT_LOAD segment ends. Implementaton wise this would mean extending the struct vmcore entry to cover missing bits, by rounding down the start address and rounding up the end address to the nearest page size boundary. The generated PT_LOAD segment would then have it's file offset adjusted to point skip the bytes of the page that are there but we don't care about. Do you mean for each range represented by each PT_LOAD entry, say: [p_paddr, p_paddr + p_memsz] extend it as: [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)]. not only objects in vmcore_list, but also updating p_paddr and p_memsz members themselves of each
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com writes: Do you mean for each range represented by each PT_LOAD entry, say: [p_paddr, p_paddr + p_memsz] extend it as: [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)], not only objects in vmcore_list, but also updating the p_paddr and p_memsz members themselves of each PT_LOAD entry? In other words, there are no new holes not referenced by any PT_LOAD entry since the regions referenced by some PT_LOAD entry are themselves extended. No. p_paddr and p_memsz as exported should remain the same. I am suggesting that we change p_offset. I am suggesting to include the data in the file as if we had changed p_paddr and p_memsz. Then the vmcores seen from the read and mmap methods coincide: both ranges [rounddown(p_paddr, PAGE_SIZE), p_paddr] and [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)] are included in both views, although they are originally not dump target memory, which you consider not problematic for ease of implementation. Is there any difference here from your understanding? Preserving the actual PT_LOAD segments' p_paddr and p_memsz values is important. p_offset we can change as much as we want. Which means there can be logical holes in the file between PT_LOAD segments, where we put the extra data needed to keep everything page aligned. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
From: Eric W. Biederman ebied...@xmission.com Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement Date: Wed, 20 Mar 2013 23:29:05 -0700 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com writes: Do you mean for each range represented by each PT_LOAD entry, say: [p_paddr, p_paddr + p_memsz] extend it as: [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)], not only objects in vmcore_list, but also updating the p_paddr and p_memsz members themselves of each PT_LOAD entry? In other words, there are no new holes not referenced by any PT_LOAD entry since the regions referenced by some PT_LOAD entry are themselves extended. No. p_paddr and p_memsz as exported should remain the same. I am suggesting that we change p_offset. I am suggesting to include the data in the file as if we had changed p_paddr and p_memsz. Then the vmcores seen from the read and mmap methods coincide: both ranges [rounddown(p_paddr, PAGE_SIZE), p_paddr] and [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)] are included in both views, although they are originally not dump target memory, which you consider not problematic for ease of implementation. Is there any difference here from your understanding? Preserving the actual PT_LOAD segments' p_paddr and p_memsz values is important. p_offset we can change as much as we want. Which means there can be logical holes in the file between PT_LOAD segments, where we put the extra data needed to keep everything page aligned. So, I have to ask the same question again. Is it OK if the two vmcores are different? How do you intend the ``extra data'' to be dealt with? I mean mmap() has to export part of old memory as the ``extra data''. If you think that's OK, I'll fill the ``extra data'' with 0 in the case of the read method. If not, I'll fill it with the corresponding part of old memory. Thanks.
HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com writes: From: Eric W. Biederman ebied...@xmission.com Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement Date: Wed, 20 Mar 2013 23:29:05 -0700 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com writes: Do you mean for each range represented by each PT_LOAD entry, say: [p_paddr, p_paddr + p_memsz] extend it as: [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)]. not only objects in vmcore_list, but also updating p_paddr and p_memsz members themselves of each PT_LOAD entry? In other words, there's no new holes not referenced by any PT_LOAD entry since the regions referenced by some PT_LOAD entry, themselves are extended. No. p_paddr and p_memsz as exported should remain the same. I am suggesting that we change p_offset. I am suggesting to include the data in the file as if we had changed p_paddr and p_memsz. Then, the vmcores seen from read and mmap methods are coincide in the direction of including both ranges [rounddown(p_paddr, PAGE_SIZE), p_paddr] and [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)] are included in both vmcores seen from read and mmap methods, although they are originally not dump target memory, which you are not problematic for ease of implementation. Is there difference here from you understanding? Preserving the actual PT_LOAD segments p_paddr and p_memsz values is important. p_offset we can change as much as we want. Which means there can be logical holes in the file between PT_LOAD segments, where we put the extra data needed to keep everything page aligned. So, I have to make the same question again. Is it OK if two vmcores are different? How do you intend the ``extra data'' to be deal with? I mean mmap() has to export part of old memory as the ``extra data''. If you think OK, I'll fill the ``extra data'' with 0 in case of read method. If not OK, I'll fill with the corresponding part of old memory. 
I think the two having different contents violates the principle of least surprise. I think exporting the old memory as the ``extra data'' is the least surprising and the easiest way to go. I don't mind filling the extra data with zeros but I don't see the point. Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
HATAYAMA Daisuke d.hatay...@jp.fujitsu.com writes: OK, rigorously, success or failure of the requested free pages allocation depends on the actual memory layout at the 2nd kernel boot. To increase the possibility of allocating memory, we have no method but to reserve more memory for the 2nd kernel now. Good enough. If there are fragmentation issues that cause allocation problems on larger boxes we can use vmalloc and remap_vmalloc_range, but we certainly don't need to start there. Especially as for most 8 or 16 core boxes we are talking about a 4KiB or an 8KiB allocation. Aka order 0 or order 1. Adding more memory is also useful. It is important in general to keep the amount of memory needed for the kdump kernel low. Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Wed, Mar 20, 2013 at 01:55:55PM -0700, Eric W. Biederman wrote: [..] If core counts on the high end do more than double every 2 years we might have a problem. Otherwise making everything mmapable seems easy and sound. We already have a mechanism to translate a file offset into the actual physical address where the data is. So if we can't allocate one contiguous chunk of memory for notes, we should be able to break it down into multiple page-aligned areas and map offsets into the respective discontiguous areas using vmcore_list. Thanks Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Thu, Mar 21, 2013 at 12:22:59AM -0700, Eric W. Biederman wrote: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com writes: OK, rigorously, success or failure of the requested free pages allocation depends on the actual memory layout at the 2nd kernel boot. To increase the possibility of allocating memory, we have no method but to reserve more memory for the 2nd kernel now. Good enough. If there are fragmentation issues that cause allocation problems on larger boxes we can use vmalloc and remap_vmalloc_range, but we certainly don't need to start there. Especially as for most 8 or 16 core boxes we are talking about a 4KiB or an 8KiB allocation. Aka order 0 or order 1. Actually we are already handling the large SGI machines so we need to plan for 4096 cpus now while we write these patches. vmalloc() and remap_vmalloc_range() sound reasonable. So that's what we should probably use. Alternatively, why not allocate everything in 4K pages and use vmcore_list to map offsets into the right addresses and call remap_pfn_range() on these addresses. Thanks Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Wed, Mar 20, 2013 at 11:29:05PM -0700, Eric W. Biederman wrote: [..] Preserving the actual PT_LOAD segments' p_paddr and p_memsz values is important. p_offset we can change as much as we want. Which means there can be logical holes in the file between PT_LOAD segments, where we put the extra data needed to keep everything page aligned. Agreed. If one modifies p_paddr then one will have to modify p_vaddr too. And user-space tools look at p_vaddr to find the corresponding physical address. Keeping p_vaddr and p_paddr intact makes sense. Thanks Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Thu, Mar 21, 2013 at 12:07:12AM -0700, Eric W. Biederman wrote: [..] I think the two having different contents violates the principle of least surprise. I think exporting the old memory as the ``extra data'' is the least surprising and the easiest way to go. I don't mind filling the extra data with zeros but I don't see the point. I think the only question would be whether there is a problem in reading memory areas which the BIOS has kept reserved or possibly not exported. Are there any surprises to be expected (machine reboots while trying to read a particular memory location, etc.)? So zeroing the extra data can theoretically make it somewhat safer. So if the starting or end address of a PT_LOAD header is not aligned, why not simply allocate a page, copy the relevant data from old memory, and fill the rest with zero. That way the mmap and read views will be the same. There will be no surprises w.r.t reading old kernel memory beyond what's specified by the headers. And in practice I am not expecting many PT_LOAD ranges which are unaligned. Just a few. And allocating a few 4K pages should not be a big deal. And vmcore_list will again help us map whether a pfn lies in old memory or new memory. Thanks Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote: [..] So if starting or end address of PT_LOAD header is not aligned, why not we simply allocate a page. Copy the relevant data from old memory, fill rest with zero. That way mmap and read view will be same. There will be no surprises w.r.t reading old kernel memory beyond what's specified by the headers. Copying from old memory might spring surprises w.r.t hw poisoned pages. I guess we will have to disable MCE, read page, enable it back or something like that to take care of these issues. In the past we have recommended makedumpfile to be careful, look at struct pages and make sure we are not reading poisoned pages. But vmcore itself is reading old memory and can run into this issue too. Thanks Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
From: Vivek Goyal vgo...@redhat.com Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement Date: Thu, 21 Mar 2013 11:27:51 -0400 On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote: [..] So if starting or end address of PT_LOAD header is not aligned, why not we simply allocate a page. Copy the relevant data from old memory, fill rest with zero. That way mmap and read view will be same. There will be no surprises w.r.t reading old kernel memory beyond what's specified by the headers. Copying from old memory might spring surprises w.r.t hw poisoned pages. I guess we will have to disable MCE, read page, enable it back or something like that to take care of these issues. In the past we have recommended makedumpfile to be careful, look at struct pages and make sure we are not reading poisoned pages. But vmcore itself is reading old memory and can run into this issue too. Yes, that has already been implemented in makedumpfile. Not only copying, but also mmapping poisoned pages might be problematic, due to hardware cache prefetch triggered by creating page table entries to the poisoned pages. Or does MCE disable the prefetch? I'm not sure, but I'll investigate this. makedumpfile might also need to take care when calling mmap. Thanks. HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Vivek Goyal vgo...@redhat.com writes: On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote: [..] So if starting or end address of PT_LOAD header is not aligned, why not we simply allocate a page. Copy the relevant data from old memory, fill rest with zero. That way mmap and read view will be same. There will be no surprises w.r.t reading old kernel memory beyond what's specified by the headers. Copying from old memory might spring surprises w.r.t hw poisoned pages. I guess we will have to disable MCE, read page, enable it back or something like that to take care of these issues. In the past we have recommended makedumpfile to be careful, look at struct pages and make sure we are not reading poisoned pages. But vmcore itself is reading old memory and can run into this issue too. Vivek you are overthinking this. If there are issues with reading partially exported pages we should fix them in kexec-tools or in the kernel where the data is exported. In the examples given in the patch what we were looking at were cases where the BIOS rightly or wrongly was saying kernel this is my memory stay off. But it was all perfectly healthy memory. /proc/vmcore is a simple data dumper and prettifier. Let's keep it that way so that we can predict how it will act when we feed it information. /proc/vmcore should not be worrying about or covering up sins elsewhere in the system. At the level of /proc/vmcore we may want to do something about ensuring MCE's don't kill us. But that is an orthogonal problem. Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
From: Eric W. Biederman ebied...@xmission.com Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement Date: Thu, 21 Mar 2013 17:54:22 -0700 Vivek Goyal vgo...@redhat.com writes: On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote: [..] So if starting or end address of PT_LOAD header is not aligned, why not we simply allocate a page. Copy the relevant data from old memory, fill rest with zero. That way mmap and read view will be same. There will be no surprises w.r.t reading old kernel memory beyond what's specified by the headers. Copying from old memory might spring surprises w.r.t hw poisoned pages. I guess we will have to disable MCE, read page, enable it back or something like that to take care of these issues. In the past we have recommended makedumpfile to be careful, look at struct pages and make sure we are not reading poisoned pages. But vmcore itself is reading old memory and can run into this issue too. Vivek you are overthinking this. If there are issues with reading partially exported pages we should fix them in kexec-tools or in the kernel where the data is exported. In the examples given in the patch what we were looking at were cases where the BIOS rightly or wrongly was saying kernel this is my memory stay off. But it was all perfectly healthy memory. /proc/vmcore is a simple data dumper and prettifier. Let's keep it that way so that we can predict how it will act when we feed it information. /proc/vmcore should not be worrying about or covering up sins elsewhere in the system. At the level of /proc/vmcore we may want to do something about ensuring MCE's don't kill us. But that is an orthogonal problem. This is the part of old memory /proc/vmcore must read at its initialization to generate its meta data, i.e. ELF header, program header table and ELF note segments. Other memory chunks are the part makedumpfile should decide whether to read or avoid. Thanks.
HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
From: "Eric W. Biederman" Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement Date: Wed, 20 Mar 2013 13:55:55 -0700 > Vivek Goyal writes: > >> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote: >>> HATAYAMA Daisuke writes: >>> >>> > If there's some vmcore object that doesn't satisfy page-size boundary >>> > requirement, remap_pfn_range() fails to remap it to user-space. >>> > >>> > Objects that posisbly don't satisfy the requirement are ELF note >>> > segments only. The memory chunks corresponding to PT_LOAD entries are >>> > guaranteed to satisfy page-size boundary requirement by the copy from >>> > old memory to buffer in 2nd kernel done in later patch. >>> > >>> > This patch doesn't copy each note segment into the 2nd kernel since >>> > they amount to so large in total if there are multiple CPUs. For >>> > example, current maximum number of CPUs in x86_64 is 5120, where note >>> > segments exceed 1MB with NT_PRSTATUS only. >>> >>> So you require the first kernel to reserve an additional 20MB, instead >>> of just 1.6MB. 336 bytes versus 4096 bytes. >>> >>> That seems like completely the wrong tradeoff in memory consumption, >>> filesize, and backwards compatibility. >> >> Agreed. >> >> So we already copy ELF headers in second kernel's memory. If we start >> copying notes too, then both headers and notes will support mmap(). > > The only real is it could be a bit tricky to allocate all of the memory > for the notes section on high cpu count systems in a single allocation. > Do you mean it's getting difficult on many-CPU machines to get free pages consecutive enough to be able to cover all the notes? If so, is it necessary to address that in the next patch? Or, should it be pending for now? >> For mmap() of memory regions which are not page aligned, we can map >> extra bytes (as you suggested in one of the mails).
Given the fact >> that we have one ELF header for every memory range, we can always modify >> the file offset where phdr data is starting to make space for mapping >> of extra bytes. > > Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to > make mmap work. > OK, your conclusion is the 1st version is better than the 2nd. The purpose of this design was not to export anything but dump target memory to user-space from /proc/vmcore. I think it better to do that if possible. It's possible for the read interface to fill the corresponding part with 0. But it's impossible for the mmap interface to modify data on old memory. Do you agree the two vmcores seen from the read and mmap interfaces no longer coincide? Thanks. HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Vivek Goyal writes: > On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote: >> HATAYAMA Daisuke writes: >> >> > If there's some vmcore object that doesn't satisfy page-size boundary >> > requirement, remap_pfn_range() fails to remap it to user-space. >> > >> > Objects that posisbly don't satisfy the requirement are ELF note >> > segments only. The memory chunks corresponding to PT_LOAD entries are >> > guaranteed to satisfy page-size boundary requirement by the copy from >> > old memory to buffer in 2nd kernel done in later patch. >> > >> > This patch doesn't copy each note segment into the 2nd kernel since >> > they amount to so large in total if there are multiple CPUs. For >> > example, current maximum number of CPUs in x86_64 is 5120, where note >> > segments exceed 1MB with NT_PRSTATUS only. >> >> So you require the first kernel to reserve an additional 20MB, instead >> of just 1.6MB. 336 bytes versus 4096 bytes. >> >> That seems like completely the wrong tradeoff in memory consumption, >> filesize, and backwards compatibility. > > Agreed. > > So we already copy ELF headers in second kernel's memory. If we start > copying notes too, then both headers and notes will support mmap(). The only real issue is it could be a bit tricky to allocate all of the memory for the notes section on high cpu count systems in a single allocation. > For mmap() of memory regions which are not page aligned, we can map > extra bytes (as you suggested in one of the mails). Given the fact > that we have one ELF header for every memory range, we can always modify > the file offset where phdr data is starting to make space for mapping > of extra bytes. Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to make mmap work. > That way whole of vmcore should be mmappable and user does not have > to worry about reading part of the file and mmaping the rest. That sounds simplest.
If core counts on the high end do more than double every 2 years we might have a problem. Otherwise making everything mmapable seems easy and sound. Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke writes:
>
>> If there's some vmcore object that doesn't satisfy the page-size
>> boundary requirement, remap_pfn_range() fails to remap it to
>> user-space.
>>
>> The only objects that possibly don't satisfy the requirement are ELF
>> note segments. The memory chunks corresponding to PT_LOAD entries are
>> guaranteed to satisfy the page-size boundary requirement by the copy
>> from old memory to a buffer in the 2nd kernel done in a later patch.
>>
>> This patch doesn't copy each note segment into the 2nd kernel since
>> the segments amount to so much in total if there are multiple CPUs.
>> For example, the current maximum number of CPUs on x86_64 is 5120,
>> where the note segments exceed 1MB with NT_PRSTATUS alone.
>
> So you require the first kernel to reserve an additional 20MB, instead
> of just 1.6MB. 336 bytes versus 4096 bytes.
>
> That seems like completely the wrong tradeoff in memory consumption,
> filesize, and backwards compatibility.

Agreed.

So we already copy the ELF headers into the second kernel's memory. If
we start copying the notes too, then both headers and notes will support
mmap().

For mmap() of memory regions which are not page aligned, we can map
extra bytes (as you suggested in one of the mails). Given the fact that
we have one ELF header for every memory range, we can always modify the
file offset where the phdr data starts to make space for mapping the
extra bytes.

That way the whole of vmcore should be mmappable and the user does not
have to worry about reading part of the file and mmapping the rest.

Thanks
Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Tue, Mar 19, 2013 at 01:02:29PM -0700, Andrew Morton wrote:
> On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke wrote:
>
>> If there's some vmcore object that doesn't satisfy the page-size
>> boundary requirement, remap_pfn_range() fails to remap it to
>> user-space.
>>
>> The only objects that possibly don't satisfy the requirement are ELF
>> note segments. The memory chunks corresponding to PT_LOAD entries are
>> guaranteed to satisfy the page-size boundary requirement by the copy
>> from old memory to a buffer in the 2nd kernel done in a later patch.
>>
>> This patch doesn't copy each note segment into the 2nd kernel since
>> the segments amount to so much in total if there are multiple CPUs.
>> For example, the current maximum number of CPUs on x86_64 is 5120,
>> where the note segments exceed 1MB with NT_PRSTATUS alone.
>
> I don't really understand this. Why does the number or size of the
> note segments affect their alignment?
>
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -38,6 +38,8 @@ static u64 vmcore_size;
>>
>>  static struct proc_dir_entry *proc_vmcore = NULL;
>>
>> +static bool support_mmap_vmcore;
>
> This is quite regrettable. It means that on some kernels/machines,
> mmap(vmcore) simply won't work. This means that people might write
> code which works for them, but which will fail for others when
> deployed on a small number of machines.
>
> Can we avoid this? Why can't we just copy the notes even if there are
> a large number of them?

Actually, initially he implemented copying the notes to the second
kernel, and I suggested going the other way (tried too hard to save
memory in the second kernel). I guess it was not a good idea, and
copying the notes keeps it simple.

Thanks
Vivek
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
From: Eric W. Biederman <ebied...@xmission.com>
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Wed, 20 Mar 2013 13:55:55 -0700

> Vivek Goyal <vgo...@redhat.com> writes:
>
>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>> HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> writes:
>>>
>>>> If there's some vmcore object that doesn't satisfy the page-size
>>>> boundary requirement, remap_pfn_range() fails to remap it to
>>>> user-space.
>>>>
>>>> The only objects that possibly don't satisfy the requirement are
>>>> ELF note segments. The memory chunks corresponding to PT_LOAD
>>>> entries are guaranteed to satisfy the page-size boundary
>>>> requirement by the copy from old memory to a buffer in the 2nd
>>>> kernel done in a later patch.
>>>>
>>>> This patch doesn't copy each note segment into the 2nd kernel
>>>> since the segments amount to so much in total if there are
>>>> multiple CPUs. For example, the current maximum number of CPUs on
>>>> x86_64 is 5120, where the note segments exceed 1MB with
>>>> NT_PRSTATUS alone.
>>>
>>> So you require the first kernel to reserve an additional 20MB,
>>> instead of just 1.6MB. 336 bytes versus 4096 bytes.
>>>
>>> That seems like completely the wrong tradeoff in memory
>>> consumption, filesize, and backwards compatibility.
>>
>> Agreed.
>>
>> So we already copy the ELF headers into the second kernel's memory.
>> If we start copying the notes too, then both headers and notes will
>> support mmap().
>
> The only real issue is that it could be a bit tricky to allocate all
> of the memory for the notes section on high cpu count systems in a
> single allocation.

Do you mean it's getting difficult on a many-cpus machine to get free
pages consecutive enough to cover all the notes? If so, is it necessary
to address this in the next patch? Or should it be pending for now?

>> For mmap() of memory regions which are not page aligned, we can map
>> extra bytes (as you suggested in one of the mails). Given the fact
>> that we have one ELF header for every memory range, we can always
>> modify the file offset where the phdr data starts to make space for
>> mapping the extra bytes.
>
> Agreed. The ELF file offset % PAGE_SIZE should == physical address %
> PAGE_SIZE to make mmap work.

OK, so your conclusion is that the 1st version is better than the 2nd.

The purpose of this design was not to export anything but the dump
target memory to user-space from /proc/vmcore. I think it better to do
so if possible. It's possible for the read interface to fill the
corresponding part with 0, but it's impossible for the mmap interface
to modify data on the old memory. Do you agree that the two vmcores
seen from the read and mmap interfaces no longer coincide?

Thanks.
HATAYAMA, Daisuke
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> writes:

> From: Eric W. Biederman <ebied...@xmission.com>
> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
> Date: Wed, 20 Mar 2013 13:55:55 -0700
>
>> Vivek Goyal <vgo...@redhat.com> writes:
>>
>>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>>> HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com> writes:
>>>>
>>>>> If there's some vmcore object that doesn't satisfy the page-size
>>>>> boundary requirement, remap_pfn_range() fails to remap it to
>>>>> user-space.
>>>>>
>>>>> The only objects that possibly don't satisfy the requirement are
>>>>> ELF note segments. The memory chunks corresponding to PT_LOAD
>>>>> entries are guaranteed to satisfy the page-size boundary
>>>>> requirement by the copy from old memory to a buffer in the 2nd
>>>>> kernel done in a later patch.
>>>>>
>>>>> This patch doesn't copy each note segment into the 2nd kernel
>>>>> since the segments amount to so much in total if there are
>>>>> multiple CPUs. For example, the current maximum number of CPUs on
>>>>> x86_64 is 5120, where the note segments exceed 1MB with
>>>>> NT_PRSTATUS alone.
>>>>
>>>> So you require the first kernel to reserve an additional 20MB,
>>>> instead of just 1.6MB. 336 bytes versus 4096 bytes.
>>>>
>>>> That seems like completely the wrong tradeoff in memory
>>>> consumption, filesize, and backwards compatibility.
>>>
>>> Agreed.
>>>
>>> So we already copy the ELF headers into the second kernel's memory.
>>> If we start copying the notes too, then both headers and notes will
>>> support mmap().
>>
>> The only real issue is that it could be a bit tricky to allocate all
>> of the memory for the notes section on high cpu count systems in a
>> single allocation.
>
> Do you mean it's getting difficult on a many-cpus machine to get free
> pages consecutive enough to cover all the notes? If so, is it
> necessary to address this in the next patch? Or should it be pending
> for now?

I meant that in general allocations larger than PAGE_SIZE get
increasingly unreliable the larger they are. And on large cpu count
machines we are having larger allocations. Of course large cpu count
machines typically have more memory, so the odds go up.

Right now MAX_ORDER seems to be set to 11, which is 8MiB, and my x86_64
machine certainly succeeded in an order 11 allocation during boot, so I
don't expect any real problems with a 2MiB allocation, but it is
something to keep an eye on with kernel memory.

>>> For mmap() of memory regions which are not page aligned, we can map
>>> extra bytes (as you suggested in one of the mails). Given the fact
>>> that we have one ELF header for every memory range, we can always
>>> modify the file offset where the phdr data starts to make space for
>>> mapping the extra bytes.
>>
>> Agreed. The ELF file offset % PAGE_SIZE should == physical address %
>> PAGE_SIZE to make mmap work.
>
> OK, so your conclusion is that the 1st version is better than the 2nd.
>
> The purpose of this design was not to export anything but the dump
> target memory to user-space from /proc/vmcore. I think it better to do
> so if possible. It's possible for the read interface to fill the
> corresponding part with 0, but it's impossible for the mmap interface
> to modify data on the old memory.

In practice someone lied. You can't have a chunk of memory that is
smaller than page size. So I don't see it doing any harm to export the
memory that is there but some silly system lied to us about.

> Do you agree that the two vmcores seen from the read and mmap
> interfaces no longer coincide?

That is an interesting point. I don't think there is any point in
having read and mmap disagree; that just seems to lead to
complications, especially since the data we are talking about adding is
actually memory contents.

I do think it makes sense to have logical chunks of the file that are
not covered by PT_LOAD segments. Logical chunks like the leading edge
of a page inside of which a PT_LOAD segment starts, and the trailing
edge of a page in which a PT_LOAD segment ends.

Implementation wise this would mean extending the struct vmcore entry
to cover the missing bits, by rounding down the start address and
rounding up the end address to the nearest page size boundary. The
generated PT_LOAD segment would then have its file offset adjusted to
skip the bytes of the page that are there but we don't care about.

Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
HATAYAMA Daisuke writes:

> If there's some vmcore object that doesn't satisfy the page-size
> boundary requirement, remap_pfn_range() fails to remap it to
> user-space.
>
> The only objects that possibly don't satisfy the requirement are ELF
> note segments. The memory chunks corresponding to PT_LOAD entries are
> guaranteed to satisfy the page-size boundary requirement by the copy
> from old memory to a buffer in the 2nd kernel done in a later patch.
>
> This patch doesn't copy each note segment into the 2nd kernel since
> the segments amount to so much in total if there are multiple CPUs.
> For example, the current maximum number of CPUs on x86_64 is 5120,
> where the note segments exceed 1MB with NT_PRSTATUS alone.

So you require the first kernel to reserve an additional 20MB, instead
of just 1.6MB. 336 bytes versus 4096 bytes.

That seems like completely the wrong tradeoff in memory consumption,
filesize, and backwards compatibility.

Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Andrew Morton writes:

> On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke wrote:
>
>> If there's some vmcore object that doesn't satisfy the page-size
>> boundary requirement, remap_pfn_range() fails to remap it to
>> user-space.
>>
>> The only objects that possibly don't satisfy the requirement are ELF
>> note segments. The memory chunks corresponding to PT_LOAD entries are
>> guaranteed to satisfy the page-size boundary requirement by the copy
>> from old memory to a buffer in the 2nd kernel done in a later patch.
>>
>> This patch doesn't copy each note segment into the 2nd kernel since
>> the segments amount to so much in total if there are multiple CPUs.
>> For example, the current maximum number of CPUs on x86_64 is 5120,
>> where the note segments exceed 1MB with NT_PRSTATUS alone.
>
> I don't really understand this. Why does the number or size of the
> note segments affect their alignment?
>
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -38,6 +38,8 @@ static u64 vmcore_size;
>>
>>  static struct proc_dir_entry *proc_vmcore = NULL;
>>
>> +static bool support_mmap_vmcore;
>
> This is quite regrettable. It means that on some kernels/machines,
> mmap(vmcore) simply won't work. This means that people might write
> code which works for them, but which will fail for others when
> deployed on a small number of machines.
>
> Can we avoid this? Why can't we just copy the notes even if there are
> a large number of them?

Yes. If it simplifies things I don't see a need to support mmapping
everything.

But even there I don't see much of an issue. Today we allocate a buffer
to hold the ELF header, the program headers, and the note segment, and
we could easily allocate that buffer in such a way as to make it
mmappable.

Eric
Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke wrote:

> If there's some vmcore object that doesn't satisfy the page-size
> boundary requirement, remap_pfn_range() fails to remap it to
> user-space.
>
> The only objects that possibly don't satisfy the requirement are ELF
> note segments. The memory chunks corresponding to PT_LOAD entries are
> guaranteed to satisfy the page-size boundary requirement by the copy
> from old memory to a buffer in the 2nd kernel done in a later patch.
>
> This patch doesn't copy each note segment into the 2nd kernel since
> the segments amount to so much in total if there are multiple CPUs.
> For example, the current maximum number of CPUs on x86_64 is 5120,
> where the note segments exceed 1MB with NT_PRSTATUS alone.

I don't really understand this. Why does the number or size of the note
segments affect their alignment?

> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -38,6 +38,8 @@ static u64 vmcore_size;
>
>  static struct proc_dir_entry *proc_vmcore = NULL;
>
> +static bool support_mmap_vmcore;

This is quite regrettable. It means that on some kernels/machines,
mmap(vmcore) simply won't work. This means that people might write code
which works for them, but which will fail for others when deployed on a
small number of machines.

Can we avoid this? Why can't we just copy the notes even if there are a
large number of them?