Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-22 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
page-size boundary requirement
Date: Thu, 21 Mar 2013 10:49:29 -0400

> On Thu, Mar 21, 2013 at 12:22:59AM -0700, Eric W. Biederman wrote:
>> HATAYAMA Daisuke  writes:
>> 
>> > OK, rigorously, success or failure of the requested free pages
>> > allocation depends on actual memory layout at the 2nd kernel boot. To
>> > increase the possibility of allocating memory, we have no method but
>> > reserve more memory for the 2nd kernel now.
>> 
>> Good enough.   If there are fragmentation issues that cause allocation
>> problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
>> we certainly don't need to start there.
>> 
>> Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
>> an 8KiB allocation.  Aka order 0 or order 1.
>> 
> 
> Actually we are already handling the large SGI machines so we need
> to plan for 4096 cpus now while we write these patches.
> 
> vmalloc() and remap_vmalloc_range() sound reasonable. So that's what
> we should probably use.
> 
> Alternatively why not allocate everything in 4K pages and use vmcore_list
> to map offset into right addresses and call remap_pfn_range() on these
> addresses.

I have an introductory question about the design of vmalloc. My
understanding is that vmalloc allocates enough *pages* to cover a
requested size and returns the first corresponding virtual address.
So the returned address is inherently always page-size aligned.

It looks like vmalloc does this in the current implementation, but I
don't know the older implementations and I cannot be sure this is
guaranteed by vmalloc's interface. There is a comment explaining the
interface of vmalloc, quoted below, but it seems to me a little vague
in that it doesn't say clearly what is returned as an address.

/**
 *  vmalloc  -  allocate virtually contiguous memory
 *  @size:  allocation size
 *  Allocate enough pages to cover @size from the page level
 *  allocator and map them into contiguous kernel virtual space.
 *
 *  For tight control over page level allocator and protection flags
 *  use __vmalloc() instead.
 */
void *vmalloc(unsigned long size)
{
        return __vmalloc_node_flags(size, NUMA_NO_NODE,
                                    GFP_KERNEL | __GFP_HIGHMEM);
}
EXPORT_SYMBOL(vmalloc);

BTW, a simple test module also shows that it returns page-size aligned
objects; here, 1-byte objects are allocated 12 times.

$ dmesg | tail -n 12
[3552817.290982] test: objects[0] = c960c000
[3552817.291197] test: objects[1] = c960e000
[3552817.291379] test: objects[2] = c967d000
[3552817.291566] test: objects[3] = c90010f99000
[3552817.291833] test: objects[4] = c90010f9b000
[3552817.292015] test: objects[5] = c90010f9d000
[3552817.292207] test: objects[6] = c90010f9f000
[3552817.292386] test: objects[7] = c90010fa1000
[3552817.292574] test: objects[8] = c90010fa3000
[3552817.292785] test: objects[9] = c90010fa5000
[3552817.292964] test: objects[10] = c90010fa7000
[3552817.293143] test: objects[11] = c90010fa9000
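
The test module itself is not shown here; a minimal sketch of what it
might look like (a hypothetical reconstruction, with made-up names) is:

#include <linux/module.h>
#include <linux/vmalloc.h>

#define NR_OBJECTS 12

static void *objects[NR_OBJECTS];

static int __init vmalloc_align_test_init(void)
{
        int i;

        /* request 1 byte twelve times and print the returned addresses */
        for (i = 0; i < NR_OBJECTS; i++) {
                objects[i] = vmalloc(1);
                pr_info("test: objects[%d] = %p\n", i, objects[i]);
        }
        return 0;
}

static void __exit vmalloc_align_test_exit(void)
{
        int i;

        for (i = 0; i < NR_OBJECTS; i++)
                vfree(objects[i]);
}

module_init(vmalloc_align_test_init);
module_exit(vmalloc_align_test_exit);
MODULE_LICENSE("GPL");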

Thanks.
HATAYAMA, Daisuke



Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread HATAYAMA Daisuke
From: "Eric W. Biederman" 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
page-size boundary requirement
Date: Thu, 21 Mar 2013 17:54:22 -0700

> Vivek Goyal  writes:
> 
>> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>>
>> [..]
>>> So if starting or end address of PT_LOAD header is not aligned, why
>>> not we simply allocate a page. Copy the relevant data from old memory,
>>> fill rest with zero. That way mmap and read view will be same. There
>>> will be no surprises w.r.t reading old kernel memory beyond what's 
>>> specified by the headers.
>>
>> Copying from old memory might spring surprises w.r.t hw poisoned
>> pages. I guess we will have to disable MCE, read page, enable it
>> back or something like that to take care of these issues.
>>
>> In the past we have recommended makedumpfile to be careful, look
>> at struct pages and make sure we are not reading poisoned pages.
>> But vmcore itself is reading old memory and can run into this
>> issue too.
> 
> Vivek you are overthinking this.
> 
> If there are issues with reading partially exported pages we should
> fix them in kexec-tools or in the kernel where the data is exported.
> 
> In the examples given in the patch what we were looking at were cases
> where the BIOS, rightly or wrongly, was saying "kernel, this is my
> memory, stay off."  But it was all perfectly healthy memory.
> 
> /proc/vmcore is a simple data dumper and prettifier.  Let's keep it that
> way so that we can predict how it will act when we feed it information.
> /proc/vmcore should not be worrying about or covering up sins elsewhere
> in the system.
> 
> At the level of /proc/vmcore we may want to do something about ensuring
> MCE's don't kill us.  But that is an orthogonal problem.

This is the part of the old memory that /proc/vmcore must read at its
initialization to generate its metadata, i.e. the ELF header, the
program header table and the ELF note segments. The other memory chunks
are the part that makedumpfile should decide whether to read or avoid.

Thanks.
HATAYAMA, Daisuke



Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Eric W. Biederman
Vivek Goyal  writes:

> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>
> [..]
>> So if starting or end address of PT_LOAD header is not aligned, why
>> not we simply allocate a page. Copy the relevant data from old memory,
>> fill rest with zero. That way mmap and read view will be same. There
>> will be no surprises w.r.t reading old kernel memory beyond what's 
>> specified by the headers.
>
> Copying from old memory might spring surprises w.r.t hw poisoned
> pages. I guess we will have to disable MCE, read page, enable it
> back or something like that to take care of these issues.
>
> In the past we have recommended makedumpfile to be careful, look
> at struct pages and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.

Vivek you are overthinking this.

If there are issues with reading partially exported pages we should
fix them in kexec-tools or in the kernel where the data is exported.

In the examples given in the patch what we were looking at were cases
where the BIOS, rightly or wrongly, was saying "kernel, this is my
memory, stay off."  But it was all perfectly healthy memory.

/proc/vmcore is a simple data dumper and prettifier.  Let's keep it that
way so that we can predict how it will act when we feed it information.
/proc/vmcore should not be worrying about or covering up sins elsewhere
in the system.

At the level of /proc/vmcore we may want to do something about ensuring
MCE's don't kill us.  But that is an orthogonal problem.

Eric


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread HATAYAMA Daisuke
From: Vivek Goyal 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
page-size boundary requirement
Date: Thu, 21 Mar 2013 11:27:51 -0400

> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
> 
> [..]
>> So if starting or end address of PT_LOAD header is not aligned, why
>> not we simply allocate a page. Copy the relevant data from old memory,
>> fill rest with zero. That way mmap and read view will be same. There
>> will be no surprises w.r.t reading old kernel memory beyond what's 
>> specified by the headers.
> 
> Copying from old memory might spring surprises w.r.t hw poisoned
> pages. I guess we will have to disable MCE, read page, enable it
> back or something like that to take care of these issues.
> 
> In the past we have recommended makedumpfile to be careful, look
> at struct pages and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.

Yes, that has already been implemented in makedumpfile.

Not only copying, but also mmapping poisoned pages might be problematic,
due to hardware cache prefetch triggered by creating page table entries
for the poisoned pages. Or does MCE disable the prefetch? I'm not sure,
but I'll investigate this. makedumpfile might also need to take care
when calling mmap.

Thanks.
HATAYAMA, Daisuke



Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Vivek Goyal
On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:

[..]
> So if starting or end address of PT_LOAD header is not aligned, why
> not we simply allocate a page. Copy the relevant data from old memory,
> fill rest with zero. That way mmap and read view will be same. There
> will be no surprises w.r.t reading old kernel memory beyond what's 
> specified by the headers.

Copying from old memory might spring surprises w.r.t hw poisoned
pages. I guess we will have to disable MCE, read page, enable it
back or something like that to take care of these issues.

In the past we have recommended makedumpfile to be careful, look
at struct pages and make sure we are not reading poisoned pages.
But vmcore itself is reading old memory and can run into this
issue too.

Thanks
Vivek


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Vivek Goyal
On Thu, Mar 21, 2013 at 12:07:12AM -0700, Eric W. Biederman wrote:

[..]
> I think the two having different contents violates the principle of
> least surprise.
> 
> I think exporting the old memory as the ``extra data'' is the least
> surprising and the easiest way to go.
> 
> I don't mind filling the extra data with zeros but I don't see the
> point.

I think the only question would be whether there is a problem in reading
memory areas which the BIOS has kept reserved or possibly not exported.
Are there any surprises to be expected (machine reboots while trying to
read a particular memory location, etc.)?

So trying to zero the extra data can theoretically make it somewhat
safer.

So if the starting or end address of a PT_LOAD header is not aligned,
why don't we simply allocate a page, copy the relevant data from old
memory, and fill the rest with zero. That way the mmap and read views
will be the same. There will be no surprises w.r.t. reading old kernel
memory beyond what's specified by the headers.

And in practice I am not expecting many PT_LOAD ranges which are
unaligned. Just a few. And allocating a few 4K pages should not be a
big deal.

And vmcore_list will again help us map whether a pfn lies in old memory
or new memory.
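
A minimal sketch of that idea, purely for illustration and not the
actual /proc/vmcore code (read_from_oldmem() is the helper vmcore.c
already uses internally; the function name here is made up):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Copy the unaligned head of a PT_LOAD range into a fresh zeroed page.
 * The caller guarantees that off + len <= PAGE_SIZE.  Bytes outside
 * [off, off + len) stay zero, so read and mmap would see the same data.
 */
static char *copy_unaligned_head(u64 paddr, size_t len)
{
        char *buf = (char *)get_zeroed_page(GFP_KERNEL);
        size_t off = paddr & ~PAGE_MASK;  /* sub-page offset of the data */

        if (!buf)
                return NULL;

        if (read_from_oldmem(buf + off, len, &paddr, 0) < 0) {
                free_page((unsigned long)buf);
                return NULL;
        }
        return buf;
}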

Thanks
Vivek


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Vivek Goyal
On Wed, Mar 20, 2013 at 11:29:05PM -0700, Eric W. Biederman wrote:

[..]
> Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
> important. p_offset we can change as much as we want.  Which means there
> can be logical holes in the file between PT_LOAD segments, where we put
> the extra data needed to keep everything page aligned.

Agreed. If one modifies p_paddr then one will have to modify p_vaddr
too. And user space tools look at p_vaddr to find where the corresponding
physical address is. Keeping p_vaddr and p_paddr intact makes sense.
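
In other words, only the file offset moves; a sketch of the placement
rule being discussed (illustrative only, names are made up):

#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/* Place a PT_LOAD segment in the file so that
 * p_offset % PAGE_SIZE == p_paddr % PAGE_SIZE, leaving p_paddr,
 * p_vaddr and p_memsz untouched.  Returns the next free file position.
 */
static u64 place_pt_load(u64 file_pos, Elf64_Phdr *phdr)
{
        u64 aligned = roundup(file_pos, PAGE_SIZE);

        phdr->p_offset = aligned + (phdr->p_paddr & ~PAGE_MASK);

        return phdr->p_offset + phdr->p_memsz;
}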

Thanks
Vivek


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Vivek Goyal
On Thu, Mar 21, 2013 at 12:22:59AM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke  writes:
> 
> > OK, rigorously, success or failure of the requested free pages
> > allocation depends on actual memory layout at the 2nd kernel boot. To
> > increase the possibility of allocating memory, we have no method but
> > reserve more memory for the 2nd kernel now.
> 
> Good enough.   If there are fragmentation issues that cause allocation
> problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
> we certainly don't need to start there.
> 
> Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
> an 8KiB allocation.  Aka order 0 or order 1.
> 

Actually we are already handling the large SGI machines so we need
to plan for 4096 cpus now while we write these patches.

vmalloc() and remap_vmalloc_range() sound reasonable. So that's what
we should probably use.

Alternatively why not allocate everything in 4K pages and use vmcore_list
to map offset into right addresses and call remap_pfn_range() on these
addresses.
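
A sketch of that page-by-page alternative (illustrative only; here the
pages are tracked in a plain array, whereas the actual proposal would
track them through vmcore_list):

#include <linux/mm.h>

static int mmap_note_pages(struct vm_area_struct *vma,
                           struct page **pages, unsigned int nr_pages)
{
        unsigned long addr = vma->vm_start;
        unsigned int i;

        /* map each individually allocated 4K note page into the vma */
        for (i = 0; i < nr_pages && addr < vma->vm_end; i++) {
                if (remap_pfn_range(vma, addr, page_to_pfn(pages[i]),
                                    PAGE_SIZE, vma->vm_page_prot))
                        return -EAGAIN;
                addr += PAGE_SIZE;
        }
        return 0;
}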

Thanks
Vivek


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Vivek Goyal
On Wed, Mar 20, 2013 at 01:55:55PM -0700, Eric W. Biederman wrote:

[..]
> If core counts on the high end do more than double every 2 years we
> might have a problem.  Otherwise making everything mmapable seems easy
> and sound.

We already have a mechanism to translate a file offset into the actual
physical address where the data is. So if we can't allocate one
contiguous chunk of memory for the notes, we should be able to break it
down into multiple page-aligned areas and map offsets into the
respective discontiguous areas using vmcore_list.
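
For illustration, that translation could look roughly like the sketch
below (mirroring what the read path already does, assuming the struct
vmcore element with list/paddr/size/offset fields and a list sorted by
file offset; not the exact code):

#include <linux/list.h>
#include <linux/types.h>

static struct vmcore *offset_to_chunk(struct list_head *vc_list,
                                      loff_t offset, u64 *paddr)
{
        struct vmcore *m;

        list_for_each_entry(m, vc_list, list) {
                if (offset < m->offset + m->size) {
                        /* offset falls inside this chunk */
                        *paddr = m->paddr + (offset - m->offset);
                        return m;
                }
        }
        return NULL;    /* offset lies beyond the last chunk */
}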

Thanks
Vivek


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Eric W. Biederman
HATAYAMA Daisuke  writes:

> OK, rigorously, success or failure of the requested free pages
> allocation depends on actual memory layout at the 2nd kernel boot. To
> increase the possibility of allocating memory, we have no method but
> reserve more memory for the 2nd kernel now.

Good enough.   If there are fragmentation issues that cause allocation
problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
we certainly don't need to start there.

Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
an 8KiB allocation.  Aka order 0 or order 1.

Adding more memory is also useful.  It is important in general to keep
the amount of memory needed for the kdump kernel low.
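
For reference, the vmalloc() route could be as small as the following
sketch, assuming a hypothetical buffer elfnotes_buf of elfnotes_sz bytes
allocated with vmalloc_user() (page aligned, zeroed, and accepted by
remap_vmalloc_range()) and filled with the merged note segments at
/proc/vmcore initialization time; not the actual implementation:

#include <linux/mm.h>
#include <linux/vmalloc.h>

static int mmap_elfnotes(struct vm_area_struct *vma,
                         void *elfnotes_buf, size_t elfnotes_sz)
{
        /* refuse mappings larger than the (page-rounded) notes buffer */
        if (vma->vm_end - vma->vm_start > PAGE_ALIGN(elfnotes_sz))
                return -EINVAL;

        return remap_vmalloc_range(vma, elfnotes_buf, 0);
}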

Eric



Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Eric W. Biederman
HATAYAMA Daisuke  writes:

> From: "Eric W. Biederman" 
> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
> page-size boundary requirement
> Date: Wed, 20 Mar 2013 23:29:05 -0700
>
>> HATAYAMA Daisuke  writes:
>>>
>>> Do you mean for each range represented by each PT_LOAD entry, say:
>>>
>>>   [p_paddr, p_paddr + p_memsz]
>>>
>>> extend it as:
>>>
>>>   [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)].
>>>
>>> not only objects in vmcore_list, but also updating p_paddr and p_memsz
>>> members themselves of each PT_LOAD entry? In other words, there are no
>>> new holes not referenced by any PT_LOAD entry, since the regions
>>> referenced by some PT_LOAD entry are themselves extended.
>> 
>> No.  p_paddr and p_memsz as exported should remain the same.
>> I am suggesting that we change p_offset.
>> 
>> I am suggesting to include the data in the file as if we had changed
>> p_paddr and p_memsz.
>> 
>>> Then, the vmcores seen from the read and mmap methods coincide in the
>>> sense that both ranges
>>>
>>>   [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>>>
>>> and
>>>
>>>   [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>>>
>>> are included in both the read and mmap views, although they are
>>> originally not dump target memory, which you do not regard as
>>> problematic for ease of implementation.
>>>
>>> Is there any difference here from your understanding?
>> 
>> Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
>> important. p_offset we can change as much as we want.  Which means there
>> can be logical holes in the file between PT_LOAD segments, where we put
>> the extra data needed to keep everything page aligned.
>> 
>
> So, I have to ask the same question again. Is it OK if the two vmcores
> are different? How do you intend the ``extra data'' to be dealt with? I
> mean mmap() has to export part of the old memory as the ``extra data''.
>
> If you think that is OK, I'll fill the ``extra data'' with 0 in the case
> of the read method. If not, I'll fill it with the corresponding part of
> old memory.

I think the two having different contents violates the principle of
least surprise.

I think exporting the old memory as the ``extra data'' is the least
surprising and the easiest way to go.

I don't mind filling the extra data with zeros but I don't see the
point.

Eric


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread HATAYAMA Daisuke
From: "Eric W. Biederman" 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
page-size boundary requirement
Date: Wed, 20 Mar 2013 23:29:05 -0700

> HATAYAMA Daisuke  writes:
>>
>> Do you mean for each range represented by each PT_LOAD entry, say:
>>
>>   [p_paddr, p_paddr + p_memsz]
>>
>> extend it as:
>>
>>   [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)].
>>
>> not only objects in vmcore_list, but also updating p_paddr and p_memsz
>> members themselves of each PT_LOAD entry? In other words, there are no
>> new holes not referenced by any PT_LOAD entry, since the regions
>> referenced by some PT_LOAD entry are themselves extended.
> 
> No.  p_paddr and p_memsz as exported should remain the same.
> I am suggesting that we change p_offset.
> 
> I am suggesting to include the data in the file as if we had changed
> p_paddr and p_memsz.
> 
>> Then, the vmcores seen from the read and mmap methods coincide in the
>> sense that both ranges
>>
>>   [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>>
>> and
>>
>>   [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>>
>> are included in both the read and mmap views, although they are
>> originally not dump target memory, which you do not regard as
>> problematic for ease of implementation.
>>
>> Is there any difference here from your understanding?
> 
> Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
> important. p_offset we can change as much as we want.  Which means there
> can be logical holes in the file between PT_LOAD segments, where we put
> the extra data needed to keep everything page aligned.
> 

So, I have to ask the same question again. Is it OK if the two vmcores
are different? How do you intend the ``extra data'' to be dealt with? I
mean mmap() has to export part of the old memory as the ``extra data''.

If you think that is OK, I'll fill the ``extra data'' with 0 in the case
of the read method. If not, I'll fill it with the corresponding part of
old memory.

Thanks.
HATAYAMA, Daisuke



Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread Eric W. Biederman
HATAYAMA Daisuke  writes:
>
> Do you mean for each range represented by each PT_LOAD entry, say:
>
>   [p_paddr, p_paddr + p_memsz]
>
> extend it as:
>
>   [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)].
>
> not only objects in vmcore_list, but also updating p_paddr and p_memsz
> members themselves of each PT_LOAD entry? In other words, there are no
> new holes not referenced by any PT_LOAD entry, since the regions
> referenced by some PT_LOAD entry are themselves extended.

No.  p_paddr and p_memsz as exported should remain the same.
I am suggesting that we change p_offset.

I am suggesting to include the data in the file as if we had changed
p_paddr and p_memsz.

> Then, the vmcores seen from the read and mmap methods coincide in the
> sense that both ranges
>
>   [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>
> and
>
>   [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>
> are included in both the read and mmap views, although they are
> originally not dump target memory, which you do not regard as
> problematic for ease of implementation.
>
> Is there any difference here from your understanding?

Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
important. p_offset we can change as much as we want.  Which means there
can be logical holes in the file between PT_LOAD segments, where we put
the extra data needed to keep everything page aligned.
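
One way to realize that, sketched here only for illustration (assuming
the struct vmcore element with paddr/size fields used by
fs/proc/vmcore.c), is to extend each chunk to page boundaries while the
exported PT_LOAD header keeps its original p_paddr/p_memsz:

#include <linux/kernel.h>
#include <linux/mm.h>

static void page_align_chunk(struct vmcore *m)
{
        u64 start = rounddown(m->paddr, PAGE_SIZE);
        u64 end = roundup(m->paddr + m->size, PAGE_SIZE);

        /* the extra leading/trailing bytes become the "extra data" */
        m->paddr = start;
        m->size = end - start;
}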

Eric


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-21 Thread HATAYAMA Daisuke
From: "Eric W. Biederman" 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
page-size boundary requirement
Date: Wed, 20 Mar 2013 21:18:37 -0700

> HATAYAMA Daisuke  writes:
> 
>> From: "Eric W. Biederman" 
>> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify 
>> mmap()'s page-size boundary requirement
>> Date: Wed, 20 Mar 2013 13:55:55 -0700
>>
>>> Vivek Goyal  writes:
>>> 
>>>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>>>> HATAYAMA Daisuke  writes:
>>>>> 
>>>>> > If there's some vmcore object that doesn't satisfy page-size boundary
>>>>> > requirement, remap_pfn_range() fails to remap it to user-space.
>>>>> >
>>>>> > Objects that possibly don't satisfy the requirement are ELF note
>>>>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>>>>> > guaranteed to satisfy page-size boundary requirement by the copy from
>>>>> > old memory to buffer in 2nd kernel done in later patch.
>>>>> >
>>>>> > This patch doesn't copy each note segment into the 2nd kernel since
>>>>> > they amount to so much in total if there are multiple CPUs. For
>>>>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>>>>> > segments exceed 1MB with NT_PRSTATUS only.
>>>>> 
>>>>> So you require the first kernel to reserve an additional 20MB, instead
>>>>> of just 1.6MB.  336 bytes versus 4096 bytes.
>>>>> 
>>>>> That seems like completely the wrong tradeoff in memory consumption,
>>>>> filesize, and backwards compatibility.
>>>>
>>>> Agreed. 
>>>>
>>>> So we already copy ELF headers in second kernel's memory. If we start
>>>> copying notes too, then both headers and notes will support mmap().
>>> 
>>> The only real issue is it could be a bit tricky to allocate all of the memory
>>> for the notes section on high cpu count systems in a single allocation.
>>> 
>>
>> Do you mean it's getting difficult on a many-cpu machine to get free
>> pages consecutive enough to be able to cover all the notes?
>>
>> If so, is it necessary to think about any care to it in the next
>> patch? Or, should it be pending for now?
> 
> I meant that in general allocations > PAGE_SIZE get increasingly
> unreliable the larger they are.  And on large cpu count machines we are
> having larger allocations.  Of course large cpu count machines typically
> have more memory so the odds go up.
> 
> Right now MAX_ORDER seems to be set to 11 which is 8MiB, and my x86_64
> machine certainly succeeded in an order 11 allocation during boot so I
> don't expect any real problems with a 2MiB allocation but it is
> something to keep an eye on with kernel memory.
> 

OK, rigorously speaking, success or failure of the requested free page
allocation depends on the actual memory layout at the 2nd kernel boot.
To increase the possibility of allocating memory, we have no method but
to reserve more memory for the 2nd kernel now.

>>>> For mmap() of memory regions which are not page aligned, we can map
>>>> extra bytes (as you suggested in one of the mails). Given the fact
>>>> that we have one ELF header for every memory range, we can always modify
>>>> the file offset where phdr data is starting to make space for mapping
>>>> of extra bytes.
>>> 
>>> Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to
>>> make mmap work.
>>> 
>>
>> OK, your conclusion is the 1st version is better than the 2nd.
>>
>> The purpose of this design was not to export anything but dump target
>> memory to user-space from /proc/vmcore. I think it better to do it if
>> possible. It's possible for the read interface to fill the
>> corresponding part with 0, but it's impossible for the mmap interface
>> to modify data on old memory.
> 
> In practice someone lied.  You can't have a chunk of memory that is
> smaller than page size.  So I don't see it doing any harm to export
> the memory that is there but some silly system lied to us about.
> 
>> Do you agree that the two vmcores seen from the read and mmap
>> interfaces no longer coincide?
> 
> That is an interesting point.  I don't think there is any point in
> having read and mmap disagree, that just seems to lead to complications,
> especially since the data we are talking about adding is actually memory
> contents.
> 
> I do think it makes sense to have logical chunks of the file that are
> not covered by PT_LOAD segments.  Logical chunks like the leading edge
> of a page inside of which a PT_LOAD segment starts, and the trailing
> edge of a page in which a PT_LOAD segment ends.
> 
> Implementation-wise this would mean extending the struct vmcore entry to
> cover missing bits, by rounding down the start address and rounding up
> the end address to the nearest page size boundary.  The generated
> PT_LOAD segment would then have its file offset adjusted to skip
> the bytes of the page that are there but we don't care about.

Do you mean for each range represented by each PT_LOAD entry, say:

  [p_paddr, p_paddr + p_memsz]

extend it as:

  [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)].

not only objects in vmcore_list, but also updating p_paddr and p_memsz
members themselves of each PT_LOAD entry? In other words, there are no
new holes not referenced by any PT_LOAD entry, since the regions
referenced by some PT_LOAD entry are themselves extended.

Then, the vmcores seen from the read and mmap methods coincide in the
sense that both ranges

  [rounddown(p_paddr, PAGE_SIZE), p_paddr]

and

  [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]

are included in both the read and mmap views, although they are
originally not dump target memory, which you do not regard as
problematic for ease of implementation.

Is there any difference here from your understanding?

Thanks.
HATAYAMA, Daisuke


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-20 Thread Eric W. Biederman
HATAYAMA Daisuke  writes:

> From: "Eric W. Biederman" 
> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
> page-size boundary requirement
> Date: Wed, 20 Mar 2013 13:55:55 -0700
>
>> Vivek Goyal  writes:
>> 
>>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>>> HATAYAMA Daisuke  writes:
>>>> 
>>>> > If there's some vmcore object that doesn't satisfy page-size boundary
>>>> > requirement, remap_pfn_range() fails to remap it to user-space.
>>>> >
>>>> > Objects that possibly don't satisfy the requirement are ELF note
>>>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>>>> > guaranteed to satisfy page-size boundary requirement by the copy from
>>>> > old memory to buffer in 2nd kernel done in later patch.
>>>> >
>>>> > This patch doesn't copy each note segment into the 2nd kernel since
>>>> > they amount to so much in total if there are multiple CPUs. For
>>>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>>>> > segments exceed 1MB with NT_PRSTATUS only.
>>>> 
>>>> So you require the first kernel to reserve an additional 20MB, instead
>>>> of just 1.6MB.  336 bytes versus 4096 bytes.
>>>> 
>>>> That seems like completely the wrong tradeoff in memory consumption,
>>>> filesize, and backwards compatibility.
>>>
>>> Agreed. 
>>>
>>> So we already copy ELF headers in second kernel's memory. If we start
>>> copying notes too, then both headers and notes will support mmap().
>> 
>> The only real issue is that it could be a bit tricky to allocate all of the memory
>> for the notes section on high cpu count systems in a single allocation.
>> 
>
> Do you mean it's getting difficult on many-CPU machines to get free
> pages consecutive enough to be able to cover all the notes?
>
> If so, is it necessary to take any special care about it in the next
> patch? Or should it be left pending for now?

I meant that in general allocations > PAGE_SIZE get increasingly
unreliable the larger they are.  And on large cpu count machines we are
having larger allocations.  Of course large cpu count machines typically
have more memory so the odds go up.

Right now MAX_ORDER seems to be set to 11 which is 8MiB, and my x86_64
machine certainly succeeded in an order 11 allocation during boot so I
don't expect any real problems with a 2MiB allocation but it is
something to keep an eye on with kernel memory.
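
For a rough feel of the numbers, here is a tiny standalone sketch (not from
the patch itself; it simply assumes 4 KiB pages and the roughly 336-byte
NT_PRSTATUS note size quoted in this thread) that works out the allocation
order needed to hold all the notes in one contiguous buffer:

/* Back-of-the-envelope only: estimate the page-allocation order needed
 * to hold all per-cpu notes in one physically contiguous buffer. */
#include <stdio.h>

#define PAGE_SIZE   4096UL
#define PRSTATUS_SZ 336UL     /* approximate per-cpu note size quoted above */

static unsigned int order_for(unsigned long bytes)
{
        unsigned long pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
        unsigned int order = 0;

        while ((1UL << order) < pages)
                order++;
        return order;
}

int main(void)
{
        unsigned long cpus[] = { 16, 256, 5120 };

        for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
                unsigned long bytes = cpus[i] * PRSTATUS_SZ;
                printf("%5lu cpus: %7lu bytes of notes -> order %u\n",
                       cpus[i], bytes, order_for(bytes));
        }
        return 0;
}

With 5120 CPUs that works out to an order-9 (2 MiB) allocation, which matches
the 2 MiB figure above and is still well below the MAX_ORDER limit.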

>>> For mmap() of memory regions which are not page aligned, we can map
>>> extra bytes (as you suggested in one of the mails). Given the fact
>>> that we have one ELF header for every memory range, we can always modify
>>> the file offset where phdr data is starting to make space for mapping
>>> of extra bytes.
>> 
>> Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to
>> make mmap work.
>> 
>
> OK, so your conclusion is that the 1st version is better than the 2nd.
>
> The purpose of this design was to export nothing but the dump target
> memory to user-space from /proc/vmcore, and I think it better to do that if
> possible. The read interface can fill the corresponding part with 0,
> but the mmap interface cannot do that without modifying the old memory.

In practice someone lied.  You can't have a chunk of memory that is
smaller than page size.  So I don't see it doing any harm to export
the memory that is there but some silly system lied to us about.

> Do you agree that the two vmcore images seen from the read and mmap
> interfaces no longer coincide?

That is an interesting point.  I don't think there is any point in
having read and mmap disagree, that just seems to lead to complications,
especially since the data we are talking about adding is actually memory
contents.

I do think it makes sense to have logical chunks of the file that are
not covered by PT_LOAD segments.  Logical chunks like the leading edge
of a page inside of which a PT_LOAD segment starts, and the trailing
edge of a page in which a PT_LOAD segment ends.

Implementation-wise this would mean extending the struct vmcore entry to
cover the missing bits, by rounding down the start address and rounding up
the end address to the nearest page-size boundary.  The generated
PT_LOAD segment would then have its file offset adjusted to skip
the bytes of the page that are there but that we don't care about.
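
A minimal sketch of that rounding, just to make the bookkeeping concrete (the
struct and field names here are made up for illustration, PAGE_SIZE is assumed
to be a power of two, and this is not code from the patch):

#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define PAGE_MASK (~(PAGE_SIZE - 1))

struct vmcore_region {          /* what actually gets mapped: whole pages    */
        uint64_t start;         /* physical start, rounded down              */
        uint64_t end;           /* physical end, rounded up (exclusive)      */
        uint64_t file_off;      /* page-aligned offset of these pages in file */
};

struct load_view {              /* what the generated PT_LOAD describes      */
        uint64_t p_paddr;
        uint64_t p_filesz;
        uint64_t p_offset;
};

/* Extend [paddr, paddr + size) out to page boundaries for the mapped
 * region, and point the PT_LOAD offset past the leading slack so that
 * p_offset % PAGE_SIZE == p_paddr % PAGE_SIZE.  file_off is assumed to be
 * the page-aligned spot where the extended region lands in the file. */
static struct load_view round_out(uint64_t paddr, uint64_t size,
                                  uint64_t file_off, struct vmcore_region *r)
{
        struct load_view v;

        r->start    = paddr & PAGE_MASK;
        r->end      = (paddr + size + PAGE_SIZE - 1) & PAGE_MASK;
        r->file_off = file_off;

        v.p_paddr   = paddr;
        v.p_filesz  = size;
        v.p_offset  = file_off + (paddr - r->start);
        return v;
}

The bytes between r->start and p_paddr (and past the segment's end, up to
r->end) are present in the file but not claimed by any PT_LOAD, which is the
"logical chunks not covered by PT_LOAD segments" idea above.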

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-20 Thread HATAYAMA Daisuke
From: "Eric W. Biederman" 
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s 
page-size boundary requirement
Date: Wed, 20 Mar 2013 13:55:55 -0700

> Vivek Goyal  writes:
> 
>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>> HATAYAMA Daisuke  writes:
>>> 
>>> > If there's some vmcore object that doesn't satisfy page-size boundary
>>> > requirement, remap_pfn_range() fails to remap it to user-space.
>>> >
>>> > Objects that possibly don't satisfy the requirement are ELF note
>>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>>> > guaranteed to satisfy page-size boundary requirement by the copy from
>>> > old memory to buffer in 2nd kernel done in later patch.
>>> >
>>> > This patch doesn't copy each note segment into the 2nd kernel since
>>> > they amount to so much in total if there are multiple CPUs. For
>>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>>> > segments exceed 1MB with NT_PRSTATUS only.
>>> 
>>> So you require the first kernel to reserve an additional 20MB, instead
>>> of just 1.6MB.  336 bytes versus 4096 bytes.
>>> 
>>> That seems like completely the wrong tradeoff in memory consumption,
>>> filesize, and backwards compatibility.
>>
>> Agreed. 
>>
>> So we already copy ELF headers in second kernel's memory. If we start
>> copying notes too, then both headers and notes will support mmap().
> 
> The only real issue is that it could be a bit tricky to allocate all of the memory
> for the notes section on high cpu count systems in a single allocation.
> 

Do you mean it's getting difficult on many-CPU machines to get free
pages consecutive enough to be able to cover all the notes?

If so, is it necessary to take any special care about it in the next
patch? Or should it be left pending for now?

>> For mmap() of memory regions which are not page aligned, we can map
>> extra bytes (as you suggested in one of the mails). Given the fact
>> that we have one ELF header for every memory range, we can always modify
>> the file offset where phdr data is starting to make space for mapping
>> of extra bytes.
> 
> Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to
> make mmap work.
> 

OK, so your conclusion is that the 1st version is better than the 2nd.

The purpose of this design was to export nothing but the dump target
memory to user-space from /proc/vmcore, and I think it better to do that if
possible. The read interface can fill the corresponding part with 0,
but the mmap interface cannot do that without modifying the old memory.

Do you agree that the two vmcore images seen from the read and mmap
interfaces no longer coincide?

Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-20 Thread Eric W. Biederman
Vivek Goyal  writes:

> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>> HATAYAMA Daisuke  writes:
>> 
>> > If there's some vmcore object that doesn't satisfy page-size boundary
>> > requirement, remap_pfn_range() fails to remap it to user-space.
>> >
>> > Objects that possibly don't satisfy the requirement are ELF note
>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>> > guaranteed to satisfy page-size boundary requirement by the copy from
>> > old memory to buffer in 2nd kernel done in later patch.
>> >
>> > This patch doesn't copy each note segment into the 2nd kernel since
>> > they amount to so much in total if there are multiple CPUs. For
>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>> > segments exceed 1MB with NT_PRSTATUS only.
>> 
>> So you require the first kernel to reserve an additional 20MB, instead
>> of just 1.6MB.  336 bytes versus 4096 bytes.
>> 
>> That seems like completely the wrong tradeoff in memory consumption,
>> filesize, and backwards compatibility.
>
> Agreed. 
>
> So we already copy ELF headers in second kernel's memory. If we start
> copying notes too, then both headers and notes will support mmap().

The only real issue is that it could be a bit tricky to allocate all of the memory
for the notes section on high cpu count systems in a single allocation.

> For mmap() of memory regions which are not page aligned, we can map
> extra bytes (as you suggested in one of the mails). Given the fact
> that we have one ELF header for every memory range, we can always modify
> the file offset where phdr data is starting to make space for mapping
> of extra bytes.

Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to
make mmap work.

> That way whole of vmcore should be mmappable and user does not have
> to worry about reading part of the file and mmaping the rest.

That sounds simplest.

If core counts on the high end do more than double every 2 years we
might have a problem.  Otherwise making everything mmapable seems easy
and sound.
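
To see how a consumer would use that, here is a hypothetical userspace sketch
(not taken from makedumpfile or any real tool) of mapping the data one PT_LOAD
of /proc/vmcore points at; mmap() only accepts a page-aligned file offset, so
the caller rounds the offset down and skips the slack:

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <elf.h>

/* Map the file data described by one PT_LOAD and return a pointer to its
 * first byte, or NULL on failure.  mmap() wants a page-aligned file
 * offset, so round down and step over the leading slack. */
void *map_load(int fd, const Elf64_Phdr *ph)
{
        long page     = sysconf(_SC_PAGESIZE);
        off_t aligned = ph->p_offset & ~((off_t)page - 1);
        size_t skip   = ph->p_offset - aligned;
        char *base    = mmap(NULL, ph->p_filesz + skip, PROT_READ,
                             MAP_PRIVATE, fd, aligned);

        if (base == MAP_FAILED)
                return NULL;
        return base + skip;
}

Nothing here is specific to vmcore; it is just the usual dance around
mmap()'s page-aligned-offset requirement.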

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-20 Thread Vivek Goyal
On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke  writes:
> 
> > If there's some vmcore object that doesn't satisfy page-size boundary
> > requirement, remap_pfn_range() fails to remap it to user-space.
> >
> > Objects that possibly don't satisfy the requirement are ELF note
> > segments only. The memory chunks corresponding to PT_LOAD entries are
> > guaranteed to satisfy page-size boundary requirement by the copy from
> > old memory to buffer in 2nd kernel done in later patch.
> >
> > This patch doesn't copy each note segment into the 2nd kernel since
> > they amount to so much in total if there are multiple CPUs. For
> > example, current maximum number of CPUs in x86_64 is 5120, where note
> > segments exceed 1MB with NT_PRSTATUS only.
> 
> So you require the first kernel to reserve an additional 20MB, instead
> of just 1.6MB.  336 bytes versus 4096 bytes.
> 
> That seems like completely the wrong tradeoff in memory consumption,
> filesize, and backwards compatibility.

Agreed. 

So we already copy ELF headers in second kernel's memory. If we start
copying notes too, then both headers and notes will support mmap().

For mmap() of memory regions which are not page aligned, we can map
extra bytes (as you suggested in one of the mails). Given the fact
that we have one ELF header for every memory range, we can always modify
the file offset where phdr data is starting to make space for mapping
of extra bytes.

That way whole of vmcore should be mmappable and user does not have
to worry about reading part of the file and mmaping the rest.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-20 Thread Vivek Goyal
On Tue, Mar 19, 2013 at 01:02:29PM -0700, Andrew Morton wrote:
> On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke 
>  wrote:
> 
> > If there's some vmcore object that doesn't satisfy page-size boundary
> > requirement, remap_pfn_range() fails to remap it to user-space.
> > 
> > Objects that possibly don't satisfy the requirement are ELF note
> > segments only. The memory chunks corresponding to PT_LOAD entries are
> > guaranteed to satisfy page-size boundary requirement by the copy from
> > old memory to buffer in 2nd kernel done in later patch.
> > 
> > This patch doesn't copy each note segment into the 2nd kernel since
> > they amount to so much in total if there are multiple CPUs. For
> > example, current maximum number of CPUs in x86_64 is 5120, where note
> > segments exceed 1MB with NT_PRSTATUS only.
> 
> I don't really understand this.  Why does the number of or size of
> note segments affect their alignment?
> 
> > --- a/fs/proc/vmcore.c
> > +++ b/fs/proc/vmcore.c
> > @@ -38,6 +38,8 @@ static u64 vmcore_size;
> >  
> >  static struct proc_dir_entry *proc_vmcore = NULL;
> >  
> > +static bool support_mmap_vmcore;
> 
> This is quite regrettable.  It means that on some kernels/machines,
> mmap(vmcore) simply won't work.  This means that people might write
> code which works for them, but which will fail for others when deployed
> on a small number of machines.
> 
> Can we avoid this?  Why can't we just copy the notes even if there are
> a large number of them?

Actually, initially he implemented copying the notes to the second kernel and I
suggested going the other way (I tried too hard to save memory in the second
kernel). I guess it was not a good idea, and copying the notes keeps it simple.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-19 Thread Eric W. Biederman
HATAYAMA Daisuke  writes:

> If there's some vmcore object that doesn't satisfy page-size boundary
> requirement, remap_pfn_range() fails to remap it to user-space.
>
> Objects that possibly don't satisfy the requirement are ELF note
> segments only. The memory chunks corresponding to PT_LOAD entries are
> guaranteed to satisfy page-size boundary requirement by the copy from
> old memory to buffer in 2nd kernel done in later patch.
>
> This patch doesn't copy each note segment into the 2nd kernel since
> they amount to so much in total if there are multiple CPUs. For
> example, current maximum number of CPUs in x86_64 is 5120, where note
> segments exceed 1MB with NT_PRSTATUS only.

So you require the first kernel to reserve an additional 20MB, instead
of just 1.6MB.  336 bytes versus 4096 bytes.

That seems like completely the wrong tradeoff in memory consumption,
filesize, and backwards compatibility.
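
For reference, the two figures follow directly from one page per CPU versus
packed notes; a quick check, assuming 4 KiB pages, about 336 bytes per
NT_PRSTATUS note, and 5120 CPUs:

#include <stdio.h>

int main(void)
{
        unsigned long cpus = 5120;

        /* one full page per cpu vs. notes packed back to back */
        printf("page per cpu : %.1f MiB\n", cpus * 4096.0 / (1 << 20));
        printf("packed notes : %.2f MiB\n", cpus * 336.0 / (1 << 20));
        return 0;
}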

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-19 Thread Eric W. Biederman
Andrew Morton  writes:

> On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke 
>  wrote:
>
>> If there's some vmcore object that doesn't satisfy page-size boundary
>> requirement, remap_pfn_range() fails to remap it to user-space.
>> 
>> Objects that possibly don't satisfy the requirement are ELF note
>> segments only. The memory chunks corresponding to PT_LOAD entries are
>> guaranteed to satisfy page-size boundary requirement by the copy from
>> old memory to buffer in 2nd kernel done in later patch.
>> 
>> This patch doesn't copy each note segment into the 2nd kernel since
>> they amount to so much in total if there are multiple CPUs. For
>> example, current maximum number of CPUs in x86_64 is 5120, where note
>> segments exceed 1MB with NT_PRSTATUS only.
>
> I don't really understand this.  Why does the number of or size of
> note segments affect their alignment?
>
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -38,6 +38,8 @@ static u64 vmcore_size;
>>  
>>  static struct proc_dir_entry *proc_vmcore = NULL;
>>  
>> +static bool support_mmap_vmcore;
>
> This is quite regrettable.  It means that on some kernels/machines,
> mmap(vmcore) simply won't work.  This means that people might write
> code which works for them, but which will fail for others when deployed
> on a small number of machines.
>
> Can we avoid this?  Why can't we just copy the notes even if there are
> a large number of them?

Yes.  If it simplifies things I don't see a need to support mmapping
everything.  But even there I don't see much of an issue.

Today we allocate a buffer to hold the ELF header, the program headers and
the note segment, and we could easily allocate that buffer in such a way
to make it mmapable.
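
A sketch of what "allocate that buffer in such a way" could look like on the
kernel side; this is an illustration only, not the code the series ends up
with, and the elfcorebuf_sz/elfnotes_sz names are just placeholders:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Size the header + notes buffer in whole pages so the very same buffer
 * can back both read() and a later mmap() via remap_pfn_range(). */
static void *alloc_mmapable_buf(size_t elfcorebuf_sz, size_t elfnotes_sz)
{
        size_t sz = PAGE_ALIGN(elfcorebuf_sz + elfnotes_sz);

        return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
                                        get_order(sz));
}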

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement

2013-03-19 Thread Andrew Morton
On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke  
wrote:

> If there's some vmcore object that doesn't satisfy page-size boundary
> requirement, remap_pfn_range() fails to remap it to user-space.
> 
> Objects that possibly don't satisfy the requirement are ELF note
> segments only. The memory chunks corresponding to PT_LOAD entries are
> guaranteed to satisfy page-size boundary requirement by the copy from
> old memory to buffer in 2nd kernel done in later patch.
> 
> This patch doesn't copy each note segment into the 2nd kernel since
> they amount to so much in total if there are multiple CPUs. For
> example, current maximum number of CPUs in x86_64 is 5120, where note
> segments exceed 1MB with NT_PRSTATUS only.

I don't really understand this.  Why does the number of or size of
note segments affect their alignment?

> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -38,6 +38,8 @@ static u64 vmcore_size;
>  
>  static struct proc_dir_entry *proc_vmcore = NULL;
>  
> +static bool support_mmap_vmcore;

This is quite regrettable.  It means that on some kernels/machines,
mmap(vmcore) simply won't work.  This means that people might write
code which works for them, but which will fail for others when deployed
on a small number of machines.

Can we avoid this?  Why can't we just copy the notes even if there are
a large number of them?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

