Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Yinghai Lu
On Fri, Mar 1, 2013 at 7:03 PM, chen tang  wrote:
>
> Thank you for your suggestion and fix work. :)
> I would prefer your Plan b. But one last thing I want to confirm:
>
> Will "allocating pgdat and zone on local node" prevent node hot-removing ?
> Or is it safe to free all node data when removing a node ?
> AFAIK, no way to ensure node data is not on thread stack.

Not sure. I need to go over the code.
That is slub's limitation.

If it is not, it should be fixed.

>
> If it is OK, I think Plan B is OK, and we can improve movablemem_map more in
> the future.
>
> BTW, I didn't mean to deny your idea and work. NUMA performance is always
> understand our  consideration.
> It's just we plan it as a long way development in the future.
> movablemem_map is very important to us. And we do hope to keep it in kernel
> now, and improve it later.

That does not look like the right way to use the mainline tree to develop
new features.

You don't need to put development/testing support patches in the mainline.
Just put those support patches in your local tree.

Everyone has a bunch of development/debug/test-stub patches on their own
hard disk for their working area, but there is no need to put them into the
mainline tree.

Good practice should be:
Have the feature completely done in your local tree, then send it out as
several patchsets and get them reviewed and merged one by one.

Sometimes it turns out that your whole patchset has a problem that
cannot be fixed during review, and it has to be redesigned.

The mainline tree is NOT a testbed.

For pci-root-bus hotplug, I already had the code completely done.
Then I sent out the patchsets one by one to get them fully reviewed.
One patchset about acpi-scan was totally rewritten by Rafael, with a better
and cleaner design, after he understood our needs.
Now ioapic and iommu are still left, and those patchsets have been in
my local tree for more than 6 months and I keep optimizing them.

BTW, please do not top-post in the future.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Yinghai Lu
On Thu, Feb 28, 2013 at 11:43 PM, Yinghai Lu  wrote:
> [trim down CC list a bit]
>
> On Thu, Feb 28, 2013 at 9:00 PM, Yinghai Lu  wrote:
>>
>>
>> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>>>
>>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>> > Yingai, Andrew,
>>> >  is this ok with you two?
>>> >
>>> > Linus
>>>
>>> FWIW, it makes sense to me iff it resolves the problems
>>
>>
>> I prefer to reverting all 8 patches.
>>
>> Actually I have worked out one patch that could solve all problems, but it
>> is too intrusive that I do  not want to split it to small pieces to post
>> it.
>>
>> Leaving the movablemem_map related changes in  the upstream tree, will
>> prevent me from continuing to make memblock to be used to allocate page
>> table on local node ram for hot add.
>>
>> Will send reverting patch and putting page table on local node patch around
>> 10pm after I get home.
>
> Please check attached patches.
>
> Plan A. revert all 8 patches:
> revert_movablemem_map.patch
>
> Plan B. fix movablemem_map:
> kill_max_low_pfn_mapped.patch and fix_movablemem_map.patch
>
> fix_movablemem_map.patch is too risky, and need more test.
>
> Konrad, Stefano:
> Can you check kill_max_low_pfn_mapped.patch and fix_movablemem_map.patch
> on top of today's Linus tree to check if it breaks Xen?
>

Sorry, I missed a change in setup.c while splitting the patch.

Thanks

Yinghai


fix_movablemem_map_v2.patch
Description: Binary data


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread H. Peter Anvin
On 02/28/2013 11:55 PM, Yinghai Lu wrote:
> 
> Let me try again:
> 
> movablemem_map is broken idea or poor design.
> 

Very much so.  I have said this before: this is potentially useful
during development/testing, but anyone who expects to actually tell
their customers to use it is abusive.

-hpa



Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread H. Peter Anvin
If NUMAQ is breaking real stuff we can kill it by marking it BROKEN.  Rip-out 
is 3.10 at this stage.

Ingo Molnar  wrote:

>
>* Borislav Petkov  wrote:
>
>> On Thu, Feb 28, 2013 at 10:37:10PM -0800, H. Peter Anvin wrote:
>> > I'd be very happy to get the NUMAQ code ripped out.  I am wondering if
>> > there are any reasons to keep any 32-bit x86 NUMA code at all.
>>
>> How much would it hurt us if we said 3.8 is the last kernel that
>> supported NUMAQ?
>> If anyone wants the functionality, they should use 3.8 or older.
>
>v3.9 - any non-trivial patch in the stage of being contemplated near
>the end of the v3.9 merge window is most likely v3.10 material.
>
>Thanks,
>
>   Ingo

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Tang Chen

On 03/01/2013 03:43 PM, Yinghai Lu wrote:
> Please check attached patches.
>
> Plan A. revert all 8 patches:
>  revert_movablemem_map.patch
>
> Plan B. fix movablemem_map:
>  kill_max_low_pfn_mapped.patch and fix_movablemem_map.patch
>
> fix_movablemem_map.patch is too risky, and need more test.

Hi Yinghai,

In your Plan B, you allocated pgdat on local node, right ?

-nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+nd_pa = memblock_find_in_range_node(start, end, nd_size,
+ SMP_CACHE_BYTES, nid);

Here, right ?

Without movablemem_map, pgdat will be allocated successfully on local
node, right ?

If so, this will prevent node hot-plug, because as mentioned by
Kamezawa, there is no way to ensure pgdat is not used by others on stack.

I do hope you can stop putting pgdat and zone on local node for now. And
improve it in the future.

And I also hope you can apply my revert SRAT patch first, and then do
your work. It will seem more clean to me.

Thanks. :)


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Ingo Molnar

* Borislav Petkov  wrote:

> On Thu, Feb 28, 2013 at 10:37:10PM -0800, H. Peter Anvin wrote:
> > I'd be very happy to get the NUMAQ code ripped out.  I am wondering if
> > there are any reasons to keep any 32-bit x86 NUMA code at all.
> 
> How much would it hurt us if we said 3.8 is the last kernel that supported
> NUMAQ? If anyone wants the functionality, they should use 3.8 or older.

v3.9 - any non-trivial patch in the stage of being contemplated near the end of
the v3.9 merge window is most likely v3.10 material.

Thanks,

Ingo


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Borislav Petkov
On Thu, Feb 28, 2013 at 10:37:10PM -0800, H. Peter Anvin wrote:
> I'd be very happy to get the NUMAQ code ripped out.  I am wondering if
> there are any reasons to keep any 32-bit x86 NUMA code at all.

How much would it hurt us if we said 3.8 is the last kernel that
supported NUMAQ? If anyone wants the functionality, they should use 3.8
or older.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Ingo Molnar

* H. Peter Anvin  wrote:

> On 02/25/2013 08:51 PM, Martin Bligh wrote:
> >> Do you mean we can remove numaq x86 32bit code now?
> > 
> > Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
> > Was useful in the early days of getting NUMA up and running on Linux, but
> > is now too old to be a museum piece, really.
>
> I'd be very happy to get the NUMAQ code ripped out.  I am wondering if there
> are any reasons to keep any 32-bit x86 NUMA code at all.

Not much I suspect.

Thanks,

Ingo


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Yasuaki Ishimatsu

2013/03/01 17:02, Yinghai Lu wrote:
> On Thu, Feb 28, 2013 at 10:18 PM, Tang Chen  wrote:
>> On 03/01/2013 01:00 PM, Yinghai Lu wrote:
>>> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>>>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>>>> Yingai, Andrew,
>>>>>   is this ok with you two?
>>>>>
>>>>>   Linus
>>>>
>>>> FWIW, it makes sense to me iff it resolves the problems
>>>
>>> I prefer to reverting all 8 patches.
>>>
>>> Actually I have worked out one patch that could solve all problems, but it
>>> is too intrusive that I do  not want to split it to small pieces to
>>> post it.
>>>
>>> Leaving the movablemem_map related changes in  the upstream tree,
>>> will prevent me from continuing to make memblock to be used to allocate
>>> page table on local node ram for hot add.
>>
>> Hi Yinghai,
>>
>> Would you please give me a url to your code ?
>>
>> I don't think movablemem_map will block your work a lot. According to your
>> description, you are modifying memblock to reserve some memory for local
>> node pagetables, right ?
>
> My idea:
> current for hotadd mem, page table will from other nodes from slub.
> that is not right. that will prevent others nodes to be hot removed.

If we use your idea, pglist_data and zone are also allocated from local
node. In my understanding, pglist_data and zone cannot be deleted safely
since there is no way to guarantee that nobody use them. So it means
that all nodes cannot be hot removed.
If you develop your idea, you should consider memory hot remove.

Thanks,
Yasuaki Ishimatsu

> To fix the problem
> a. make memblock still alive after booting.
> b. or have separated dynamical memblock.
>
> second way looks more clean.
> so alloc_low_pages will get initial page for page table from low range
> with slub.
> and later will get page table from its own just mapped range.
>
> Now need to make memblock more clean and remove hardcoded reference in
> those functions.
>
> Thanks
>
> Yinghai



Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Yinghai Lu
On Thu, Feb 28, 2013 at 10:37 PM, H. Peter Anvin  wrote:
> On 02/25/2013 08:51 PM, Martin Bligh wrote:
>>> Do you mean we can remove numaq x86 32bit code now?
>>
>> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
>> Was useful in the early days of getting NUMA up and running on Linux,
>> but is now too old to be a museum piece, really.
>>
>
> I'd be very happy to get the NUMAQ code ripped out.  I am wondering if
> there are any reasons to keep any 32-bit x86 NUMA code at all.

Agreed!


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-03-01 Thread Yinghai Lu
On Thu, Feb 28, 2013 at 10:18 PM, Tang Chen  wrote:
> On 03/01/2013 01:00 PM, Yinghai Lu wrote:
>>
>> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>>
>>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>>> Yingai, Andrew,
>>>>   is this ok with you two?
>>>>
>>>>   Linus
>>>
>>>
>>> FWIW, it makes sense to me iff it resolves the problems
>>
>>
>>
>> I prefer to reverting all 8 patches.
>>
>> Actually I have worked out one patch that could solve all problems, but it
>> is too intrusive that I do  not want to split it to small pieces to
>> post it.
>>
>> Leaving the movablemem_map related changes in  the upstream tree,
>> will prevent me from continuing to make memblock to be used to allocate
>> page table on local node ram for hot add.
>
>
> Hi Yinghai,
>
> Would you please give me a url to your code ?
>
> I don't think movablemem_map will block your work a lot. According to your
> description, you are modifying memblock to reserve some memory for local
> node pagetables, right ?

My idea:
Currently, for hot-added memory, the page tables come from other nodes via slub.
That is not right; it prevents those other nodes from being hot-removed.

To fix the problem:
a. keep memblock alive after booting, or
b. have a separate dynamic memblock.

The second way looks cleaner.
Then alloc_low_pages will get the initial pages for the page table from the
low range with slub, and later will get page-table pages from its own
just-mapped range.

Now I need to clean up memblock and remove the hardcoded references in
those functions.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Yinghai Lu
On Thu, Feb 28, 2013 at 10:02 PM, Yasuaki Ishimatsu
 wrote:
> 2013/03/01 14:00, Yinghai Lu wrote:
>
> Original issue occurs by two patches. And it is fixed by Tang's reverting
> patch. So other patches are obviously unrelated to original problem. Thus
> there is no reason to revert all patches related with movablemem_map.
>
> If there is a reason, movablemem_map patches prevent only your work.
>
> If you keep on developing your work, you should develop it in consideration
> of those patches.

Let me try again:

movablemem_map is broken idea or poor design.

It just pushes kernel memory down from the local node to some other place.

It is ridiculous to make users specify memory ranges on the command line to
get memory hotplug working.
Think about different memory layout configurations; that will drive customers
crazy.
Not to mention the performance impact of putting NUMA data low.

The right way, or good practice, is:
Find the kernel memory that cannot be moved, and either put it low
or put it in local node RAM.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread H. Peter Anvin
On 02/25/2013 08:51 PM, Martin Bligh wrote:
>> Do you mean we can remove numaq x86 32bit code now?
> 
> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
> Was useful in the early days of getting NUMA up and running on Linux,
> but is now too old to be a museum piece, really.
> 

I'd be very happy to get the NUMAQ code ripped out.  I am wondering if
there are any reasons to keep any 32-bit x86 NUMA code at all.

-hpa




Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Tang Chen

On 03/01/2013 01:00 PM, Yinghai Lu wrote:
> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>> Yingai, Andrew,
>>>   is this ok with you two?
>>>
>>>   Linus
>>
>> FWIW, it makes sense to me iff it resolves the problems
>
> I prefer to reverting all 8 patches.
>
> Actually I have worked out one patch that could solve all problems, but it
> is too intrusive that I do  not want to split it to small pieces to
> post it.
>
> Leaving the movablemem_map related changes in  the upstream tree,
> will prevent me from continuing to make memblock to be used to allocate
> page table on local node ram for hot add.

Hi Yinghai,

Would you please give me a url to your code ?

I don't think movablemem_map will block your work a lot. According to your
description, you are modifying memblock to reserve some memory for local
node pagetables, right ?

If so, I think it won't be too difficult to make the code OK with your work.

Thanks. :)

> Will send reverting patch and putting page table on local node patch around
> 10pm after I get home.
>
> Thanks



Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Yasuaki Ishimatsu

2013/03/01 14:00, Yinghai Lu wrote:
> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>
>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>> Yinghai, Andrew,
>>>   is this ok with you two?
>>>
>>>  Linus
>>
>> FWIW, it makes sense to me iff it resolves the problems
>
> I prefer reverting all 8 patches.
>
> Actually I have worked out one patch that could solve all problems, but it
> is so intrusive that I do not want to split it into small pieces to
> post it.
>
> Leaving the movablemem_map related changes in the upstream tree
> will prevent me from continuing to make memblock usable for allocating
> page tables on local node RAM for hot add.

The original issue is caused by two patches, and it is fixed by Tang's
reverting patch. So the other patches are obviously unrelated to the original
problem. Thus there is no reason to revert all patches related to
movablemem_map.

If there is a reason, it is only that the movablemem_map patches hinder your
work.

If you keep on developing your work, you should develop it with those patches
in mind.

Thanks,
Yasuaki Ishimatsu

> Will send reverting patch and putting page table on local node patch around
> 10pm after I get home.
>
> Thanks






Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread H. Peter Anvin
On 02/28/2013 08:32 PM, Linus Torvalds wrote:
> Yinghai, Andrew,
>  is this ok with you two?
> 
> Linus

FWIW, it makes sense to me iff it resolves the problems.

-hpa




Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Andrew Morton
On Thu, 28 Feb 2013 20:32:15 -0800 Linus Torvalds wrote:

> Yinghai, Andrew,
>  is this ok with you two?

If it works.  I haven't tested it yet!  Ordinarily I'd give it a few
days for -next testing and to let Fengguang's testbot chew on it. 


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Linus Torvalds
Yinghai, Andrew,
 is this ok with you two?

Linus

On Thu, Feb 28, 2013 at 7:46 PM, Tang Chen wrote:
> Hi Linus,
>
> Please refer to the attached patch.
>
> This patch reverts only the following two patches.
>
> commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
>     acpi, memory-hotplug: support getting hotplug info from SRAT
> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>     acpi, memory-hotplug: parse SRAT before memblock is ready
>
> Without these two patches, users can use "movablemem_map=nn[KMG]@ss[KMG]"
> correctly, and it causes no problem.
>
> And of course, the kernel will work as before if users don't use
> "movablemem_map=nn[KMG]@ss[KMG]".
>
> I do hope we can keep "movablemem_map=nn[KMG]@ss[KMG]" in 3.9.
>
> We are working on fixing the SRAT problems, and we aim to push the SRAT
> related patches in 3.10. We will also keep improving the
> "movablemem_map=nn[KMG]@ss[KMG]" functionality in the future.
>
> Thanks. :)
>
> On 03/01/2013 11:13 AM, Linus Torvalds wrote:
>> On Wed, Feb 27, 2013 at 1:26 PM, Andrew Morton wrote:
>>>
>>> So I'm thinking that the best approach here is to revert everything and
>>> then try again for 3.10-rc1.  This gives people time to test the code
>>> while it's only in linux-next.  (Hint!)
>>
>> I'd prefer to revert too by now - the bug seems to be known, and
>> apparently it's not a trivial fix. We're getting close to the end of
>> the merge window, and it's still being discussed, it clearly wasn't
>> really fully cooked.
>>
>> Can we agree on some minimal set of reverts? Can somebody send me a
>> patch with the revert and the commit explanation for the revert?
>> Yinghai? Or I can do the reverts too if just the exact set of commits
>> is clear, but I'd rather get it from somebody who sees and understands
>> the problem, and can test the state afterwards..
>>
>> Linus
>>
>


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Tang Chen

Hi Linus,

Please refer to the attached patch.

This patch reverts only the following two patches.

commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
    acpi, memory-hotplug: support getting hotplug info from SRAT
commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
    acpi, memory-hotplug: parse SRAT before memblock is ready

Without these two patches, users can use "movablemem_map=nn[KMG]@ss[KMG]"
correctly, and it causes no problem.

And of course, the kernel will work as before if users don't use
"movablemem_map=nn[KMG]@ss[KMG]".

I do hope we can keep "movablemem_map=nn[KMG]@ss[KMG]" in 3.9.

We are working on fixing the SRAT problems, and we aim to push the SRAT
related patches in 3.10. We will also keep improving the
"movablemem_map=nn[KMG]@ss[KMG]" functionality in the future.

Thanks. :)

On 03/01/2013 11:13 AM, Linus Torvalds wrote:
> On Wed, Feb 27, 2013 at 1:26 PM, Andrew Morton wrote:
>>
>> So I'm thinking that the best approach here is to revert everything and
>> then try again for 3.10-rc1.  This gives people time to test the code
>> while it's only in linux-next.  (Hint!)
>
> I'd prefer to revert too by now - the bug seems to be known, and
> apparently it's not a trivial fix. We're getting close to the end of
> the merge window, and it's still being discussed, it clearly wasn't
> really fully cooked.
>
> Can we agree on some minimal set of reverts? Can somebody send me a
> patch with the revert and the commit explanation for the revert?
> Yinghai? Or I can do the reverts too if just the exact set of commits
> is clear, but I'd rather get it from somebody who sees and understands
> the problem, and can test the state afterwards..
>
> Linus

From 2e859dc212ce13fb812da6f971409a0518914574 Mon Sep 17 00:00:00 2001
From: Tang Chen
Date: Thu, 28 Feb 2013 10:43:51 +0900
Subject: [PATCH] x86, ACPI, mm: Revert SRAT support from movablemem_map boot option.

The following two commits support getting info from SRAT to determine
which memory is hot-pluggable, also known as the "movablemem_map=srat" boot
option.

	commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
		acpi, memory-hotplug: support getting hotplug info from SRAT
	commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
		acpi, memory-hotplug: parse SRAT before memblock is ready

We need to know SRAT info before memblock is ready, so that we can
prevent memblock from allocating movable memory.

To achieve this, we moved the SRAT parsing code earlier in these patches. But
that broke the ACPI_INITRD_TABLE_OVERRIDE functionality and the fallback path
of numa_init().

So we revert these two commits for now. After that, users can only use
"movablemem_map=nn[KMG]@ss[KMG]".

NOTE:
1) It is OK to revert only these two patches. The core problems mentioned by
   Lu Yinghai:
   1. numa_init is called several times, NOT just for srat. so those
        nodes_clear(numa_nodes_parsed)
        memset(&numa_meminfo, 0, sizeof(numa_meminfo))
      can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
      and make fall back path working.
   2. simply split acpi_numa_init to early_parse_srat.
      a. that early_parse_srat is NOT called for ia64, so you break ia64.
      b. for (i = 0; i < MAX_LOCAL_APIC; i++)
             set_apicid_to_node(i, NUMA_NO_NODE)
         still left in numa_init. So it will just clear result from early_parse_srat.
         it should be moved before that
      c. it breaks ACPI_TABLE_OVERRIDE...as the acpi table scan is moved
         early before override from INITRD is settled.

   They are caused by moving SRAT parsing earlier. And "movablemem_map=nn[KMG]@ss[KMG]"
   causes no harm to the kernel.

2) With these two patches reverted, memblock will start to work before we parse SRAT,
   which means we won't know the end address of each node early enough.

   For example:
   If one node has memory [10G, 20G), and the user specifies [15G, 16G), we cannot extend
   it to [15G, 20G). So memblock could still have a chance to allocate memory from
   [16G, 20G) for the kernel, which is non-movable.

   As a result, users can only use this option in a very limited way:
   they should specify memory ranges that extend to the end of each node.

Reported-by: Tim Gardner 
Reported-by: Don Morris 
Bisected-by: Don Morris 
Reported-by: Yinghai Lu 
Signed-off-by: Tang Chen 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Andrew Morton 
Cc: Tony Luck 
Cc: Thomas Renninger 
Cc: Tejun Heo 
Cc: Tang Chen 
Cc: Yasuaki Ishimatsu 
---
 Documentation/kernel-parameters.txt |   29 ++
 arch/x86/kernel/setup.c             |   13 ++
 arch/x86/mm/numa.c                  |    6 +--
 arch/x86/mm/srat.c                  |   71 ++
 drivers/acpi/numa.c                 |   23 +--
 include/linux/acpi.h                |    8
 include/linux/mm.h                  |    2 -
 mm/page_alloc.c                     |   22 +--
 8 files changed, 27 insertions(+), 147 deletions(-)
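
The limitation in note 2 of the commit message above can be made concrete with
a small user-space sketch. It is not kernel code: parse_size(),
parse_movablemem_map() and uncovered_tail() are hypothetical helpers, and it
assumes that, as with memmap=nn[KMG]@ss[KMG], nn is the size and ss is the
start address. It parses one option value and computes the part of a node that
the user range leaves open to memblock.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GiB (1ULL << 30)

/* Half-open physical address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Parse a number with an optional K/M/G suffix; advance *s past it. */
static uint64_t parse_size(const char **s)
{
	char *endp;
	uint64_t val = strtoull(*s, &endp, 0);

	switch (*endp) {
	case 'G': case 'g':
		val <<= 10; /* fall through */
	case 'M': case 'm':
		val <<= 10; /* fall through */
	case 'K': case 'k':
		val <<= 10;
		endp++;
		break;
	default:
		break;
	}
	*s = endp;
	return val;
}

/* Parse "nn[KMG]@ss[KMG]" (assumed: size@start). Returns 0, or -1 on error. */
static int parse_movablemem_map(const char *arg, struct range *out)
{
	const char *p = arg;
	const char *q;
	uint64_t size, start;

	size = parse_size(&p);
	if (p == arg || size == 0 || *p != '@')
		return -1;
	q = ++p;
	start = parse_size(&p);
	if (p == q || *p != '\0')
		return -1;
	out->start = start;
	out->end = start + size;
	return 0;
}

/*
 * Part of the node above the user range. Without SRAT info at this point,
 * nothing extends the user range to the node end, so this tail stays
 * available for non-movable kernel allocations.
 */
static struct range uncovered_tail(struct range node, struct range user)
{
	struct range tail = { 0, 0 };

	assert(node.start < node.end);
	if (user.start < node.end && user.end > node.start &&
	    user.end < node.end) {
		tail.start = user.end;
		tail.end = node.end;
	}
	return tail;
}
```

With a node covering [10G, 20G) and the value "1G@15G", uncovered_tail()
yields [16G, 20G), the range the commit message warns about. A value of
"5G@15G" reaches the node end and leaves an empty tail.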

diff --git 
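
Note 1.1 of the commit message above says the state resets cannot simply be
dropped, because numa_init() runs once per detection method (numaq, srat, amd,
dummy). The following is a minimal user-space sketch of that fallback pattern,
not the kernel implementation: nodes_parsed and meminfo are simplified
stand-ins for numa_nodes_parsed and numa_meminfo, and the two init methods are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for numa_nodes_parsed and numa_meminfo. */
static unsigned long nodes_parsed;	/* bitmask of parsed node ids */
static struct {
	int nr_blks;
} meminfo;

typedef int (*numa_init_fn)(void);

/* Hypothetical method that records partial state and then fails. */
static int srat_init_fails(void)
{
	nodes_parsed |= 1UL << 3;
	meminfo.nr_blks = 2;
	return -1;
}

/* Hypothetical last-resort method: a single node covering everything. */
static int dummy_init(void)
{
	nodes_parsed |= 1UL << 0;
	meminfo.nr_blks += 1;
	return 0;
}

/* Reset the shared state, then run one detection method. */
static int try_method(numa_init_fn fn)
{
	assert(fn != NULL);
	nodes_parsed = 0;
	memset(&meminfo, 0, sizeof(meminfo));
	return fn();
}

/* Try the methods in order; return the index of the first success, or -1. */
static int init_with_fallback(const numa_init_fn *methods, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (try_method(methods[i]) == 0)
			return (int)i;
	return -1;
}
```

If try_method() skipped the reset, the node bit and block count left behind by
the failed method would leak into the result of the fallback method, which is
the failure mode the note describes.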

Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Linus Torvalds
On Wed, Feb 27, 2013 at 1:26 PM, Andrew Morton wrote:
>
> So I'm thinking that the best approach here is to revert everything and
> then try again for 3.10-rc1.  This gives people time to test the code
> while it's only in linux-next.  (Hint!)

I'd prefer to revert too by now - the bug seems to be known, and
apparently it's not a trivial fix. We're getting close to the end of
the merge window, and it's still being discussed, it clearly wasn't
really fully cooked.

Can we agree on some minimal set of reverts? Can somebody send me a
patch with the revert and the commit explanation for the revert?
Yinghai? Or I can do the reverts too if just the exact set of commits
is clear, but I'd rather get it from somebody who sees and understands
the problem, and can test the state afterwards..

   Linus


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Tang Chen

On 03/01/2013 12:07 AM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 11:44 PM, Tang Chen wrote:
>>
>> Sorry, if you want to revert, you just need to revert:
>>
>>  commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>>   acpi, memory-hotplug: parse SRAT before memblock is ready
>>  commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
>>   acpi, memory-hotplug: support getting hotplug info from SRAT
>>
>> The other two have nothing to do with SRAT. And they are necessary.
>>
>> Seeing from the code, I think it is clean. But we'd better test it.
>
> We should revert them all.
>
> as
>
> commit fb06bc8e5f42f38c011de0e59481f464a82380f6
> Author: Tang Chen
> Date:   Fri Feb 22 16:33:42 2013 -0800
>
>     page_alloc: bootmem limit with movablecore_map
>
> It is totally misleading in the TITLE. Come on, what is movablecore_map?
>
> It actually uses movablemem_map to exclude some ranges during
> memblock_find_in_range.
>
> That makes memblock less generic.
>
> That patch is the base of the whole patchset.
>
> Also you and Yasuaki keep saying: movablemem_map=srat.
> But where are the doc and code for it?
> Looks like there is only movablemem_map=acpi.

Hi Yinghai,

I think I forgot to change the title when merging the related bugfix patches
into one. And yes, movablecore_map has been changed to movablemem_map.

How about this:

For now, let's revert the SRAT related patches, and keep
movablemem_map=nn[KMG]@ss[KMG].

About the SRAT thing, we have the following solution:

1) keep the original init sequence, parse ACPI tables and modify global
   variables as before
2) introduce a new function to obtain SRAT info earlier, store the info
   somewhere, and touch no NUMA-related state
3) use the info to do the movablemem_map work, and free it when it is done

In this way, we keep our code isolated from the NUMA code, and NUMA will
be initialized as before. This can be done in one week or faster. I'll cc
the x86 guys, and they can choose when to merge the new code.

As for the movablemem_map=nn[KMG]@ss[KMG] code, it does no harm to the
kernel. We have documented that using this option will reduce NUMA
performance. Users who don't want to lose NUMA performance can boot the
kernel without this option, and the kernel will work as before.

I do hope we can keep the code in 3.9, and do more improvement in the
future.

So please just revert the two SRAT related patches.

Thanks. :)

> I'm upset by this patchset.
>
> Next time, please get Ack from TJ or Ben when you touch memblock code.
> And at least make sure the TITLE is right.
>
> Thanks
>
> Yinghai
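
The three-step plan in Tang Chen's reply above (collect SRAT info early into
separate storage, leave NUMA state alone, free the info after use) might look
roughly like the user-space sketch below. Everything in it is hypothetical:
struct srat_mem_entry is a simplified stand-in for a SRAT memory affinity
entry, the function names are invented for illustration, and calloc() stands
in for whatever early allocator the kernel would actually use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified view of one SRAT memory affinity entry. */
struct srat_mem_entry {
	uint64_t base;
	uint64_t length;
	int hotpluggable;
};

/* Half-open physical address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Private storage, kept separate from any NUMA global state. */
static struct range *early_hotplug;
static size_t early_hotplug_nr;

/* Step 2: copy only the hot-pluggable ranges. Returns the count, or -1. */
static int early_collect_hotplug(const struct srat_mem_entry *tbl, size_t n)
{
	size_t i;

	assert(early_hotplug == NULL);	/* collect at most once */
	early_hotplug = calloc(n ? n : 1, sizeof(*early_hotplug));
	if (!early_hotplug)
		return -1;
	for (i = 0; i < n; i++) {
		if (!tbl[i].hotpluggable)
			continue;
		early_hotplug[early_hotplug_nr].start = tbl[i].base;
		early_hotplug[early_hotplug_nr].end = tbl[i].base + tbl[i].length;
		early_hotplug_nr++;
	}
	return (int)early_hotplug_nr;
}

/* Step 3a: the consumer asks whether an address lies in a collected range. */
static int addr_is_hotpluggable(uint64_t addr)
{
	size_t i;

	for (i = 0; i < early_hotplug_nr; i++)
		if (addr >= early_hotplug[i].start && addr < early_hotplug[i].end)
			return 1;
	return 0;
}

/* Step 3b: drop the private copy once the consumer is done with it. */
static void early_hotplug_release(void)
{
	free(early_hotplug);
	early_hotplug = NULL;
	early_hotplug_nr = 0;
}
```

Because the collected ranges live in their own array, the regular NUMA init
path would still parse the tables and set up its globals exactly as before,
which is the isolation the plan is after.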




Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 11:44 PM, Tang Chen wrote:
>
> Sorry, if you want to revert, you just need to revert:
>
>  commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>   acpi, memory-hotplug: parse SRAT before memblock is ready
>  commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
>   acpi, memory-hotplug: support getting hotplug info from SRAT
>
> The other two have nothing to do with SRAT. And they are necessary.
>
> Seeing from the code, I think it is clean. But we'd better test it.

We should revert them all.

as

commit fb06bc8e5f42f38c011de0e59481f464a82380f6
Author: Tang Chen
Date:   Fri Feb 22 16:33:42 2013 -0800

    page_alloc: bootmem limit with movablecore_map

It is totally misleading in the TITLE. Come on, what is movablecore_map?

It actually uses movablemem_map to exclude some ranges during
memblock_find_in_range.

That makes memblock less generic.

That patch is the base of the whole patchset.

Also you and Yasuaki keep saying: movablemem_map=srat.
But where are the doc and code for it?
Looks like there is only movablemem_map=acpi.

I'm upset by this patchset.

Next time, please get Ack from TJ or Ben when you touch memblock code.
And at least make sure the TITLE is right.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Tang Chen

Hi Andrew,

On 02/28/2013 05:26 AM, Andrew Morton wrote:
>> Thank you all for addressing the bug. We are on the way to fixing it.
>
> How long do you think this will take?

I think we need one week to solve these problems. I do hope we can make
the merge window for 3.9.

Thanks. :)


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Tang Chen

On 03/01/2013 01:00 PM, Yinghai Lu wrote:
> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>>
>>> Yinghai, Andrew,
>>>   is this ok with you two?
>>>
>>>  Linus
>>
>> FWIW, it makes sense to me iff it resolves the problems
>
> I prefer reverting all 8 patches.
>
> Actually I have worked out one patch that could solve all problems, but it
> is so intrusive that I do not want to split it into small pieces to
> post it.
>
> Leaving the movablemem_map related changes in the upstream tree
> will prevent me from continuing to make memblock usable for allocating
> page tables on local node RAM for hot add.

Hi Yinghai,

Would you please give me a URL to your code?

I don't think movablemem_map will block your work a lot. According to your
description, you are modifying memblock to reserve some memory for local
node page tables, right?

If so, I think it won't be too difficult to make the code work with yours.

Thanks. :)

> Will send reverting patch and putting page table on local node patch around
> 10pm after I get home.
>
> Thanks




Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread H. Peter Anvin
On 02/25/2013 08:51 PM, Martin Bligh wrote:
>> Do you mean we can remove numaq x86 32bit code now?
>
> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
> Was useful in the early days of getting NUMA up and running on Linux,
> but is now too old to be a museum piece, really.
>

I'd be very happy to get the NUMAQ code ripped out.  I am wondering if
there are any reasons to keep any 32-bit x86 NUMA code at all.

-hpa




Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-28 Thread Yinghai Lu
On Thu, Feb 28, 2013 at 10:02 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:
> 2013/03/01 14:00, Yinghai Lu wrote:
>
> The original issue is caused by two patches, and it is fixed by Tang's
> revert patch. The other patches are therefore unrelated to the original
> problem, so there is no reason to revert all of the movablemem_map patches.
>
> If there is any reason, it is only that the movablemem_map patches get in
> the way of your work.
>
> If you continue that work, you should develop it with those patches in
> mind.

Let me try again:

movablemem_map is a broken idea or a poor design.

It just pushes kernel memory down from the local node to some other place.

It is ridiculous to make the user specify a memory range on the command
line to make memory hotplug work.
Think about different memory layout configurations; that will drive
customers crazy.
Not to mention the performance cost of putting NUMA data low.

The right way, or good practice, is:
Find out which kernel memory cannot be moved, and either put it low
or put it in local node RAM.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-27 Thread Andrew Morton
On Wed, 27 Feb 2013 16:00:36 +0800
Lai Jiangshan  wrote:

> In the mails and the changelog of the revert patch, I think Yinghai
> mainly worries about 3 problems.
> 
> 1) The current implementation has bugs and bad code.
> 
>   Yes. Any bug should be fixed. We should fix it directly, or
>   we can revert the related patches and then send the fixed patches.
> 
>   But only one or two patches are involved, so it is not a good idea
>   to revert the whole patchset or the whole feature. Right?

Reverting a new patchset isn't really a big deal.  The patchset gets
fixed up, retested then reapplied.  We like to do things this way
because it minimises the amount of trouble which the regression is
causing other people.

Reverting one or two patches from a fairly large and complex patchset
sounds risky - we're putting an untested patch combination straight
into mainline with minimal testing.  It would be safer to revert
everything.

So I'm thinking that the best approach here is to revert everything and
then try again for 3.10-rc1.  This gives people time to test the code
while it's only in linux-next.  (Hint!)

>   Thank you all for addressing the bug. We are on the way to fixing it.

How long do you think this will take?

> 2) A lot of memory could be placed in hotpluggable memory but has not been
>    moved there yet, e.g. vmemmap, some page tables, etc.
> 
>   This is a restriction of the current kernel; we can't convert it all
>   quickly. We must convert it step by step. For example, we are converting
>   the memory of page_cgroup to hotpluggable memory.
> 
> 
> 3) If the user (or firmware) specifies too little un-hotpluggable memory,
>    the system can't work, or can't even boot.
> 
>   Any feature/system has its own minimum requirements; the user should
>   meet the requirements and specify more un-hotpluggable memory.
>   So I don't think it is a problem in kernel land.
> 
>   But problem 2) above makes this feature's "minimum requirements"
>   much higher. That is the real thing Yinghai worries about.
> 
>   But all systems which use this feature can meet this higher requirement
>   very easily. The users should specify enough un-hotpluggable memory,
>   both before and after we decrease the "minimum requirements".
> 
>   The whole feature works very well if the user specifies enough
>   un-hotpluggable memory. So problems 2) and 3) are not urgent
>   problems.

Yes, let's not mingle concepts.  From a feature perspective we've
always understood that 3.9 memory hotplug would be "has limitations,
needs work, but better than it was before".  Let's consider that
separately from "your patchset broke my kernel".



RE: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-27 Thread Luck, Tony
>   b. it will be freed to slub before run time.
>   like init code and initrd disk.

If this is a problem - I'd be inclined to disable the code that frees it. It's
only a few hundred KB of code, and possibly a few MB of initrd. Too small to
worry about on a hot pluggable server.

> In that case, so they should just boot system with numa=off.

But we will still care about NUMA locality.

-Tony


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-27 Thread Yinghai Lu
On Wed, Feb 27, 2013 at 8:28 AM, Luck, Tony  wrote:
>> assume first cpu only have 1G ram, and other 31 socket will have bunch of ram
>
> That doesn't seem to be a very realistic assumption. Can you even still buy 1G
> DIMMs for servers?   I'd think that a minimum would be to have each of four
> channels populated with a 4G DIMM - so 16GB on first cpu. But even that feels
> rather low.

We could use memmap= to exclude mem, right?

>
> I think that making sure that the system can boot is good (and maybe it should
> ignore/override[*] parameters that would prevent booting). But let's be
> realistic about the cases we actually have to deal with (before somebody
> comes and talks about systems with just 16MB).

About making memory hotplug work:
1. Find out which RAM is used by the kernel early on.
2. Check whether
   a. it holds kernel code that will not be moved,
   like real_mode.
   b. it will be freed to slub before run time,
   like init code and initrd disk.
   c. it is on local node RAM and so will not prevent mem hot-remove,
   like page table and vmemmap.
   Currently we already have vmemmap and node_data on the local node.
   We may need to put the page table on the local node too, or just put
   the page table on the local node that the kernel is on.
   d. it could be anywhere, and could be moved down after
   slub is ready.
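
The checks in step 2 could be sketched as a small classifier. This is a
minimal Python illustration, not kernel code: the Kind categories mirror
items a-d above, while the Region fields and the blocks_hot_remove() helper
are hypothetical names, and treating freed-at-boot memory as harmless is an
assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    PINNED = auto()         # (a) never moved, e.g. real-mode code
    FREED_AT_BOOT = auto()  # (b) released before run time, e.g. init code, initrd
    NODE_LOCAL = auto()     # (c) per-node data, e.g. page tables, vmemmap
    RELOCATABLE = auto()    # (d) can be moved down once the allocator is up


@dataclass
class Region:
    name: str
    kind: Kind
    node: int       # node whose RAM holds the region
    describes: int  # node the data is about


def blocks_hot_remove(region: Region, node: int) -> bool:
    """Return True if `region` would keep `node` from being hot-removed."""
    if region.node != node:
        return False  # the region lives on some other node
    if region.kind in (Kind.FREED_AT_BOOT, Kind.RELOCATABLE):
        return False  # gone, or movable, by run time
    if region.kind is Kind.NODE_LOCAL:
        # Data about the node itself disappears together with the node.
        return region.describes != node
    return True  # pinned kernel memory sitting on this node
```

With this model, a vmemmap placed on the node it describes does not block
removing that node, while real-mode code on the same node would.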

The movablemem_map patchset prevents the kernel from using memory from the local node.



In that case, so they should just boot system with numa=off.

Thanks

Yinghai


RE: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-27 Thread Luck, Tony
> assume first cpu only have 1G ram, and other 31 socket will have bunch of ram

That doesn't seem to be a very realistic assumption. Can you even still buy 1G
DIMMs for servers?   I'd think that a minimum would be to have each of four
channels populated with a 4G DIMM - so 16GB on first cpu. But even that feels
rather low.

I think that making sure that the system can boot is good (and maybe it should
ignore/override[*] parameters that would prevent booting). But let's be
realistic about the cases we actually have to deal with (before somebody
comes and talks about systems with just 16MB).

-Tony

[*] with some noisy warnings in the console log
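
A minimal sketch of that override idea, in Python for brevity: if the
requested movable size would leave less than some floor of kernel-usable
memory, shrink it and warn loudly instead of failing to boot. The function
clamp_movable(), the min_unmovable floor and the "boot" logger name are all
hypothetical; nothing here is taken from the kernel.

```python
import logging

log = logging.getLogger("boot")  # hypothetical logger name


def clamp_movable(total_ram: int, requested: int, min_unmovable: int) -> int:
    """Return the movable size to honour, keeping `min_unmovable` for the kernel."""
    limit = max(total_ram - min_unmovable, 0)
    if requested > limit:
        # Override instead of failing, but make noise about it.
        log.warning(
            "movable request %d exceeds limit %d, overriding", requested, limit
        )
        return limit
    return requested
```

A request that fits is returned unchanged; one that would starve the kernel
is cut back to whatever remains above the floor, possibly zero.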


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-27 Thread Don Morris
On 02/27/2013 12:11 AM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
>  wrote:
>> 2013/02/27 13:04, Yinghai Lu wrote:
>>>
>>> On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
>>>  wrote:

>>>> 2013/02/27 11:30, Yinghai Lu wrote:
>>>>>
>>>>> Do you mean you can not boot one socket system with 1G ram ?
>>>>> Assume socket 0 does not support hotplug, other 31 sockets support hot
>>>>> plug.
>>>>>
>>>>> So we could boot system only with socket0, and later one by one hot
>>>>> add other cpus.
>>>>
>>>> In this case, system can boot. But other cpus with bunch of ram hot
>>>> plug may fails, since system does not have enough memory for cover
>>>> hot added memory. When hot adding memory device, kernel object for the
>>>> memory is allocated from 1G ram since hot added memory has not been
>>>> enabled.

>>>
>>> yes, it may fail, if the one node memory need page table and vmemmap
>>> is more than 1g ...
>>>
>>
>>> for hot add memory we need to
>>> 1. add another wrapper for init_memory_mapping, just like
>>> init_mem_mapping() for booting path.
>>> 2. we need make memblock more generic, so we can use it with hot add
>>> memory during runtime.
>>> 3. with that we can initialize page table for hot added node with ram.
>>> a. initial page table for 2M near node top is from node0 ( that does
>>> not support hot plug).
>>> b. then will use 2M for memory below node top...
>>> c. with that we will make sure page table stay on local node.
>>>   alloc_low_pages need to be updated to support that.
>>> 4. need to make sure vmemmap on local node too.
>>
>>
>> I think so too. By this, memory hot plug becomes more useful.
>>
>>>
>>> so hot-remove node will work too later.
>>>
>>> In the long run, we should make booting path and hot adding more
>>> similar and share at most code.
>>> That will make code get more test coverage.
> 
> Tang,  Yasuaki, Andrew,
> 
> Please check if you are ok with attached reverting patch.
> 
> Tim, Don,
> Can you try if attached reverting patch fix all the problems for you ?

I'm sure from the discussion on how to leave in memory hotplug it
likely won't be just a clean reversion, but as a data point -- yes,
this patch does remove the problem as expected (and I don't see
any new ones at first glance... though I'm not trying hotplug yet
obviously).

Thanks,
Don Morris




Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Lai Jiangshan
On 02/27/2013 01:11 PM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
>  wrote:
>> 2013/02/27 13:04, Yinghai Lu wrote:
>>>
>>> On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
>>>  wrote:

>>>> 2013/02/27 11:30, Yinghai Lu wrote:
>>>>>
>>>>> Do you mean you can not boot one socket system with 1G ram ?
>>>>> Assume socket 0 does not support hotplug, other 31 sockets support hot
>>>>> plug.
>>>>>
>>>>> So we could boot system only with socket0, and later one by one hot
>>>>> add other cpus.
>>>>
>>>> In this case, system can boot. But other cpus with bunch of ram hot
>>>> plug may fails, since system does not have enough memory for cover
>>>> hot added memory. When hot adding memory device, kernel object for the
>>>> memory is allocated from 1G ram since hot added memory has not been
>>>> enabled.

>>>
>>> yes, it may fail, if the one node memory need page table and vmemmap
>>> is more than 1g ...
>>>
>>
>>> for hot add memory we need to
>>> 1. add another wrapper for init_memory_mapping, just like
>>> init_mem_mapping() for booting path.
>>> 2. we need make memblock more generic, so we can use it with hot add
>>> memory during runtime.
>>> 3. with that we can initialize page table for hot added node with ram.
>>> a. initial page table for 2M near node top is from node0 ( that does
>>> not support hot plug).
>>> b. then will use 2M for memory below node top...
>>> c. with that we will make sure page table stay on local node.
>>>   alloc_low_pages need to be updated to support that.
>>> 4. need to make sure vmemmap on local node too.
>>
>>
>> I think so too. By this, memory hot plug becomes more useful.
>>
>>>
>>> so hot-remove node will work too later.
>>>
>>> In the long run, we should make booting path and hot adding more
>>> similar and share at most code.
>>> That will make code get more test coverage.
> 
> Tang,  Yasuaki, Andrew,
> 
> Please check if you are ok with attached reverting patch.
> 
> Tim, Don,
> Can you try if attached reverting patch fix all the problems for you ?
> 


Hi, Yinghai, Andrew

In the mails and the changelog of the revert patch, I think Yinghai
mainly worries about 3 problems.

1) The current implementation has bugs and bad code.

   Yes. Any bug should be fixed. We should fix it directly, or
   we can revert the related patches and then send the fixed patches.

   But only one or two patches are involved, so it is not a good idea
   to revert the whole patchset or the whole feature. Right?

   Thank you all for addressing the bug. We are on the way to fixing it.

2) A lot of memory could be placed in hotpluggable memory but has not been
   moved there yet, e.g. vmemmap, some page tables, etc.

   This is a restriction of the current kernel; we can't convert it all
   quickly. We must convert it step by step. For example, we are converting
   the memory of page_cgroup to hotpluggable memory.


3) If the user (or firmware) specifies too little un-hotpluggable memory,
   the system can't work, or can't even boot.

   Any feature/system has its own minimum requirements; the user should
   meet the requirements and specify more un-hotpluggable memory.
   So I don't think it is a problem in kernel land.

   But problem 2) above makes this feature's "minimum requirements"
   much higher. That is the real thing Yinghai worries about.

   But all systems which use this feature can meet this higher requirement
   very easily. The users should specify enough un-hotpluggable memory,
   both before and after we decrease the "minimum requirements".

   The whole feature works very well if the user specifies enough
   un-hotpluggable memory. So problems 2) and 3) are not urgent
   problems.
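
As a rough feel for that minimum requirement, the sketch below estimates the
vmemmap and last-level page-table cost of a block of memory. It is only
back-of-the-envelope Python: a 64-byte struct page is assumed (typical for
x86-64 but config dependent), upper page-table levels and all other kernel
objects are ignored, and metadata_bytes() is a hypothetical helper.

```python
PAGE_SIZE = 4096   # base page size on x86-64
STRUCT_PAGE = 64   # assumed sizeof(struct page); varies with kernel config
PTE_SIZE = 8       # bytes per page-table entry on x86-64
GIB = 1 << 30


def metadata_bytes(mem_bytes: int, map_page: int = 2 << 20) -> int:
    """Rough vmemmap plus last-level page-table bytes needed for `mem_bytes`.

    `map_page` is the page size used for the direct mapping (2 MiB here).
    """
    vmemmap = (mem_bytes // PAGE_SIZE) * STRUCT_PAGE
    page_tables = (mem_bytes // map_page) * PTE_SIZE
    return vmemmap + page_tables


per_gib = metadata_bytes(GIB)       # cost of describing 1 GiB
for_64g = metadata_bytes(64 * GIB)  # cost of describing 64 GiB
```

Under these assumptions each GiB costs about 16 MiB of metadata, so
describing 64 GiB of hot-added memory already needs slightly more than
1 GiB if none of it can live on the new node.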

And our team has another problem: we are still not good at community work
(for example, the patch TITLE is totally misleading), but we are growing up.
We are sorry, and thank you for pointing out the mistakes.

The feature/patchset does have problems. But it is not good to tangle
all the problems together and revert the whole feature.

Thanks,
Lai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Tang Chen

On 02/27/2013 03:25 PM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 11:11 PM, Tang Chen  wrote:
>> On 02/27/2013 02:54 PM, Yinghai Lu wrote:
>>>
>>> Those patches are tangled together.
>>
>> No, they are not.
>>
>> The following commits supports "movablemem_map=nn[KMG]@ss[KMG]".
>>
>> commit fb06bc8e5f42f38c011de0e59481f464a82380f6
>>  page_alloc: bootmem limit with movablecore_map
>> commit 42f47e27e761fee07da69e04612ec7dd0d490edd
>>  page_alloc: make movablemem_map have higher priority
>> commit 6981ec31146cf19454c55c130625f6cee89aab95
>>  page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
>> commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
>>  page_alloc: add movable_memmap kernel parameter
>> commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
>>  x86: get pg_data_t's memory from other node
>>
>> And the following supports "movablemem_map=srat".
>>
>> commit f7210e6c4ac795694106c1c5307134d3fc233e88
>>  mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect
>>  movablecore_map in memblock_overlaps_region().
>> commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
>>  acpi, memory-hotplug: support getting hotplug info from SRAT
>> commit 27168d38fa209073219abedbe6a9de7ba9acbfad
>>  acpi, memory-hotplug: extend movablemem_map ranges to the end of node
>> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>>  acpi, memory-hotplug: parse SRAT before memblock is ready
>
> those four can be reverted cleanly?

Sorry, if you want to revert, you just need to revert:
 commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
  acpi, memory-hotplug: parse SRAT before memblock is ready
 commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
  acpi, memory-hotplug: support getting hotplug info from SRAT

The other two have nothing to do with SRAT. And they are necessary.

Judging from the code, I think the revert is clean. But we'd better test it.

>>> Also it looks funny to ask user to specify mem range in boot command
>>> line to enable mem hotplug.
>>
>> Well, I think sometimes users don't like the SRAT memory style, and want to
>> increase or reduce hot-pluggable memory by themselves. And also, it is
>> useful for debugging firmware bugs.
>>
>> I agree that "movablemem_map=srat" functionality need more work to improve.
>> Can we not revert it, and improve it during 3.9rc ? I think during rc time,
>> at least we can fix the problems brought by early_parse_srat().
>
> looks like acpi_override can not be fixed.

About this problem, I need to do some investigation, and I think we can
give it a try.

I do hope we can keep these patches, and leave the improvement work for the
future. :)

Thanks. :)
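
To make the two forms of the parameter concrete, here is a minimal Python
sketch of how a movablemem_map value could be told apart and parsed. It is
an illustration, not the kernel's parser: it assumes nn is the size and ss
the start address (as with memmap=), accepts only decimal numbers with an
optional upper-case K, M or G suffix, and the function name
parse_movablemem_map() is hypothetical.

```python
import re
from typing import Optional, Tuple

# Multipliers for the optional size suffix.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_RANGE = re.compile(r"(\d+)([KMG]?)@(\d+)([KMG]?)")


def parse_movablemem_map(arg: str) -> Optional[Tuple[int, int]]:
    """Return None for "srat", else the half-open byte range (start, end)."""
    if arg == "srat":
        return None  # ranges would come from the firmware's SRAT instead
    m = _RANGE.fullmatch(arg)
    if m is None:
        raise ValueError(f"malformed movablemem_map value: {arg!r}")
    size = int(m.group(1)) * _UNITS[m.group(2)]
    start = int(m.group(3)) * _UNITS[m.group(4)]
    if size == 0:
        raise ValueError("range size must be non-zero")
    return start, start + size
```

Under these assumptions, "4G@8G" yields the range from 8 GiB up to, but not
including, 12 GiB, while "srat" yields no explicit range at all.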


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 11:11 PM, Tang Chen  wrote:
> On 02/27/2013 02:54 PM, Yinghai Lu wrote:
>>
>> Those patches are tangled together.
>
>
> No, they are not.
>
> The following commits supports "movablemem_map=nn[KMG]@ss[KMG]".
>
> commit fb06bc8e5f42f38c011de0e59481f464a82380f6
> page_alloc: bootmem limit with movablecore_map
> commit 42f47e27e761fee07da69e04612ec7dd0d490edd
> page_alloc: make movablemem_map have higher priority
> commit 6981ec31146cf19454c55c130625f6cee89aab95
> page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
> commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
> page_alloc: add movable_memmap kernel parameter
> commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
> x86: get pg_data_t's memory from other node
>
> And the following supports "movablemem_map=srat".
>
> commit f7210e6c4ac795694106c1c5307134d3fc233e88
> mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect 
> movablecore_map in memblock_overlaps_region().
> commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
> acpi, memory-hotplug: support getting hotplug info from SRAT
> commit 27168d38fa209073219abedbe6a9de7ba9acbfad
> acpi, memory-hotplug: extend movablemem_map ranges to the end of node
> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> acpi, memory-hotplug: parse SRAT before memblock is ready

those four can be reverted cleanly?

>
>>
>> Also it looks funny to ask user to specify mem range in boot command
>> line to enable mem hotplug.
>
>
> Well, I think sometimes users don't like the SRAT memory style, and want to
> increase or reduce hot-pluggable memory by themselves. And also, it is
> useful for debugging firmware bugs.
>
> I agree that "movablemem_map=srat" functionality need more work to improve.
> Can we not revert it, and improve it during 3.9rc ? I think during rc time,
> at least we can fix the problems brought by early_parse_srat().

looks like acpi_override can not be fixed.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Tang Chen

On 02/27/2013 02:54 PM, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 9:49 PM, Yasuaki Ishimatsu
  wrote:

2013/02/27 14:11, Yinghai Lu wrote:


On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
  wrote:


2013/02/27 13:04, Yinghai Lu wrote:



On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
  wrote:



2013/02/27 11:30, Yinghai Lu wrote:



Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot
plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.





In this case, system can boot. But other cpus with bunch of ram hot
plug may fails, since system does not have enough memory for cover
hot added memory. When hot adding memory device, kernel object for the
memory is allocated from 1G ram since hot added memory has not been
enabled.



yes, it may fail, if the one node memory need page table and vmemmap
is more than 1g ...






for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
init_mem_mapping() for booting path.
2. we need make memblock more generic, so we can use it with hot add
memory during runtime.
3. with that we can initialize page table for hot added node with ram.
a. initial page table for 2M near node top is from node0 ( that does
not support hot plug).
b. then will use 2M for memory below node top...
c. with that we will make sure page table stay on local node.
alloc_low_pages need to be updated to support that.
4. need to make sure vmemmap on local node too.




I think so too. By this, memory hot plug becomes more useful.



I agree with your idea. But I think above ideas is future work.
So at first we should use movable memory for memory hot plug.
After that, we will implement above ideas.






so hot-remove node will work too later.

In the long run, we should make booting path and hot adding more
similar and share at most code.
That will make code get more test coverage.



Tang,  Yasuaki, Andrew,

Please check if you are ok with attached reverting patch.



We will fix this problem with no objection. So please wait a while.

And the problem occurs by "movablemem_map=srat" not
"movablemem_map=nn[KMG]@ss[KMG]"
At least, if you want to revert it, you should revert only
"movablemem_map=srat" part.


Those patches are tangled together.


No, they are not.

The following commits support "movablemem_map=nn[KMG]@ss[KMG]".

commit fb06bc8e5f42f38c011de0e59481f464a82380f6
page_alloc: bootmem limit with movablecore_map
commit 42f47e27e761fee07da69e04612ec7dd0d490edd
page_alloc: make movablemem_map have higher priority
commit 6981ec31146cf19454c55c130625f6cee89aab95
page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
page_alloc: add movable_memmap kernel parameter
commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
x86: get pg_data_t's memory from other node

And the following commits support "movablemem_map=srat".

commit f7210e6c4ac795694106c1c5307134d3fc233e88
mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
acpi, memory-hotplug: support getting hotplug info from SRAT
commit 27168d38fa209073219abedbe6a9de7ba9acbfad
acpi, memory-hotplug: extend movablemem_map ranges to the end of node
commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
acpi, memory-hotplug: parse SRAT before memblock is ready



Also it looks funny to ask user to specify mem range in boot command
line to enable mem hotplug.


Well, I think sometimes users don't like the SRAT memory style, and want to
increase or reduce hot-pluggable memory by themselves. And also, it is
useful for debugging firmware bugs.

I agree that the "movablemem_map=srat" functionality needs more work.
Can we not revert it, and improve it during 3.9-rc? I think during rc time,
at least we can fix the problems brought by early_parse_srat().

Thanks. :)


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 9:49 PM, Yasuaki Ishimatsu wrote:
> 2013/02/27 14:11, Yinghai Lu wrote:
>>
>> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
>>  wrote:
>>>
>>> 2013/02/27 13:04, Yinghai Lu wrote:


 On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
  wrote:
>
>
> 2013/02/27 11:30, Yinghai Lu wrote:
>>
>>
>> Do you mean you can not boot one socket system with 1G ram ?
>> Assume socket 0 does not support hotplug, other 31 sockets support hot
>> plug.
>>
>> So we could boot system only with socket0, and later one by one hot
>> add other cpus.
>
>
>
>
> In this case, system can boot. But other cpus with bunch of ram hot
> plug may fails, since system does not have enough memory for cover
> hot added memory. When hot adding memory device, kernel object for the
> memory is allocated from 1G ram since hot added memory has not been
> enabled.
>

 yes, it may fail, if the one node memory need page table and vmemmap
 is more than 1g ...

>>>
>
 for hot add memory we need to
 1. add another wrapper for init_memory_mapping, just like
 init_mem_mapping() for booting path.
 2. we need make memblock more generic, so we can use it with hot add
 memory during runtime.
 3. with that we can initialize page table for hot added node with ram.
 a. initial page table for 2M near node top is from node0 ( that does
 not support hot plug).
 b. then will use 2M for memory below node top...
 c. with that we will make sure page table stay on local node.
alloc_low_pages need to be updated to support that.
 4. need to make sure vmemmap on local node too.
>>>
>>>
>>>
>>> I think so too. By this, memory hot plug becomes more useful.
>
>
> I agree with your idea. But I think above ideas is future work.
> So at first we should use movable memory for memory hot plug.
> After that, we will implement above ideas.
>
>
>>>

 so hot-remove node will work too later.

 In the long run, we should make booting path and hot adding more
 similar and share at most code.
 That will make code get more test coverage.
>>
>>
>> Tang,  Yasuaki, Andrew,
>>
>> Please check if you are ok with attached reverting patch.
>
>
> We will fix this problem with no objection. So please wait a while.
>
> And the problem occurs by "movablemem_map=srat" not
> "movablemem_map=nn[KMG]@ss[KMG]"
> At least, if you want to revert it, you should revert only
> "movablemem_map=srat" part.

Those patches are tangled together.

Also it looks funny to ask user to specify mem range in boot command
line to enable mem hotplug.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yasuaki Ishimatsu

2013/02/27 14:11, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
 wrote:

2013/02/27 13:04, Yinghai Lu wrote:


On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
 wrote:


2013/02/27 11:30, Yinghai Lu wrote:


Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot
plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.




In this case, system can boot. But other cpus with bunch of ram hot
plug may fails, since system does not have enough memory for cover
hot added memory. When hot adding memory device, kernel object for the
memory is allocated from 1G ram since hot added memory has not been
enabled.



yes, it may fail, if the one node memory need page table and vmemmap
is more than 1g ...






for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
init_mem_mapping() for booting path.
2. we need make memblock more generic, so we can use it with hot add
memory during runtime.
3. with that we can initialize page table for hot added node with ram.
a. initial page table for 2M near node top is from node0 ( that does
not support hot plug).
b. then will use 2M for memory below node top...
c. with that we will make sure page table stay on local node.
   alloc_low_pages need to be updated to support that.
4. need to make sure vmemmap on local node too.



I think so too. By this, memory hot plug becomes more useful.


I agree with your idea. But I think above ideas is future work.
So at first we should use movable memory for memory hot plug.
After that, we will implement above ideas.





so hot-remove node will work too later.

In the long run, we should make booting path and hot adding more
similar and share at most code.
That will make code get more test coverage.


Tang,  Yasuaki, Andrew,

Please check if you are ok with attached reverting patch.


We will fix this problem with no objection. So please wait a while.

And the problem is caused by "movablemem_map=srat", not
"movablemem_map=nn[KMG]@ss[KMG]".
At least, if you want to revert it, you should revert only the
"movablemem_map=srat" part.

Thanks,
Yasuaki Ishimatsu  



Tim, Don,
Can you try if attached reverting patch fix all the problems for you ?

Thanks

Yinghai






Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
 wrote:
> 2013/02/27 13:04, Yinghai Lu wrote:
>>
>> On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
>>  wrote:
>>>
>>> 2013/02/27 11:30, Yinghai Lu wrote:

 Do you mean you can not boot one socket system with 1G ram ?
 Assume socket 0 does not support hotplug, other 31 sockets support hot
 plug.

 So we could boot system only with socket0, and later one by one hot
 add other cpus.
>>>
>>>
>>>
>>> In this case, system can boot. But other cpus with bunch of ram hot
>>> plug may fails, since system does not have enough memory for cover
>>> hot added memory. When hot adding memory device, kernel object for the
>>> memory is allocated from 1G ram since hot added memory has not been
>>> enabled.
>>>
>>
>> yes, it may fail, if the one node memory need page table and vmemmap
>> is more than 1g ...
>>
>
>> for hot add memory we need to
>> 1. add another wrapper for init_memory_mapping, just like
>> init_mem_mapping() for booting path.
>> 2. we need make memblock more generic, so we can use it with hot add
>> memory during runtime.
>> 3. with that we can initialize page table for hot added node with ram.
>> a. initial page table for 2M near node top is from node0 ( that does
>> not support hot plug).
>> b. then will use 2M for memory below node top...
>> c. with that we will make sure page table stay on local node.
>>   alloc_low_pages need to be updated to support that.
>> 4. need to make sure vmemmap on local node too.
>
>
> I think so too. By this, memory hot plug becomes more useful.
>
>>
>> so hot-remove node will work too later.
>>
>> In the long run, we should make booting path and hot adding more
>> similar and share at most code.
>> That will make code get more test coverage.

Tang, Yasuaki, Andrew,

Please check whether you are OK with the attached revert patch.

Tim, Don,
Can you check whether the attached revert patch fixes all the problems for you?

Thanks

Yinghai


revert_movable_map.patch
Description: Binary data


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yasuaki Ishimatsu

2013/02/27 13:04, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
 wrote:

2013/02/27 11:30, Yinghai Lu wrote:

Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot
plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.



In this case, system can boot. But other cpus with bunch of ram hot
plug may fails, since system does not have enough memory for cover
hot added memory. When hot adding memory device, kernel object for the
memory is allocated from 1G ram since hot added memory has not been
enabled.



yes, it may fail, if the one node memory need page table and vmemmap
is more than 1g ...




for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
init_mem_mapping() for booting path.
2. we need make memblock more generic, so we can use it with hot add
memory during runtime.
3. with that we can initialize page table for hot added node with ram.
a. initial page table for 2M near node top is from node0 ( that does
not support hot plug).
b. then will use 2M for memory below node top...
c. with that we will make sure page table stay on local node.
  alloc_low_pages need to be updated to support that.
4. need to make sure vmemmap on local node too.


I think so too. By this, memory hot plug becomes more useful.

Thanks,
Yasuaki Ishimatsu



so hot-remove node will work too later.

In the long run, we should make booting path and hot adding more
similar and share at most code.
That will make code get more test coverage.

Thanks

Yinghai






Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Tang Chen

On 02/27/2013 10:24 AM, Yinghai Lu wrote:

After looked at the code more, thought that theory that does not let
kernel use ram
on hotplug area is not right.

after that commit, following range can not use movable ram:
1. real_mode code well..funny, legacy cpu0 [0,1M) could be
hot-removed?
2. dma_continguous ?
3. log buff ring.
4. initrd... why it will be freed after booting, so it could be on
movable...
5. crashkernel for kdump...: : looks like we can not put kdump kernel
above 4G anymore
6. initmem_init: it will allocate page table to setup kernel mapping
for memory..., it should
be with BRK and near end of max_pfn



AFAIK, the Linux kernel currently cannot migrate memory used by the kernel
itself. So any memory used by the kernel should not be in the movable area.


that depends.

initrd will be freed later, so it should be put anywhere that is under
max_pfn during boot.



OK, but initrd is not that big. Actually, before my code starts to work,
memblock has already reserved some memory. But it is not that big. On the
other hand, it is not that easy to find out which memory should be kept in
the unmovable area, and which should not.







If node is hotplugable, the mem related stuff like page table and
vmemmap could be
on the that node without problem and should be on that node.



page tables and vmemmap are kernel memory. They should not be movable, I
think.


why do you need to migrate pagetable and vmemmap for the memory range
that will be
offline ?


Hum, you are right. :)

True, we can store pagetable and vmemmap on the node that is hot-pluggable.
But just like the page_cgroup structs, we need additional work to handle it.

But based on the existing code, we didn't do any special handling. I think
we can improve it if needed. :)








assume first cpu only have 1G ram, and other 31 socket will have bunch of
ram
and those cpu with ram could be hotadd and hotremoved.
Now you want to put page table and vmemmap on first node.
The system would not boot as not enough memory for cover whole system RAM.



Yes, you are right. And a more extreme situation has been talked about by
HPA.

"If all the memory is hot-pluggable, then the kernel won't be able to boot."

So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
 acpi, memory-hotplug: support getting hotplug info from SRAT

I have excluded all the memory reserved by memblock, and any node that has
memory
reserved by memblock will be set to un-hot-pluggable, which means we will
have
enough memory (all the memory on the node) to boot the kernel. So I think
the problem
you are talking about has been solved.


I don't think that you understand the problem.

for the system that will put all pagetable and vmemmap on the 1G ram
of first cpu.
as all other ram are MOVABLE, so memblock_find_in_range will not use any local
ram on those nodes.



Yes, I know that. :)

In this case, the kernel will not be able to use local ram on those nodes.
It will cause some performance degradation.

I mean if the 1G ram is not enough for the kernel to boot, the current code
will set all the ram on the same node as un-hot-pluggable.

If all the ram on the node is not enough for the kernel to boot, it is a
really extreme situation, IIUC.

I think users can solve this problem in two ways:
1) add more ram to the node.
2) use movablemem_map=nn[KMG]@ss[KMG] to configure more ram as unmovable.
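
To make the nn[KMG]@ss[KMG] format in option 2) concrete, here is a minimal
user-space sketch of how such a string could be split into a size and a start
address. It is only an illustration: struct mem_range, parse_size() and
parse_movablemem_range() are hypothetical names made up for this example, and
the suffix handling is merely modelled on the kernel's memparse() helper. It
is not the parser used by the movablemem_map patches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for one parsed "nn[KMG]@ss[KMG]" range. */
struct mem_range {
    uint64_t start; /* ss: start address in bytes */
    uint64_t size;  /* nn: length in bytes */
};

/* Parse a number with an optional K/M/G suffix; *end is set past it. */
static uint64_t parse_size(const char *s, char **end)
{
    uint64_t val = strtoull(s, end, 0);

    switch (**end) {
    case 'G':
    case 'g':
        val <<= 10;
        /* fall through */
    case 'M':
    case 'm':
        val <<= 10;
        /* fall through */
    case 'K':
    case 'k':
        val <<= 10;
        (*end)++;
        break;
    default:
        break;
    }
    return val;
}

/* Return 0 on success, -1 if arg is not of the form nn[KMG]@ss[KMG]. */
static int parse_movablemem_range(const char *arg, struct mem_range *out)
{
    char *p;
    const char *q;
    uint64_t size, start;

    assert(arg != NULL && out != NULL);

    size = parse_size(arg, &p);
    if (p == arg || *p != '@' || size == 0)
        return -1;          /* no size, no '@', or an empty range */

    q = p + 1;
    start = parse_size(q, &p);
    if (p == q || *p != '\0')
        return -1;          /* no start address, or trailing junk */

    out->start = start;
    out->size = size;
    return 0;
}
```

With this sketch, "4G@8G" would describe 4G of movable memory starting at the
8G boundary, while a keyword such as "srat" is rejected and would have to be
handled separately.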


Thanks. :)



Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu wrote:
> 2013/02/27 11:30, Yinghai Lu wrote:
>> Do you mean you can not boot one socket system with 1G ram ?
>> Assume socket 0 does not support hotplug, other 31 sockets support hot
>> plug.
>>
>> So we could boot system only with socket0, and later one by one hot
>> add other cpus.
>
>
> In this case, system can boot. But other cpus with bunch of ram hot
> plug may fails, since system does not have enough memory for cover
> hot added memory. When hot adding memory device, kernel object for the
> memory is allocated from 1G ram since hot added memory has not been
> enabled.
>

yes, it may fail, if the one node memory need page table and vmemmap
is more than 1g ...
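
As a rough worked example of when that limit is hit, the sketch below
estimates the vmemmap size for a given amount of RAM. The numbers are
assumptions that are not stated in this thread: 4K base pages and a 64-byte
struct page, which is typical for x86_64 but depends on the kernel
configuration. Page tables are not counted at all.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, typical for x86_64 but not taken from this thread. */
#define PAGE_SIZE_BYTES   4096ULL  /* base page size */
#define STRUCT_PAGE_BYTES 64ULL    /* assumed sizeof(struct page) */

/* Bytes of vmemmap needed to describe ram_bytes of memory. */
static uint64_t vmemmap_bytes(uint64_t ram_bytes)
{
    assert(ram_bytes % PAGE_SIZE_BYTES == 0);
    return ram_bytes / PAGE_SIZE_BYTES * STRUCT_PAGE_BYTES;
}

/* Largest amount of RAM whose vmemmap still fits into budget_bytes. */
static uint64_t max_ram_for_budget(uint64_t budget_bytes)
{
    return budget_bytes / STRUCT_PAGE_BYTES * PAGE_SIZE_BYTES;
}
```

Under these assumptions vmemmap costs 1/64 of the memory it describes, so
vmemmap alone would use up the whole 1G once about 64G of RAM has been hot
added, before any page tables are allocated.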

for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
   init_mem_mapping() for the booting path.
2. make memblock more generic, so we can use it with hot add
   memory during runtime.
3. with that we can initialize the page table for a hot added node with ram.
   a. the initial page table for the 2M near node top is from node0 (that
      does not support hot plug).
   b. then we will use that 2M for memory below node top...
   c. with that we will make sure the page table stays on the local node.
      alloc_low_pages needs to be updated to support that.
4. need to make sure vmemmap is on the local node too.
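
Steps 3a to 3c can be pictured with a toy model. The sketch below is a
user-space illustration only, not kernel code: map_node_top_down() and
struct pt_stats are hypothetical names, and it assumes one page-table page
per 2M chunk while ignoring upper-level tables and vmemmap.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK (2ULL << 20)   /* 2M mapping step, as in step 3a */

struct pt_stats {
    uint64_t from_node0;  /* page-table pages taken from node 0 */
    uint64_t from_local;  /* page-table pages taken from the new node */
};

/*
 * Map [start, end) of a hot-added node top-down in 2M chunks.
 * Simplification for illustration: each chunk needs exactly one
 * page-table page.
 */
static struct pt_stats map_node_top_down(uint64_t start, uint64_t end)
{
    struct pt_stats st = { 0, 0 };
    uint64_t mapped_bottom = end;   /* nothing on the node is mapped yet */

    assert(start < end && start % CHUNK == 0 && end % CHUNK == 0);

    while (mapped_bottom > start) {
        if (mapped_bottom == end)
            st.from_node0++;  /* 3a: no local memory is usable yet */
        else
            st.from_local++;  /* 3b: use already-mapped local memory */
        mapped_bottom -= CHUNK;
    }
    return st;
}
```

In this model a 1G node needs a single page-table page from node0 and takes
the remaining 511 from its own memory, which is the property step 3c asks
for.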

so hot-remove node will work too later.

In the long run, we should make the boot path and the hot-add path more
similar and share as much code as possible.
That will give the code more test coverage.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yasuaki Ishimatsu

2013/02/27 11:30, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 4:52 PM, Yasuaki Ishimatsu
 wrote:

2013/02/27 7:44, Yinghai Lu wrote:


that commit is totally broken, and it should be reverted.

1. numa_init is called several times, NOT just for srat. so those
    nodes_clear(numa_nodes_parsed)
    memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed.
please consider sequence is: numaq, srat, amd, dummy.
You need to make fall back path working!

2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
   set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from
early_parse_srat.
it should be moved before that



 c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.



3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes
to x86 code.

4, it does not CC to TJ and other numa guys...
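
Point 1 above (the state reset in numa_init must stay because several
detection methods are tried in turn) can be illustrated with a small
user-space model. Everything in it is a stand-in made up for this example:
nodes_parsed and meminfo only imitate numa_nodes_parsed and numa_meminfo,
and srat_init_fails(), dummy_init(), try_method() and numa_fallback() are
hypothetical names, not the kernel functions.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 8

/* Toy stand-ins for numa_nodes_parsed and numa_meminfo. */
static unsigned int nodes_parsed;                 /* bitmask of nodes */
static struct { int nr_blk; int nid[MAX_NODES]; } meminfo;

/* A detection method that records partial results and then fails. */
static int srat_init_fails(void)
{
    assert(meminfo.nr_blk < MAX_NODES);
    nodes_parsed |= 1u << 1;
    meminfo.nid[meminfo.nr_blk++] = 1;
    return -1;
}

/* The last-resort method: everything belongs to node 0. */
static int dummy_init(void)
{
    assert(meminfo.nr_blk < MAX_NODES);
    nodes_parsed |= 1u << 0;
    meminfo.nid[meminfo.nr_blk++] = 0;
    return 0;
}

/* Run one method; optionally reset the shared state first. */
static int try_method(int (*init)(void), int reset_first)
{
    if (reset_first) {
        nodes_parsed = 0;
        memset(&meminfo, 0, sizeof(meminfo));
    }
    return init();
}

/* Try the methods in order until one succeeds; returns 0 on success. */
static int numa_fallback(int reset_first)
{
    int (*methods[])(void) = { srat_init_fails, dummy_init };
    size_t i;

    nodes_parsed = 0;
    memset(&meminfo, 0, sizeof(meminfo));
    for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
        if (try_method(methods[i], reset_first) == 0)
            return 0;
    return -1;
}
```

With the reset in place, the fallback ends with only node 0 recorded. Without
it, the leftovers of the failed method survive next to the dummy result,
which is the kind of broken fall back path the first point warns about.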



After looked at the code more, thought that theory that does not let
kernel use ram
on hotplug area is not right.




after that commit, following range can not use movable ram:
1. real_mode code well..funny, legacy cpu0 [0,1M) could be
hot-removed?
2. dma_continguous ?
3. log buff ring.
4. initrd... why it will be freed after booting, so it could be on
movable...
5. crashkernel for kdump...: : looks like we can not put kdump kernel
above 4G anymore
6. initmem_init: it will allocate page table to setup kernel mapping
for memory..., it should
be with BRK and near end of max_pfn



If you use "movablemem_map=srat", the above memory cannot use movable memory.
But in my understanding, current Linux cannot move the above memory. So the
above memory should not use movable memory.



that depends, like relocating initrd to different position.





If node is hotplugable, the mem related stuff like page table and
vmemmap could be
on the that node without problem and should be on that node.




assume first cpu only have 1G ram, and other 31 socket will have bunch of
ram
and those cpu with ram could be hotadd and hotremoved.
Now you want to put page table and vmemmap on first node.
The system would not boot as not enough memory for cover whole system RAM.



Even if we address the points you mention above, the system cannot boot.
In this case, the user should:
   o add ram to the first cpu
   o decrease hotpluggable ram by:
       - changing the hotpluggable information in SRAT
       - using movablemem_map=nn[KMG]@ss[KMG]





Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.


In this case, the system can boot. But hot-adding the other cpus with a
bunch of ram may fail, since the system does not have enough memory to
cover the hot-added memory. When hot-adding a memory device, the kernel
objects for that memory are allocated from the 1G ram, since the hot-added
memory has not been enabled yet.

Thanks,
Yasuaki Ishimatsu



We should simulate that way, just like boot system with PXM0 at first
and later during acpi scan, add other cpus/ram.

Thanks

Yinghai






Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 4:52 PM, Yasuaki Ishimatsu
 wrote:
> 2013/02/27 7:44, Yinghai Lu wrote:

 that commit is totally broken, and it should be reverted.

 1. numa_init is called several times, NOT just for srat. so those
 nodes_clear(numa_nodes_parsed)
 memset(&numa_meminfo, 0, sizeof(numa_meminfo))
 can not be just removed.
 please consider sequence is: numaq, srat, amd, dummy.
 You need to make fall back path working!

 2. simply split acpi_numa_init to early_parse_srat.
 a. that early_parse_srat is NOT called for ia64, so you break ia64.
 b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
   set_apicid_to_node(i, NUMA_NO_NODE)
 still left in numa_init. So it will just clear result from
 early_parse_srat.
 it should be moved before that
>>>
>>>
>>> c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
>>> early before override from INITRD is settled.
>>>

 3. that patch TITLE is total misleading, there is NO x86 in the title,
 but it changes
 to x86 code.

 4, it does not CC to TJ and other numa guys...
>>
>>
>> After looked at the code more, thought that theory that does not let
>> kernel use ram
>> on hotplug area is not right.
>>
>
>> after that commit, following range can not use movable ram:
>> 1. real_mode code well..funny, legacy cpu0 [0,1M) could be
>> hot-removed?
>> 2. dma_continguous ?
>> 3. log buff ring.
>> 4. initrd... why it will be freed after booting, so it could be on
>> movable...
>> 5. crashkernel for kdump...: : looks like we can not put kdump kernel
>> above 4G anymore
>> 6. initmem_init: it will allocate page table to setup kernel mapping
>> for memory..., it should
>> be with BRK and near end of max_pfn
>
>
> If you use "movablemem_map=srat", the above memory cannot use movable memory.
> But in my understanding, current Linux cannot move the above memory. So the
> above memory should not use movable memory.
>

that depends, like relocating initrd to different position.

>
>>
>> If node is hotplugable, the mem related stuff like page table and
>> vmemmap could be
>> on the that node without problem and should be on that node.
>>
>
>> assume first cpu only have 1G ram, and other 31 socket will have bunch of
>> ram
>> and those cpu with ram could be hotadd and hotremoved.
>> Now you want to put page table and vmemmap on first node.
>> The system would not boot as not enough memory for cover whole system RAM.
>
>
> Even if we solve your above mentions, the system cannot boot.
> In this case, user should:
>   o add ram to first cpu
>   o decreases hotpluggable ram by :
> - changing hotpluggable information of SRAT
> - using movablemem_map=nn[KMG]@ss[KMG]

Do you mean you can not boot one socket system with 1G ram ?

Assume socket 0 does not support hotplug, other 31 sockets support hot plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.

We should simulate that way, just like boot system with PXM0 at first
and later during acpi scan, add other cpus/ram.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 6:14 PM, Tang Chen  wrote:
> Hi Yinghai,
>
> Please see below. :)
>
>
> On 02/27/2013 06:44 AM, Yinghai Lu wrote:

 that commit is totally broken, and it should be reverted.

 1. numa_init is called several times, NOT just for srat. so those
 nodes_clear(numa_nodes_parsed)
 memset(&numa_meminfo, 0, sizeof(numa_meminfo))
 can not be just removed.
 please consider sequence is: numaq, srat, amd, dummy.
 You need to make fall back path working!

 2. simply split acpi_numa_init to early_parse_srat.
 a. that early_parse_srat is NOT called for ia64, so you break ia64.
 b.  for (i = 0; i<  MAX_LOCAL_APIC; i++)
   set_apicid_to_node(i, NUMA_NO_NODE)
 still left in numa_init. So it will just clear result from
 early_parse_srat.
 it should be moved before that
>>>
>>>
>>> c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
>>> early before override from INITRD is settled.
>>>

 3. that patch TITLE is total misleading, there is NO x86 in the title,
 but it changes
 to x86 code.

 4, it does not CC to TJ and other numa guys...
>>
>>
>> After looked at the code more, thought that theory that does not let
>> kernel use ram
>> on hotplug area is not right.
>>
>> after that commit, following range can not use movable ram:
>> 1. real_mode code well..funny, legacy cpu0 [0,1M) could be
>> hot-removed?
>> 2. dma_continguous ?
>> 3. log buff ring.
>> 4. initrd... why it will be freed after booting, so it could be on
>> movable...
>> 5. crashkernel for kdump...: : looks like we can not put kdump kernel
>> above 4G anymore
>> 6. initmem_init: it will allocate page table to setup kernel mapping
>> for memory..., it should
>> be with BRK and near end of max_pfn
>
>
> AFAIK, the Linux kernel currently cannot migrate memory used by the kernel
> itself. So any memory used by the kernel should not be in the movable area.

that depends.

initrd will be freed later, so it should be put anywhere that is under
max_pfn during boot.

>
>
>>
>> If node is hotplugable, the mem related stuff like page table and
>> vmemmap could be
>> on the that node without problem and should be on that node.
>
>
> page tables and vmemmap are kernel memory. They should not be movable, I
> think.

why do you need to migrate pagetable and vmemmap for the memory range
that will be
offline ?

>
>
>>
>> assume first cpu only have 1G ram, and other 31 socket will have bunch of
>> ram
>> and those cpu with ram could be hotadd and hotremoved.
>> Now you want to put page table and vmemmap on first node.
>> The system would not boot as not enough memory for cover whole system RAM.
>
>
> Yes, you are right. And a more extreme situation has been talked about by
> HPA.
>
> "If all the memory is hot-pluggable, then the kernel won't be able to boot."
>
> So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
> acpi, memory-hotplug: support getting hotplug info from SRAT
>
> I have excluded all the memory reserved by memblock, and any node that has
> memory
> reserved by memblock will be set to un-hot-pluggable, which means we will
> have
> enough memory (all the memory on the node) to boot the kernel. So I think
> the problem
> you are talking about has been solved.

I don't think that you understand the problem.

for the system that will put all pagetable and vmemmap on the 1G ram
of first cpu.
as all other ram are MOVABLE, so memblock_find_in_range will not use any local
ram on those nodes.

>
>
>>
>> e8d1955258091e4c92d5a975ebd7fd8a98f5d30f and related commits should be
>> just
>> reverted now.
>>
>> Thanks
>>
>> Yinghai
>>
>


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Tang Chen

Hi Yinghai,

Please see below. :)

On 02/27/2013 06:44 AM, Yinghai Lu wrote:

that commit is totally broken, and it should be reverted.

1. numa_init is called several times, NOT just for srat. so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed.
please consider sequence is: numaq, srat, amd, dummy.
You need to make fall back path working!

2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b.  for (i = 0; i<  MAX_LOCAL_APIC; i++)
  set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that


c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.



3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes
to x86 code.

4, it does not CC to TJ and other numa guys...


After looked at the code more, thought that theory that does not let
kernel use ram
on hotplug area is not right.

after that commit, following range can not use movable ram:
1. real_mode code well..funny, legacy cpu0 [0,1M) could be hot-removed?
2. dma_continguous ?
3. log buff ring.
4. initrd... why it will be freed after booting, so it could be on movable...
5. crashkernel for kdump...: : looks like we can not put kdump kernel
above 4G anymore
6. initmem_init: it will allocate page table to setup kernel mapping
for memory..., it should
be with BRK and near end of max_pfn


AFAIK, the Linux kernel currently cannot migrate memory used by the kernel
itself. So any memory used by the kernel should not be in the movable area.



If node is hotplugable, the mem related stuff like page table and
vmemmap could be
on the that node without problem and should be on that node.


page tables and vmemmap are kernel memory. They should not be movable, I
think.




assume first cpu only have 1G ram, and other 31 socket will have bunch of ram
and those cpu with ram could be hotadd and hotremoved.
Now you want to put page table and vmemmap on first node.
The system would not boot as not enough memory for cover whole system RAM.


Yes, you are right. And a more extreme situation has been talked about by HPA:

"If all the memory is hot-pluggable, then the kernel won't be able to boot."

So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
acpi, memory-hotplug: support getting hotplug info from SRAT

I have excluded all the memory reserved by memblock, and any node that has
memory reserved by memblock will be set to un-hot-pluggable, which means we
will have enough memory (all the memory on the node) to boot the kernel. So I
think the problem you are talking about has been solved.
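
The rule described above (a node that holds memblock-reserved memory is treated as not hot-pluggable) can be sketched as follows. This is an illustrative model only, not the code from the commit; struct mem_range, struct model_node and pin_nodes_with_reservations() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

struct mem_range { uint64_t start, end; };   /* half-open: [start, end) */
struct model_node { struct mem_range mem; bool hotpluggable; };

static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
    return a.start < b.end && b.start < a.end;
}

/* Clear the hotpluggable flag on every node that holds reserved memory. */
static void pin_nodes_with_reservations(struct model_node *nodes, int nr_nodes,
                                        const struct mem_range *reserved,
                                        int nr_reserved)
{
    for (int n = 0; n < nr_nodes; n++)
        for (int r = 0; r < nr_reserved; r++)
            if (ranges_overlap(nodes[n].mem, reserved[r]))
                nodes[n].hotpluggable = false;
}
```

Note that this only guarantees that nodes already used by early allocations stay in place; it says nothing about whether those nodes are large enough for everything that must be allocated later.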



e8d1955258091e4c92d5a975ebd7fd8a98f5d30f and related commits should be just
reverted now.

Thanks

Yinghai


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yasuaki Ishimatsu

2013/02/27 7:44, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 1:36 PM, Yinghai Lu  wrote:

On Mon, Feb 25, 2013 at 2:50 PM, Yinghai Lu  wrote:

On Mon, Feb 25, 2013 at 1:27 PM, Don Morris  wrote:

On 02/25/2013 10:32 AM, Tim Gardner wrote:

On 02/25/2013 08:02 AM, Tim Gardner wrote:

Is this an expected warning ? I'll boot a vanilla kernel just to be sure.

rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:



Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
is having an impact:


Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
still Sandy Bridge, though I don't think that matters).

Bisection leads to:
# bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
parse SRAT before memblock is ready

Nothing terribly obvious leaps out as to *why* that reshuffling messes
up the cpu<-->node bindings, but I wanted to put this out there while
I poke around further. [Note that the SRAT: PXM -> APIC -> Node print
outs during boot are the same either way -- if you look at the APIC
numbers of the processors (from /proc/cpuinfo), the processors should
be assigned to the correct node, but they aren't.] cc'ing Tang Chen
in case this is obvious to him or he's already fixed it somewhere not
on Linus's tree yet.

Don Morris



[0.170435] [ cut here ]
[0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
topology_sane.isra.2+0x71/0x84()
[0.170452] Hardware name: S2600CP
[0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
node! [node: 1 != 0]. Ignoring dependency.
[0.156000] smpboot: Booting Node   1, Processors  #1
[0.170455] Modules linked in:
[0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
[0.170461] Call Trace:
[0.170466]  [] warn_slowpath_common+0x7f/0xc0
[0.170473]  [] warn_slowpath_fmt+0x46/0x50
[0.170477]  [] topology_sane.isra.2+0x71/0x84
[0.170482]  [] set_cpu_sibling_map+0x23f/0x436
[0.170487]  [] start_secondary+0x137/0x201
[0.170502] ---[ end trace 09222f596307ca1d ]---
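
For readers unfamiliar with the warning: it comes from a sanity check that two CPUs sharing a last-level cache report the same NUMA node. The following is a simplified standalone model of such a check, not the code in arch/x86/kernel/smpboot.c; struct model_cpu and model_topology_sane() are hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>

struct model_cpu { int id; int llc_id; int node; };

/* Return true unless two CPUs share a last-level cache but sit on different nodes. */
static bool model_topology_sane(const struct model_cpu *a,
                                const struct model_cpu *b)
{
    if (a->llc_id != b->llc_id)
        return true;    /* not llc-siblings, nothing to check */
    if (a->node == b->node)
        return true;
    fprintf(stderr,
            "CPU #%d's llc-sibling CPU #%d is not on the same node! "
            "[node: %d != %d]\n", a->id, b->id, a->node, b->node);
    return false;
}
```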


that commit is totally broken, and it should be reverted.

1. numa_init is called several times, NOT just for srat. so those
    nodes_clear(numa_nodes_parsed)
    memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed.
please consider sequence is: numaq, srat, amd, dummy.
You need to make fall back path working!

2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
  set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that


c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.



3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes
to x86 code.

4, it does not CC to TJ and other numa guys...


After looked at the code more, thought that theory that does not let
kernel use ram
on hotplug area is not right.




after that commit, following range can not use movable ram:
1. real_mode code well..funny, legacy cpu0 [0,1M) could be hot-removed?
2. dma_continguous ?
3. log buff ring.
4. initrd... why it will be freed after booting, so it could be on movable...
5. crashkernel for kdump...: : looks like we can not put kdump kernel
above 4G anymore
6. initmem_init: it will allocate page table to setup kernel mapping
for memory..., it should
be with BRK and near end of max_pfn


If you use "movablemem_map=srat", the memory listed above can not be placed in
movable memory. But in my understanding, current Linux cannot move the memory
listed above, so it should not be placed in movable memory.



If node is hotplugable, the mem related stuff like page table and
vmemmap could be
on the that node without problem and should be on that node.




assume first cpu only have 1G ram, and other 31 socket will have bunch of ram
and those cpu with ram could be hotadd and hotremoved.
Now you want to put page table and vmemmap on first node.
The system would not boot as not enough memory for cover whole system RAM.


Even if we solve the issues you mention above, the system cannot boot.
In this case, the user should:
  o add RAM to the first cpu
  o decrease hotpluggable RAM by:
    - changing the hotpluggable information in SRAT
    - using movablemem_map=nn[KMG]@ss[KMG]
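
The nn[KMG]@ss[KMG] form can be parsed roughly as below. This is a user-space sketch, not the kernel's parser; it assumes nn is a size and ss a start address (by analogy with memmap=), and parse_size() and parse_movablemem_map() are hypothetical helper names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a number with an optional K/M/G suffix and advance *s past it. */
static uint64_t parse_size(const char **s)
{
    char *end;
    uint64_t v = strtoull(*s, &end, 0);

    switch (*end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; end++; break;
    default: break;
    }
    *s = end;
    return v;
}

/* Parse "nn[KMG]@ss[KMG]" into size and start; return 0 on success, -1 on error. */
static int parse_movablemem_map(const char *arg, uint64_t *size, uint64_t *start)
{
    const char *p = arg;

    *size = parse_size(&p);
    if (p == arg || *p != '@')
        return -1;      /* no size given, or missing '@' */
    p++;

    const char *q = p;
    *start = parse_size(&p);
    if (p == q || *p != '\0')
        return -1;      /* no start given, or trailing junk */
    return 0;
}
```

With this sketch, "4G@8G" yields a size of 4 GiB starting at 8 GiB, while the keyword form "srat" is rejected and would need separate handling.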

Thanks,
Yasuaki Ishimatsu



e8d1955258091e4c92d5a975ebd7fd8a98f5d30f and related commits should be just
reverted now.

Thanks

Yinghai

Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 1:36 PM, Yinghai Lu  wrote:
> On Mon, Feb 25, 2013 at 2:50 PM, Yinghai Lu  wrote:
>> On Mon, Feb 25, 2013 at 1:27 PM, Don Morris  wrote:
>>> On 02/25/2013 10:32 AM, Tim Gardner wrote:
>>>> On 02/25/2013 08:02 AM, Tim Gardner wrote:
>>>>> Is this an expected warning ? I'll boot a vanilla kernel just to be sure.
>>>>>
>>>>> rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:
>>>>>
>>>>
>>>> Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
>>>> is having an impact:
>>>
>>> Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
>>> still Sandy Bridge, though I don't think that matters).
>>>
>>> Bisection leads to:
>>> # bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
>>> parse SRAT before memblock is ready
>>>
>>> Nothing terribly obvious leaps out as to *why* that reshuffling messes
>>> up the cpu<-->node bindings, but I wanted to put this out there while
>>> I poke around further. [Note that the SRAT: PXM -> APIC -> Node print
>>> outs during boot are the same either way -- if you look at the APIC
>>> numbers of the processors (from /proc/cpuinfo), the processors should
>>> be assigned to the correct node, but they aren't.] cc'ing Tang Chen
>>> in case this is obvious to him or he's already fixed it somewhere not
>>> on Linus's tree yet.
>>>
>>> Don Morris
>>>

>>>> [0.170435] [ cut here ]
>>>> [0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
>>>> topology_sane.isra.2+0x71/0x84()
>>>> [0.170452] Hardware name: S2600CP
>>>> [0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
>>>> node! [node: 1 != 0]. Ignoring dependency.
>>>> [0.156000] smpboot: Booting Node   1, Processors  #1
>>>> [0.170455] Modules linked in:
>>>> [0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
>>>> [0.170461] Call Trace:
>>>> [0.170466]  [] warn_slowpath_common+0x7f/0xc0
>>>> [0.170473]  [] warn_slowpath_fmt+0x46/0x50
>>>> [0.170477]  [] topology_sane.isra.2+0x71/0x84
>>>> [0.170482]  [] set_cpu_sibling_map+0x23f/0x436
>>>> [0.170487]  [] start_secondary+0x137/0x201
>>>> [0.170502] ---[ end trace 09222f596307ca1d ]---
>>
>> that commit is totally broken, and it should be reverted.
>>
>> 1. numa_init is called several times, NOT just for srat. so those
>>    nodes_clear(numa_nodes_parsed)
>>    memset(&numa_meminfo, 0, sizeof(numa_meminfo))
>> can not be just removed.
>> please consider sequence is: numaq, srat, amd, dummy.
>> You need to make fall back path working!
>>
>> 2. simply split acpi_numa_init to early_parse_srat.
>> a. that early_parse_srat is NOT called for ia64, so you break ia64.
>> b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
>>  set_apicid_to_node(i, NUMA_NO_NODE)
>> still left in numa_init. So it will just clear result from early_parse_srat.
>> it should be moved before that
>
>c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
> early before override from INITRD is settled.
>
>>
>> 3. that patch TITLE is total misleading, there is NO x86 in the title,
>> but it changes
>> to x86 code.
>>
>> 4, it does not CC to TJ and other numa guys...

After looking at the code more, I think the theory that the kernel must not
use RAM in the hotplug area is not right.

After that commit, the following can not use movable RAM:
1. real_mode code. Well, funny: could legacy cpu0 [0,1M) be hot-removed?
2. dma_contiguous?
3. The log buffer ring.
4. initrd. Why? It will be freed after booting, so it could be on movable RAM.
5. crashkernel for kdump: it looks like we can not put the kdump kernel
above 4G anymore.
6. initmem_init: it allocates page tables to set up the kernel mapping
for memory. They should come from BRK and from near the end of max_pfn.

If a node is hotpluggable, the memory-related structures for it, like page
tables and vmemmap, could be on that node without problem, and should be on
that node.

Assume the first cpu only has 1G of RAM, the other 31 sockets have a bunch of
RAM each, and those cpus with RAM can be hot-added and hot-removed.
Now you want to put the page tables and vmemmap on the first node.
The system would not boot, as there is not enough memory there to cover the
whole system RAM.
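
A rough calculation shows why 1G on the first node cannot hold the metadata for such a system. The sketch below is illustrative only: it assumes 4 KiB base pages and a 64-byte struct page (typical for x86_64, but an assumption here), counts only vmemmap (not page tables or anything else), and the 64 GiB per socket used in the note after it is a made-up figure, since the mail only says "a bunch of RAM".

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096ULL       /* assumed base page size */
#define STRUCT_PAGE_BYTES 64ULL         /* assumed sizeof(struct page) */
#define GiB               (1ULL << 30)

/* Bytes of vmemmap needed to describe total_ram bytes of memory. */
static uint64_t vmemmap_bytes(uint64_t total_ram)
{
    return total_ram / PAGE_SIZE_BYTES * STRUCT_PAGE_BYTES;
}

/* Does the vmemmap for the whole system fit in the first node's RAM? */
static int fits_on_first_node(uint64_t first_node_ram, uint64_t other_ram)
{
    return vmemmap_bytes(first_node_ram + other_ram) <= first_node_ram;
}
```

Under these assumptions vmemmap costs 1/64 of RAM, so 31 sockets with 64 GiB each would need about 31 GiB of vmemmap, far more than the 1G available on the first node.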

e8d1955258091e4c92d5a975ebd7fd8a98f5d30f and related commits should be just
reverted now.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Mon, Feb 25, 2013 at 2:50 PM, Yinghai Lu  wrote:
> On Mon, Feb 25, 2013 at 1:27 PM, Don Morris  wrote:
>> On 02/25/2013 10:32 AM, Tim Gardner wrote:
>>> On 02/25/2013 08:02 AM, Tim Gardner wrote:
>>>> Is this an expected warning ? I'll boot a vanilla kernel just to be sure.
>>>>
>>>> rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:
>>>>
>>>
>>> Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
>>> is having an impact:
>>
>> Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
>> still Sandy Bridge, though I don't think that matters).
>>
>> Bisection leads to:
>> # bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
>> parse SRAT before memblock is ready
>>
>> Nothing terribly obvious leaps out as to *why* that reshuffling messes
>> up the cpu<-->node bindings, but I wanted to put this out there while
>> I poke around further. [Note that the SRAT: PXM -> APIC -> Node print
>> outs during boot are the same either way -- if you look at the APIC
>> numbers of the processors (from /proc/cpuinfo), the processors should
>> be assigned to the correct node, but they aren't.] cc'ing Tang Chen
>> in case this is obvious to him or he's already fixed it somewhere not
>> on Linus's tree yet.
>>
>> Don Morris
>>
>>>
>>> [0.170435] [ cut here ]
>>> [0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
>>> topology_sane.isra.2+0x71/0x84()
>>> [0.170452] Hardware name: S2600CP
>>> [0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
>>> node! [node: 1 != 0]. Ignoring dependency.
>>> [0.156000] smpboot: Booting Node   1, Processors  #1
>>> [0.170455] Modules linked in:
>>> [0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
>>> [0.170461] Call Trace:
>>> [0.170466]  [] warn_slowpath_common+0x7f/0xc0
>>> [0.170473]  [] warn_slowpath_fmt+0x46/0x50
>>> [0.170477]  [] topology_sane.isra.2+0x71/0x84
>>> [0.170482]  [] set_cpu_sibling_map+0x23f/0x436
>>> [0.170487]  [] start_secondary+0x137/0x201
>>> [0.170502] ---[ end trace 09222f596307ca1d ]---
>
> that commit is totally broken, and it should be reverted.
>
> 1. numa_init is called several times, NOT just for srat. so those
>    nodes_clear(numa_nodes_parsed)
>    memset(&numa_meminfo, 0, sizeof(numa_meminfo))
> can not be just removed.
> please consider sequence is: numaq, srat, amd, dummy.
> You need to make fall back path working!
>
> 2. simply split acpi_numa_init to early_parse_srat.
> a. that early_parse_srat is NOT called for ia64, so you break ia64.
> b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
>  set_apicid_to_node(i, NUMA_NO_NODE)
> still left in numa_init. So it will just clear result from early_parse_srat.
> it should be moved before that

   c.  it breaks ACPI_TABLE_OVERIDE, as the ACPI table scan is moved
early, before the override from INITRD is settled.

>
> 3. that patch TITLE is total misleading, there is NO x86 in the title,
> but it changes
> to x86 code.
>
> 4, it does not CC to TJ and other numa guys...
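
Point 1 can be illustrated with a small standalone model: each init method is tried in turn, and the shared state is reset before every attempt so that a failed method cannot leave stale data behind for the fallback. This is a sketch only, not the kernel's numa_init(); nodes_parsed, meminfo, failing_init(), dummy_init() and the model_* functions are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8

/* Hypothetical stand-ins for numa_nodes_parsed and numa_meminfo. */
static bool nodes_parsed[MAX_NODES];
static struct { int nr_blk; } meminfo;

/* A method that records partial state and then fails. */
static int failing_init(void)
{
    nodes_parsed[3] = true;
    meminfo.nr_blk = 5;
    return -1;
}

/* A fallback that describes a single node, like a dummy init. */
static int dummy_init(void)
{
    nodes_parsed[0] = true;
    meminfo.nr_blk = 1;
    return 0;
}

/* Reset the shared state before every attempt, then run the method. */
static int model_numa_init(int (*init_func)(void))
{
    memset(nodes_parsed, 0, sizeof(nodes_parsed));
    memset(&meminfo, 0, sizeof(meminfo));
    return init_func();
}

/* Try each method in order until one succeeds. */
static int model_numa_setup(void)
{
    int (*methods[])(void) = { failing_init, dummy_init };

    for (size_t i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
        if (model_numa_init(methods[i]) == 0)
            return 0;
    return -1;
}
```

If the two memset() calls were dropped from model_numa_init(), node 3 would still be marked as parsed after the fallback succeeds, which is the kind of leftover state the mail warns about.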


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 6:14 PM, Tang Chen tangc...@cn.fujitsu.com wrote:
 Hi Yinghai,

 Please see below. :)


 On 02/27/2013 06:44 AM, Yinghai Lu wrote:

 that commit is totally broken, and it should be reverted.

 1. numa_init is called several times, NOT just for srat. so those
 nodes_clear(numa_nodes_parsed)
 memset(&numa_meminfo, 0, sizeof(numa_meminfo))
 can not be just removed.
 please consider sequence is: numaq, srat, amd, dummy.
 You need to make fall back path working!

 2. simply split acpi_numa_init to early_parse_srat.
 a. that early_parse_srat is NOT called for ia64, so you break ia64.
 b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
   set_apicid_to_node(i, NUMA_NO_NODE)
 still left in numa_init. So it will just clear result from
 early_parse_srat.
 it should be moved before that


 c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
 early before override from INITRD is settled.


 3. that patch TITLE is total misleading, there is NO x86 in the title,
 but it changes
 to x86 code.

 4, it does not CC to TJ and other numa guys...


 After looked at the code more, thought that theory that does not let
 kernel use ram
 on hotplug area is not right.

 after that commit, following range can not use movable ram:
 1. real_mode code well..funny, legacy cpu0 [0,1M) could be
 hot-removed?
 2. dma_continguous ?
 3. log buff ring.
 4. initrd... why it will be freed after booting, so it could be on
 movable...
 5. crashkernel for kdump...: : looks like we can not put kdump kernel
 above 4G anymore
 6. initmem_init: it will allocate page table to setup kernel mapping
 for memory..., it should
 be with BRK and near end of max_pfn


 AFAIK, the Linux kernel currently cannot migrate memory that the kernel itself
 uses. So any memory used by the kernel should not be in a movable area.

That depends.

initrd will be freed later, so it should be possible to put it anywhere
under max_pfn during boot.




 If node is hotplugable, the mem related stuff like page table and
 vmemmap could be
 on the that node without problem and should be on that node.


 page tables and vmemmap are kernel memory. They should not be movable, I
 think.

Why do you need to migrate the page tables and vmemmap for a memory range
that will be offlined?




 assume first cpu only have 1G ram, and other 31 socket will have bunch of
 ram
 and those cpu with ram could be hotadd and hotremoved.
 Now you want to put page table and vmemmap on first node.
 The system would not boot as not enough memory for cover whole system RAM.


 Yes, you are right. And a more extreme situation has been talked about by
 HPA.

 If all the memory is hot-pluggable, then the kernel won't be able to boot.

 So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
 acpi, memory-hotplug: support getting hotplug info from SRAT

 I have excluded all the memory reserved by memblock, and any node that has
 memory
 reserved by memblock will be set to un-hot-pluggable, which means we will
 have
 enough memory (all the memory on the node) to boot the kernel. So I think
 the problem
 you are talking about has been solved.

I don't think that you understand the problem.

On such a system, all the pagetable and vmemmap will be put on the 1G ram
of the first cpu: as all the other ram is MOVABLE, memblock_find_in_range
will not use any local ram on those nodes.
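
To make the point concrete, here is a rough user-space model, not the real memblock code. It assumes the ranges are sorted by address and that a search walks them top-down and skips anything marked movable; struct mem_range and model_find_in_range() are made-up names for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one memory range; not a kernel structure. */
struct mem_range {
    uint64_t start;    /* inclusive */
    uint64_t end;      /* exclusive */
    int      nid;      /* node that owns the range */
    bool     movable;  /* true if covered by the movable map */
};

/*
 * Walk the ranges from the highest address down and return the start of a
 * block of 'size' bytes at the top of the first usable range.
 * Returns 0 if nothing fits. Assumes 'r' is sorted by ascending address.
 */
static uint64_t model_find_in_range(const struct mem_range *r, size_t n,
                                    uint64_t size, int *nid)
{
    assert(r != NULL && nid != NULL && size > 0);

    for (size_t i = n; i-- > 0; ) {
        if (r[i].movable)
            continue;                       /* never allocate from movable ram */
        if (r[i].end - r[i].start < size)
            continue;                       /* range too small */
        *nid = r[i].nid;
        return r[i].end - size;
    }
    return 0;
}
```

With node 0 holding 1G of non-movable ram and every other node marked movable, each request in this model lands on node 0, whichever node the page table or vmemmap is meant to describe.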




 e8d1955258091e4c92d5a975ebd7fd8a98f5d30f and related commits should be
 just
 reverted now.

 Thanks

 Yinghai


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 4:52 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:
 2013/02/27 7:44, Yinghai Lu wrote:

 that commit is totally broken, and it should be reverted.

 1. numa_init is called several times, NOT just for srat. so those
 nodes_clear(numa_nodes_parsed)
 memset(numa_meminfo, 0, sizeof(numa_meminfo))
 can not be just removed.
 please consider sequence is: numaq, srat, amd, dummy.
 You need to make fall back path working!

 2. simply split acpi_numa_init to early_parse_srat.
 a. that early_parse_srat is NOT called for ia64, so you break ia64.
 b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
   set_apicid_to_node(i, NUMA_NO_NODE)
 still left in numa_init. So it will just clear result from
 early_parse_srat.
 it should be moved before that


 c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
 early before override from INITRD is settled.


 3. that patch TITLE is total misleading, there is NO x86 in the title,
 but it changes
 to x86 code.

 4, it does not CC to TJ and other numa guys...
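
 Points 1 and 2b can be illustrated with a small user-space model. This is
 only a sketch: struct numa_state, model_reset() and the model_* methods are
 hypothetical stand-ins, not the kernel's numa_init code. It assumes that each
 init method may leave partial results behind when it fails, so the shared
 state has to be cleared before every attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_APIC 8
#define MODEL_NO_NODE  (-1)

/* Hypothetical stand-in for numa_nodes_parsed and the apicid table. */
struct numa_state {
    unsigned long nodes_parsed;
    int apicid_to_node[MODEL_MAX_APIC];
};

typedef bool (*numa_init_fn)(struct numa_state *st);

/* Clear everything a previous, failed method may have left behind. */
static void model_reset(struct numa_state *st)
{
    st->nodes_parsed = 0;
    for (size_t i = 0; i < MODEL_MAX_APIC; i++)
        st->apicid_to_node[i] = MODEL_NO_NODE;
}

/* A method that records partial data and then fails. */
static bool model_srat_fails(struct numa_state *st)
{
    st->nodes_parsed |= 1UL << 3;
    st->apicid_to_node[1] = 3;
    return false;
}

/* The last-resort method: everything belongs to node 0. */
static bool model_dummy(struct numa_state *st)
{
    st->nodes_parsed |= 1UL << 0;
    for (size_t i = 0; i < MODEL_MAX_APIC; i++)
        st->apicid_to_node[i] = 0;
    return true;
}

/*
 * Try each method in order, resetting the state before every attempt.
 * Returns the index of the method that succeeded, or -1.
 */
static int model_numa_init_all(struct numa_state *st,
                               const numa_init_fn *methods, size_t n)
{
    assert(st != NULL && methods != NULL);

    for (size_t m = 0; m < n; m++) {
        model_reset(st);
        if (methods[m](st))
            return (int)m;
    }
    return -1;
}
```

 The same model shows the ordering problem in 2b: if a parser fills
 apicid_to_node[] before model_reset() runs, its result is simply wiped.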


 After looked at the code more, thought that theory that does not let
 kernel use ram
 on hotplug area is not right.


 after that commit, following range can not use movable ram:
 1. real_mode code well..funny, legacy cpu0 [0,1M) could be
 hot-removed?
 2. dma_continguous ?
 3. log buff ring.
 4. initrd... why it will be freed after booting, so it could be on
 movable...
 5. crashkernel for kdump...: : looks like we can not put kdump kernel
 above 4G anymore
 6. initmem_init: it will allocate page table to setup kernel mapping
 for memory..., it should
 be with BRK and near end of max_pfn


 If you use movablemem_map=srat, above memory can not use movable memory.
 But in my understanding, current Linux cannot move above memory. So above
 memory should not use movable memory.


that depends, like relocating initrd to different position.



 If node is hotplugable, the mem related stuff like page table and
 vmemmap could be
 on the that node without problem and should be on that node.


 assume first cpu only have 1G ram, and other 31 socket will have bunch of
 ram
 and those cpu with ram could be hotadd and hotremoved.
 Now you want to put page table and vmemmap on first node.
 The system would not boot as not enough memory for cover whole system RAM.


 Even if we solve your above mentions, the system cannot boot.
 In this case, user should:
   o add ram to first cpu
   o decreases hotpluggable ram by :
 - changing hotpluggable information of SRAT
 - using movablemem_map=nn[KMG]@ss[KMG]

Do you mean you can not boot one socket system with 1G ram ?

Assume socket 0 does not support hotplug, other 31 sockets support hot plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.

We should simulate that way, just like boot system with PXM0 at first
and later during acpi scan, add other cpus/ram.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yasuaki Ishimatsu

2013/02/27 11:30, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 4:52 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:

2013/02/27 7:44, Yinghai Lu wrote:


that commit is totally broken, and it should be reverted.

1. numa_init is called several times, NOT just for srat. so those
 nodes_clear(numa_nodes_parsed)
 memset(numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed.
please consider sequence is: numaq, srat, amd, dummy.
You need to make fall back path working!

2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
   set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from
early_parse_srat.
it should be moved before that



 c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.



3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes
to x86 code.

4, it does not CC to TJ and other numa guys...



After looked at the code more, thought that theory that does not let
kernel use ram
on hotplug area is not right.




after that commit, following range can not use movable ram:
1. real_mode code well..funny, legacy cpu0 [0,1M) could be
hot-removed?
2. dma_continguous ?
3. log buff ring.
4. initrd... why it will be freed after booting, so it could be on
movable...
5. crashkernel for kdump...: : looks like we can not put kdump kernel
above 4G anymore
6. initmem_init: it will allocate page table to setup kernel mapping
for memory..., it should
be with BRK and near end of max_pfn



If you use movablemem_map=srat, above memory can not use movable memory.
But in my understanding, current Linux cannot move above memory. So above
memory should not use movable memory.



that depends, like relocating initrd to different position.





If node is hotplugable, the mem related stuff like page table and
vmemmap could be
on the that node without problem and should be on that node.




assume first cpu only have 1G ram, and other 31 socket will have bunch of
ram
and those cpu with ram could be hotadd and hotremoved.
Now you want to put page table and vmemmap on first node.
The system would not boot as not enough memory for cover whole system RAM.



Even if we solve your above mentions, the system cannot boot.
In this case, user should:
   o add ram to first cpu
   o decreases hotpluggable ram by :
 - changing hotpluggable information of SRAT
 - using movablemem_map=nn[KMG]@ss[KMG]





Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.


In this case, the system can boot. But hot adding the other cpus with a
bunch of ram may fail, since the system does not have enough memory to
cover the hot added memory. When hot adding a memory device, the kernel
objects for that memory are allocated from the 1G ram, since the hot added
memory has not been enabled yet.

Thanks,
Yasuaki Ishimatsu



We should simulate that way, just like boot system with PXM0 at first
and later during acpi scan, add other cpus/ram.

Thanks

Yinghai






Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:
 2013/02/27 11:30, Yinghai Lu wrote:
 Do you mean you can not boot one socket system with 1G ram ?
 Assume socket 0 does not support hotplug, other 31 sockets support hot
 plug.

 So we could boot system only with socket0, and later one by one hot
 add other cpus.


 In this case, system can boot. But other cpus with bunch of ram hot
 plug may fails, since system does not have enough memory for cover
 hot added memory. When hot adding memory device, kernel object for the
 memory is allocated from 1G ram since hot added memory has not been
 enabled.


yes, it may fail, if the page table and vmemmap needed for one node's
memory take more than 1G ...
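
As a rough feel for the numbers, the helpers below do the arithmetic under stated assumptions: 4 KiB base pages, a 64-byte struct page, and a direct mapping built from 2 MiB entries of 8 bytes each, with upper page table levels ignored. The constants and function names are illustrative only and are not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE        4096ULL        /* assumption: 4 KiB base pages   */
#define MODEL_STRUCT_PAGE_SIZE 64ULL          /* assumption: 64-byte struct page */
#define MODEL_LARGE_PAGE       (2ULL << 20)   /* assumption: 2 MiB mappings      */
#define MODEL_ENTRY_SIZE       8ULL           /* assumption: 8-byte table entry  */

/* Bytes of vmemmap needed to describe 'ram_bytes' of memory. */
static uint64_t model_vmemmap_bytes(uint64_t ram_bytes)
{
    assert(ram_bytes % MODEL_PAGE_SIZE == 0);
    return ram_bytes / MODEL_PAGE_SIZE * MODEL_STRUCT_PAGE_SIZE;
}

/* Bytes of 2 MiB-level entries needed to map 'ram_bytes' (upper levels ignored). */
static uint64_t model_large_map_bytes(uint64_t ram_bytes)
{
    uint64_t entries = (ram_bytes + MODEL_LARGE_PAGE - 1) / MODEL_LARGE_PAGE;
    return entries * MODEL_ENTRY_SIZE;
}
```

Under these assumptions vmemmap costs 1/64 of the described memory, so model_vmemmap_bytes(64ULL << 30) is exactly 1 GiB, while the 2 MiB mapping for the same 64 GiB takes only 256 KiB. In this model it is vmemmap, not the page table, that runs a 1G boot node out of memory first.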

for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
   init_mem_mapping() for the booting path.
2. make memblock more generic, so we can use it with hot added
   memory during runtime.
3. with that we can initialize the page table for a hot added node with ram.
   a. the initial page table for the 2M near the node top is from node0
      (which does not support hot plug).
   b. then will use that 2M for memory below the node top...
   c. with that we will make sure the page table stays on the local node.
      alloc_low_pages needs to be updated to support that.
4. need to make sure vmemmap is on the local node too.

so hot-remove node will work too later.
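
Step 3 can be sketched as a toy model. It is not the real init_memory_mapping path: it assumes the node is mapped top-down in 2M steps, that one page-table page covers one 1G-aligned region, and that upper levels can be ignored. model_map_node_top_down() and struct pt_source_count are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_STEP   (2ULL << 20)   /* 2 MiB mapping step, as in the list above   */
#define MODEL_REGION (1ULL << 30)   /* assumption: one table page covers 1 GiB    */

/* Where the page-table pages for one node came from. */
struct pt_source_count {
    unsigned from_node0;   /* taken from the boot node              */
    unsigned from_local;   /* taken from the node being hot added   */
};

/* Map [start, end) top-down and count the source of every table page. */
static struct pt_source_count model_map_node_top_down(uint64_t start, uint64_t end)
{
    struct pt_source_count c = { 0, 0 };
    uint64_t have_table_for = UINT64_MAX;   /* region that already has a table */

    assert(start < end);
    assert(start % MODEL_STEP == 0 && end % MODEL_STEP == 0);

    for (uint64_t addr = end; addr > start; addr -= MODEL_STEP) {
        uint64_t chunk = addr - MODEL_STEP;       /* this step maps [chunk, addr) */
        uint64_t region = chunk / MODEL_REGION;

        if (region != have_table_for) {
            if (addr == end)
                c.from_node0++;   /* nothing on this node is mapped yet   */
            else
                c.from_local++;   /* [addr, end) is mapped, allocate there */
            have_table_for = region;
        }
    }
    return c;
}
```

For a node covering 4G to 8G this model takes one table page from node0 and three from the node itself, which is the property that would let the node be removed later without leaving its page table behind on node0.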

In the long run, we should make the booting path and the hot adding path
more similar and share as much code as possible.
That will give the code more test coverage.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Tang Chen

On 02/27/2013 10:24 AM, Yinghai Lu wrote:

After looked at the code more, thought that theory that does not let
kernel use ram
on hotplug area is not right.

after that commit, following range can not use movable ram:
1. real_mode code well..funny, legacy cpu0 [0,1M) could be
hot-removed?
2. dma_continguous ?
3. log buff ring.
4. initrd... why it will be freed after booting, so it could be on
movable...
5. crashkernel for kdump...: : looks like we can not put kdump kernel
above 4G anymore
6. initmem_init: it will allocate page table to setup kernel mapping
for memory..., it should
be with BRK and near end of max_pfn



AFAIK, Linux kernel now cannot migrate memory used by the kernel because. So
any memory
used by the kernel should not be on movable area.


that depends.

initrd will be freed later, so it should be put anywhere that is under
max_pfn during boot.



OK, but initrd is not that big. Actually, before my code starts to work,
memblock has already reserved some memory, though not much. On the other
hand, it is not that easy to find out which memory should be kept in the
unmovable area and which should not.







If node is hotplugable, the mem related stuff like page table and
vmemmap could be
on the that node without problem and should be on that node.



page tables and vmemmap are kernel memory. They should not be movable, I
think.


why do you need to migrate pagetable and vmemmap for the memory range
that will be
offline ?


Hum, you are right. :)

True, we can store pagetable and vmemmap on the node that is hot-pluggable.
But just like the page_cgroup structs, we need additional work to handle it.

But based on the existing code, we didn't do any special handling. I think
we can improve it if needed. :)








assume first cpu only have 1G ram, and other 31 socket will have bunch of
ram
and those cpu with ram could be hotadd and hotremoved.
Now you want to put page table and vmemmap on first node.
The system would not boot as not enough memory for cover whole system RAM.



Yes, you are right. And a more extreme situation has been talked about by
HPA.

If all the memory is hot-pluggable, then the kernel won't be able to boot.

So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
 acpi, memory-hotplug: support getting hotplug info from SRAT

I have excluded all the memory reserved by memblock, and any node that has
memory
reserved by memblock will be set to un-hot-pluggable, which means we will
have
enough memory (all the memory on the node) to boot the kernel. So I think
the problem
you are talking about has been solved.


I don't think that you understand the problem.

for the system that will put all pagetable and vmemmap on the 1G ram
of first cpu.
as all other ram are MOVABLE, so memblock_find_in_range will not use any local
ram on those nodes.



Yes, I know that. :)

In this case, the kernel will not be able to use local ram on those nodes,
which will cost some performance.

I mean that if the 1G ram is not enough for the kernel to boot, the current
code will set all the ram on the same node as un-hot-pluggable.

If all the ram on the node is still not enough for the kernel to boot, that
is a really extreme situation, IIUC.
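
The rule described above can be modelled roughly as follows. This is a sketch of the idea only, not the code from the commit: struct span, struct model_node and model_pin_reserved_nodes() are hypothetical names, and it assumes that any overlap between a reserved range and a node's memory is enough to pin the whole node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address range, end exclusive. */
struct span {
    uint64_t start;
    uint64_t end;
};

/* Hypothetical per-node record. */
struct model_node {
    struct span mem;
    bool hotpluggable;
};

static bool spans_overlap(struct span a, struct span b)
{
    return a.start < b.end && b.start < a.end;
}

/*
 * Mark every node that contains reserved memory as un-hot-pluggable.
 * Returns the number of nodes whose flag was cleared by this call.
 */
static size_t model_pin_reserved_nodes(struct model_node *nodes, size_t nn,
                                       const struct span *reserved, size_t nr)
{
    size_t pinned = 0;

    assert(nodes != NULL && reserved != NULL);

    for (size_t i = 0; i < nn; i++) {
        if (!nodes[i].hotpluggable)
            continue;
        for (size_t j = 0; j < nr; j++) {
            if (spans_overlap(nodes[i].mem, reserved[j])) {
                nodes[i].hotpluggable = false;
                pinned++;
                break;
            }
        }
    }
    return pinned;
}
```

In this model a single early reservation inside node 0 keeps all of node 0 available to the kernel, while nodes without any reservation stay hot-pluggable.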

I think users can solve this problem in two ways:
1) add more ram to the node.
2) use movablemem_map=nn[KMG]@ss[KMG] to configure more ram as unmovable.
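
For reference, the nn[KMG]@ss[KMG] form in option 2 could be parsed along these lines. This is a user-space sketch, not the kernel's parser: it reads nn as a size and ss as a start address, following the memmap= convention, and model_parse_size() and model_parse_movablemem_map() are made-up helper names. Overflow is not checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "<number>[KMG]"; on success store the byte value and advance *s. */
static bool model_parse_size(const char **s, uint64_t *out)
{
    char *end;
    unsigned long long v = strtoull(*s, &end, 0);

    if (end == *s)
        return false;               /* no digits at all */

    switch (*end) {
    case 'G': case 'g':
        v <<= 10;                   /* fall through */
    case 'M': case 'm':
        v <<= 10;                   /* fall through */
    case 'K': case 'k':
        v <<= 10;
        end++;                      /* consume the suffix */
        break;
    default:
        break;
    }

    *out = v;
    *s = end;
    return true;
}

/* Parse "nn[KMG]@ss[KMG]" into a size and a start address, both in bytes. */
static bool model_parse_movablemem_map(const char *arg,
                                       uint64_t *size, uint64_t *start)
{
    assert(arg != NULL && size != NULL && start != NULL);

    if (!model_parse_size(&arg, size))
        return false;
    if (*arg != '@')
        return false;
    arg++;
    if (!model_parse_size(&arg, start))
        return false;
    return *arg == '\0';            /* reject trailing garbage */
}
```

For example, "4G@16G" parses to a size of 4G starting at 16G, while a bare "4G" is rejected because the @ss part is missing.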


Thanks. :)



Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yasuaki Ishimatsu

2013/02/27 13:04, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:

2013/02/27 11:30, Yinghai Lu wrote:

Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot
plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.



In this case, system can boot. But other cpus with bunch of ram hot
plug may fails, since system does not have enough memory for cover
hot added memory. When hot adding memory device, kernel object for the
memory is allocated from 1G ram since hot added memory has not been
enabled.



yes, it may fail, if the one node memory need page table and vmemmap
is more than 1g ...




for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
init_mem_mapping() for booting path.
2. we need make memblock more generic, so we can use it with hot add
memory during runtime.
3. with that we can initialize page table for hot added node with ram.
a. initial page table for 2M near node top is from node0 ( that does
not support hot plug).
b. then will use 2M for memory below node top...
c. with that we will make sure page table stay on local node.
  alloc_low_pages need to be updated to support that.
4. need to make sure vmemmap on local node too.


I think so too. By this, memory hot plug becomes more useful.

Thanks,
Yasuaki Ishimatsu



so hot-remove node will work too later.

In the long run, we should make booting path and hot adding more
similar and share at most code.
That will make code get more test coverage.

Thanks

Yinghai






Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:
 2013/02/27 13:04, Yinghai Lu wrote:

 On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
 isimatu.yasu...@jp.fujitsu.com wrote:

 2013/02/27 11:30, Yinghai Lu wrote:

 Do you mean you can not boot one socket system with 1G ram ?
 Assume socket 0 does not support hotplug, other 31 sockets support hot
 plug.

 So we could boot system only with socket0, and later one by one hot
 add other cpus.



 In this case, system can boot. But other cpus with bunch of ram hot
 plug may fails, since system does not have enough memory for cover
 hot added memory. When hot adding memory device, kernel object for the
 memory is allocated from 1G ram since hot added memory has not been
 enabled.


 yes, it may fail, if the one node memory need page table and vmemmap
 is more than 1g ...


 for hot add memory we need to
 1. add another wrapper for init_memory_mapping, just like
 init_mem_mapping() for booting path.
 2. we need make memblock more generic, so we can use it with hot add
 memory during runtime.
 3. with that we can initialize page table for hot added node with ram.
 a. initial page table for 2M near node top is from node0 ( that does
 not support hot plug).
 b. then will use 2M for memory below node top...
 c. with that we will make sure page table stay on local node.
   alloc_low_pages need to be updated to support that.
 4. need to make sure vmemmap on local node too.


 I think so too. By this, memory hot plug becomes more useful.


 so hot-remove node will work too later.

 In the long run, we should make booting path and hot adding more
 similar and share at most code.
 That will make code get more test coverage.

Tang,  Yasuaki, Andrew,

Please check if you are ok with attached reverting patch.

Tim, Don,
Can you try if attached reverting patch fix all the problems for you ?

Thanks

Yinghai


revert_movable_map.patch
Description: Binary data


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yasuaki Ishimatsu

2013/02/27 14:11, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:

2013/02/27 13:04, Yinghai Lu wrote:


On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:


2013/02/27 11:30, Yinghai Lu wrote:


Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot
plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.




In this case, system can boot. But other cpus with bunch of ram hot
plug may fails, since system does not have enough memory for cover
hot added memory. When hot adding memory device, kernel object for the
memory is allocated from 1G ram since hot added memory has not been
enabled.



yes, it may fail, if the one node memory need page table and vmemmap
is more than 1g ...






for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
init_mem_mapping() for booting path.
2. we need make memblock more generic, so we can use it with hot add
memory during runtime.
3. with that we can initialize page table for hot added node with ram.
a. initial page table for 2M near node top is from node0 ( that does
not support hot plug).
b. then will use 2M for memory below node top...
c. with that we will make sure page table stay on local node.
   alloc_low_pages need to be updated to support that.
4. need to make sure vmemmap on local node too.



I think so too. By this, memory hot plug becomes more useful.


I agree with your idea. But I think the above ideas are future work.
So at first we should use movable memory for memory hot plug.
After that, we will implement the above ideas.





so hot-remove node will work too later.

In the long run, we should make booting path and hot adding more
similar and share at most code.
That will make code get more test coverage.


Tang,  Yasuaki, Andrew,

Please check if you are ok with attached reverting patch.


We will fix this problem with no objection. So please wait a while.

And the problem is caused by movablemem_map=srat, not by
movablemem_map=nn[KMG]@ss[KMG].
So if you want to revert, you should at least revert only the
movablemem_map=srat part.

Thanks,
Yasuaki Ishimatsu  



Tim, Don,
Can you try if attached reverting patch fix all the problems for you ?

Thanks

Yinghai






Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 9:49 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:
 2013/02/27 14:11, Yinghai Lu wrote:

 On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
 isimatu.yasu...@jp.fujitsu.com wrote:

 2013/02/27 13:04, Yinghai Lu wrote:


 On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
 isimatu.yasu...@jp.fujitsu.com wrote:


 2013/02/27 11:30, Yinghai Lu wrote:


 Do you mean you can not boot one socket system with 1G ram ?
 Assume socket 0 does not support hotplug, other 31 sockets support hot
 plug.

 So we could boot system only with socket0, and later one by one hot
 add other cpus.




 In this case, system can boot. But other cpus with bunch of ram hot
 plug may fails, since system does not have enough memory for cover
 hot added memory. When hot adding memory device, kernel object for the
 memory is allocated from 1G ram since hot added memory has not been
 enabled.


 yes, it may fail, if the one node memory need page table and vmemmap
 is more than 1g ...



 for hot add memory we need to
 1. add another wrapper for init_memory_mapping, just like
 init_mem_mapping() for booting path.
 2. we need make memblock more generic, so we can use it with hot add
 memory during runtime.
 3. with that we can initialize page table for hot added node with ram.
 a. initial page table for 2M near node top is from node0 ( that does
 not support hot plug).
 b. then will use 2M for memory below node top...
 c. with that we will make sure page table stay on local node.
alloc_low_pages need to be updated to support that.
 4. need to make sure vmemmap on local node too.



 I think so too. By this, memory hot plug becomes more useful.


 I agree with your idea. But I think above ideas is future work.
 So at first we should use movable memory for memory hot plug.
 After that, we will implement above ideas.




 so hot-remove node will work too later.

 In the long run, we should make booting path and hot adding more
 similar and share at most code.
 That will make code get more test coverage.


 Tang,  Yasuaki, Andrew,

 Please check if you are ok with attached reverting patch.


 We will fix this problem with no objection. So please wait a while.

 And the problem occurs by movablemem_map=srat not
 movablemem_map=nn[KMG]@ss[KMG]
 At least, if you want to revert it, you should revert only
 movablemem_map=srat part.

Those patches are tangled together.

Also it looks funny to ask user to specify mem range in boot command
line to enable mem hotplug.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Tang Chen

On 02/27/2013 02:54 PM, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 9:49 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com  wrote:

2013/02/27 14:11, Yinghai Lu wrote:


On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com  wrote:


2013/02/27 13:04, Yinghai Lu wrote:



On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com  wrote:



2013/02/27 11:30, Yinghai Lu wrote:



Do you mean you can not boot one socket system with 1G ram ?
Assume socket 0 does not support hotplug, other 31 sockets support hot
plug.

So we could boot system only with socket0, and later one by one hot
add other cpus.





In this case, system can boot. But other cpus with bunch of ram hot
plug may fails, since system does not have enough memory for cover
hot added memory. When hot adding memory device, kernel object for the
memory is allocated from 1G ram since hot added memory has not been
enabled.



yes, it may fail, if the one node memory need page table and vmemmap
is more than 1g ...






for hot add memory we need to
1. add another wrapper for init_memory_mapping, just like
init_mem_mapping() for booting path.
2. we need make memblock more generic, so we can use it with hot add
memory during runtime.
3. with that we can initialize page table for hot added node with ram.
a. initial page table for 2M near node top is from node0 ( that does
not support hot plug).
b. then will use 2M for memory below node top...
c. with that we will make sure page table stay on local node.
alloc_low_pages need to be updated to support that.
4. need to make sure vmemmap on local node too.




I think so too. By this, memory hot plug becomes more useful.



I agree with your idea. But I think above ideas is future work.
So at first we should use movable memory for memory hot plug.
After that, we will implement above ideas.






so hot-remove node will work too later.

In the long run, we should make booting path and hot adding more
similar and share at most code.
That will make code get more test coverage.



Tang,  Yasuaki, Andrew,

Please check if you are ok with attached reverting patch.



We will fix this problem with no objection. So please wait a while.

And the problem occurs by movablemem_map=srat not
movablemem_map=nn[KMG]@ss[KMG]
At least, if you want to revert it, you should revert only
movablemem_map=srat part.


Those patches are tangled together.


No, they are not.

The following commits support movablemem_map=nn[KMG]@ss[KMG].

commit fb06bc8e5f42f38c011de0e59481f464a82380f6
    page_alloc: bootmem limit with movablecore_map
commit 42f47e27e761fee07da69e04612ec7dd0d490edd
    page_alloc: make movablemem_map have higher priority
commit 6981ec31146cf19454c55c130625f6cee89aab95
    page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
    page_alloc: add movable_memmap kernel parameter
commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
    x86: get pg_data_t's memory from other node

And the following support movablemem_map=srat.

commit f7210e6c4ac795694106c1c5307134d3fc233e88
    mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
    acpi, memory-hotplug: support getting hotplug info from SRAT
commit 27168d38fa209073219abedbe6a9de7ba9acbfad
    acpi, memory-hotplug: extend movablemem_map ranges to the end of node
commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
    acpi, memory-hotplug: parse SRAT before memblock is ready



Also it looks funny to ask user to specify mem range in boot command
line to enable mem hotplug.


Well, I think sometimes users don't like the SRAT memory style, and want to
increase or reduce hot-pluggable memory by themselves. It is also useful
for debugging firmware bugs.

I agree that the movablemem_map=srat functionality needs more work.
Can we not revert it, and improve it during 3.9rc ? I think during rc time,
at least we can fix the problems brought by early_parse_srat().

Thanks. :)


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Yinghai Lu
On Tue, Feb 26, 2013 at 11:11 PM, Tang Chen tangc...@cn.fujitsu.com wrote:
 On 02/27/2013 02:54 PM, Yinghai Lu wrote:

 Those patches are tangled together.


 No, they are not.

 The following commits supports movablemem_map=nn[KMG]@ss[KMG].

 commit fb06bc8e5f42f38c011de0e59481f464a82380f6
 page_alloc: bootmem limit with movablecore_map
 commit 42f47e27e761fee07da69e04612ec7dd0d490edd
 page_alloc: make movablemem_map have higher priority
 commit 6981ec31146cf19454c55c130625f6cee89aab95
 page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
 commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
 page_alloc: add movable_memmap kernel parameter
 commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
 x86: get pg_data_t's memory from other node

 And the following supports movablemem_map=srat.

 commit f7210e6c4ac795694106c1c5307134d3fc233e88
 mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect 
 movablecore_map in memblock_overlaps_region().
 commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
 acpi, memory-hotplug: support getting hotplug info from SRAT
 commit 27168d38fa209073219abedbe6a9de7ba9acbfad
 acpi, memory-hotplug: extend movablemem_map ranges to the end of node
 commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
 acpi, memory-hotplug: parse SRAT before memblock is ready

those four can be reverted cleanly?



 Also it looks funny to ask user to specify mem range in boot command
 line to enable mem hotplug.


 Well, I think sometimes users don't like the SRAT memory style, and want to
 increase or reduce hot-pluggable memory by themselves. And also, it is
 useful
 for debuging firmware bugs.

 I agree that movablemem_map=srat functionality need more work to improve.
 Can we not revert it, and improve it during 3.9rc ? I think during rc time,
 at least we can fix the problems brought by early_parse_srat().

looks like acpi_override can not be fixed.

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Tang Chen

On 02/27/2013 03:25 PM, Yinghai Lu wrote:

On Tue, Feb 26, 2013 at 11:11 PM, Tang Chen tangc...@cn.fujitsu.com wrote:

On 02/27/2013 02:54 PM, Yinghai Lu wrote:


Those patches are tangled together.



No, they are not.

The following commits supports movablemem_map=nn[KMG]@ss[KMG].

commit fb06bc8e5f42f38c011de0e59481f464a82380f6
 page_alloc: bootmem limit with movablecore_map
commit 42f47e27e761fee07da69e04612ec7dd0d490edd
 page_alloc: make movablemem_map have higher priority
commit 6981ec31146cf19454c55c130625f6cee89aab95
 page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
 page_alloc: add movable_memmap kernel parameter
commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
 x86: get pg_data_t's memory from other node

And the following supports movablemem_map=srat.

commit f7210e6c4ac795694106c1c5307134d3fc233e88
 mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect 
movablecore_map in memblock_overlaps_region().
commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
 acpi, memory-hotplug: support getting hotplug info from SRAT
commit 27168d38fa209073219abedbe6a9de7ba9acbfad
 acpi, memory-hotplug: extend movablemem_map ranges to the end of node
commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
 acpi, memory-hotplug: parse SRAT before memblock is ready


those four can be reverted cleanly?


Sorry, if you want to revert, you just need to revert:
 commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
  acpi, memory-hotplug: parse SRAT before memblock is ready
 commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
  acpi, memory-hotplug: support getting hotplug info from SRAT

The other two have nothing to do with SRAT. And they are necessary.

Seeing from the code, I think it is clean. But we'd better test it.







Also it looks funny to ask user to specify mem range in boot command
line to enable mem hotplug.



Well, I think sometimes users don't like the SRAT memory style, and want to
increase or reduce hot-pluggable memory by themselves. And also, it is
useful
for debuging firmware bugs.

I agree that movablemem_map=srat functionality need more work to improve.
Can we not revert it, and improve it during 3.9rc ? I think during rc time,
at least we can fix the problems brought by early_parse_srat().


looks like acpi_override can not be fixed.


I need to investigate this problem, but I think we can give it a try.


I do hope we can keep these patches and do the improvement work later. :)


Thanks. :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-26 Thread Lai Jiangshan
On 02/27/2013 01:11 PM, Yinghai Lu wrote:
 On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
 isimatu.yasu...@jp.fujitsu.com wrote:
 2013/02/27 13:04, Yinghai Lu wrote:

 On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
 isimatu.yasu...@jp.fujitsu.com wrote:

 2013/02/27 11:30, Yinghai Lu wrote:

 Do you mean you can not boot one socket system with 1G ram ?
 Assume socket 0 does not support hotplug, other 31 sockets support hot
 plug.

 So we could boot system only with socket0, and later one by one hot
 add other cpus.



 In this case, system can boot. But other cpus with bunch of ram hot
 plug may fails, since system does not have enough memory for cover
 hot added memory. When hot adding memory device, kernel object for the
 memory is allocated from 1G ram since hot added memory has not been
 enabled.


 yes, it may fail, if the one node memory need page table and vmemmap
 is more than 1g ...


 for hot add memory we need to
 1. add another wrapper for init_memory_mapping, just like
 init_mem_mapping() for booting path.
 2. we need make memblock more generic, so we can use it with hot add
 memory during runtime.
 3. with that we can initialize page table for hot added node with ram.
 a. initial page table for 2M near node top is from node0 ( that does
 not support hot plug).
 b. then will use 2M for memory below node top...
 c. with that we will make sure page table stay on local node.
   alloc_low_pages need to be updated to support that.
 4. need to make sure vmemmap on local node too.


 I think so too. By this, memory hot plug becomes more useful.


 so hot-remove node will work too later.

 In the long run, we should make booting path and hot adding more
 similar and share at most code.
 That will make code get more test coverage.
 
 Tang,  Yasuaki, Andrew,
 
 Please check if you are ok with attached reverting patch.
 
 Tim, Don,
 Can you try if attached reverting patch fix all the problems for you ?
 


Hi, Yinghai, Andrew

In the mails and the changelog of the revert patch, I think Yinghai
is mainly worried about 3 problems.

1) The current implementation has bugs and bad code.

Yes. Any bug should be fixed. We should either fix it directly, or
revert the related patches and then send fixed patches.

But only one or two patches are involved, so it is not a good idea
to revert the whole patchset or the whole feature. Right?

Thank you all for pointing out the bug. We are working on a fix.

2) A lot of memory could be placed in hotpluggable memory, but we have not
   moved it there yet: vmemmap, some page tables, etc.

This is a restriction in the current kernel, and we can't convert them
quickly. We must convert them step by step. For example, we are converting
the memory of page_cgroup to hotpluggable memory.


3) If the user (or firmware) specifies too little un-hotpluggable memory, the
   system can't work, or can't even boot.

Any feature or system has its own minimum requirements. The user should
meet those requirements and specify more un-hotpluggable memory,
so I don't think it is a problem in kernel land.

But problem 2) above makes this feature's minimum requirements
much higher. That is the real thing Yinghai worries about.

But all systems that use this feature can meet this higher requirement
very easily. Users have to specify enough un-hotpluggable memory both
before and after we lower the minimum requirements.

The whole feature works very well if the user specifies enough
un-hotpluggable memory, so problems 2) and 3) are not urgent.
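
As a rough illustration of how large that minimum can be, the sketch below
estimates the vmemmap needed for a given amount of hot-added memory. The
4 KiB page size and the 64-byte per-page descriptor are assumptions (the
descriptor size depends on architecture and configuration), page tables are
ignored, and both function names are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE        4096ULL /* assumed 4 KiB base pages */
#define MODEL_STRUCT_PAGE_SIZE 64ULL   /* assumed size of one page descriptor */

/* Bytes of vmemmap needed to describe mem_bytes of memory. */
static uint64_t vmemmap_bytes(uint64_t mem_bytes)
{
	return mem_bytes / MODEL_PAGE_SIZE * MODEL_STRUCT_PAGE_SIZE;
}

/* Largest amount of memory whose vmemmap fits into budget_bytes. */
static uint64_t max_mem_for_budget(uint64_t budget_bytes)
{
	return budget_bytes / MODEL_STRUCT_PAGE_SIZE * MODEL_PAGE_SIZE;
}
```

Under these assumptions, 1 GiB of un-hotpluggable memory is used up by the
vmemmap of 64 GiB of hot-added memory, before counting page tables or
anything else the kernel needs.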

Our team has another problem: we are still not good at community work
(for example, the patch TITLE is totally misleading), but we are learning.
We are sorry, and thank you for pointing out the mistakes.

The feature/patchset does have problems, but it is not good to tangle
all the problems together and revert the whole feature.

Thanks,
Lai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Yasuaki Ishimatsu

Hi Yinghai,

2013/02/26 15:57, Yinghai Lu wrote:

On Mon, Feb 25, 2013 at 10:09 PM, Tang Chen  wrote:

On 02/26/2013 12:51 PM, Martin Bligh wrote:


Do you mean we can remove numaq x86 32bit code now?



Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
Was useful in the early days of getting NUMA up and running on Linux,
but is now too old to be a museum piece, really.

M.



Hi Martin, Yinghai,

It was me that I failed to make numa_init() fall back path working, and forgot
to call early_parse_srat in ia64. Sorry for the breaking of other platform. :)

So now, is Yinghai's patch enough for this problem ?
Or we can encapsulate the following clear up work into one function ?


+   for (i = 0; i < MAX_LOCAL_APIC; i++)
+           set_apicid_to_node(i, NUMA_NO_NODE);
+   nodes_clear(numa_nodes_parsed);
+   memset(&numa_meminfo, 0, sizeof(numa_meminfo));




That is temporary workaround and your patch and this workaround make
x86 acpi numa init too messy.

I don't see the point to hack SRAT to make memory hotplug working.

Do you guys check and use PMTT in ACPI spec instead?


I read the PMTT specification in ACPI spec revision 5.0, but this table
does not carry hot-plug information, so we cannot tell from it which memory
devices can be hot-plugged.

Thanks,
Yasuaki Ishimatsu



Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Tang Chen

On 02/26/2013 02:57 PM, Yinghai Lu wrote:

That is temporary workaround and your patch and this workaround make
x86 acpi numa init too messy.

I don't see the point to hack SRAT to make memory hotplug working.

Do you guys check and use PMTT in ACPI spec instead?


Hi Yinghai,

Thanks for the suggestion. :)

The reason we use SRAT is that we need its hot-pluggable bit.
I didn't find such info in PMTT or elsewhere.

We use SRAT this way to satisfy users who don't want to specify
physical address ranges on the kernel command line. They want SRAT to
determine which memory is hot-pluggable and which is not.

To achieve this, we have to make sure the SRAT info is available before
memblock starts to allocate memory, so that we can prevent memblock from
allocating memory in the hot-pluggable area. That is why I have to parse
SRAT earlier.
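
The constraint described above can be sketched as a top-down search that
skips hot-pluggable ranges. This is a userspace model under assumptions, not
memblock's actual API: `struct mem_range` and `find_region_avoiding()` are
hypothetical names, and the hot-pluggable list stands in for whatever an
early SRAT parse would have recorded.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open physical range [start, end). Hypothetical type for this sketch. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Search [lo, hi) from the top down for `size` bytes that overlap none of
 * the `n` hot-pluggable ranges. On success store the base in *out and
 * return 0; return -1 if no such region exists.
 */
static int find_region_avoiding(uint64_t lo, uint64_t hi, uint64_t size,
				const struct mem_range *hotplug, size_t n,
				uint64_t *out)
{
	uint64_t base;
	size_t i;
	int moved;

	if (size == 0 || hi < lo || hi - lo < size)
		return -1;

	base = hi - size;
	do {
		moved = 0;
		for (i = 0; i < n; i++) {
			if (base < hotplug[i].end &&
			    hotplug[i].start < base + size) {
				/* Overlap: retry just below this range. */
				if (hotplug[i].start < lo + size)
					return -1;
				base = hotplug[i].start - size;
				moved = 1;
			}
		}
	} while (moved);

	*out = base;
	return 0;
}
```

With an empty list (the hot-pluggable ranges are not known yet) the search
returns the topmost region, which may sit inside memory that is meant to be
removable later. With the ranges known up front, the same request is placed
below them.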

I don't think the code is that messy. I think we can encapsulate the clean-up
job into one function, and call it where it is needed.

What do you think?

Thanks. :)


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Yinghai Lu
On Mon, Feb 25, 2013 at 10:09 PM, Tang Chen  wrote:
> On 02/26/2013 12:51 PM, Martin Bligh wrote:
>>>
>>> Do you mean we can remove numaq x86 32bit code now?
>>
>>
>> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
>> Was useful in the early days of getting NUMA up and running on Linux,
>> but is now too old to be a museum piece, really.
>>
>> M.
>>
>
> Hi Martin, Yinghai,
>
> It was me that I failed to make numa_init() fall back path working, and forgot
> to call early_parse_srat in ia64. Sorry for the breaking of other platform. :)
>
> So now, is Yinghai's patch enough for this problem ?
> Or we can encapsulate the following clear up work into one function ?
>
>
> +   for (i = 0; i < MAX_LOCAL_APIC; i++)
> +           set_apicid_to_node(i, NUMA_NO_NODE);
> +   nodes_clear(numa_nodes_parsed);
> +   memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>
>

That is a temporary workaround, and your patch plus this workaround make
x86 ACPI NUMA init too messy.

I don't see the point of hacking SRAT to make memory hotplug work.

Did you check whether you could use PMTT from the ACPI spec instead?

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Tang Chen

On 02/26/2013 12:51 PM, Martin Bligh wrote:

Do you mean we can remove numaq x86 32bit code now?


Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
Was useful in the early days of getting NUMA up and running on Linux,
but is now too old to be a museum piece, really.

M.



Hi Martin, Yinghai,

It was me who failed to make the numa_init() fallback path work, and forgot
to call early_parse_srat on ia64. Sorry for breaking other platforms. :)


So now, is Yinghai's patch enough for this problem?
Or can we encapsulate the following clean-up work into one function?

+   for (i = 0; i < MAX_LOCAL_APIC; i++)
+           set_apicid_to_node(i, NUMA_NO_NODE);
+   nodes_clear(numa_nodes_parsed);
+   memset(&numa_meminfo, 0, sizeof(numa_meminfo));
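
One possible shape for such a helper is sketched below as a userspace model,
not as kernel code. The array sizes, `struct numa_meminfo_model`,
`numa_reset_state()` and `numa_init_with_fallback()` are all hypothetical
stand-ins. The sketch only shows the idea of running the three clean-up steps
before each detection method is tried.

```c
#include <string.h>

/* Stand-ins for the kernel objects named above; sizes are assumptions. */
#define MAX_LOCAL_APIC 8
#define MAX_NODES      4
#define NUMA_NO_NODE   (-1)

struct numa_meminfo_model {
	int nr_blks;
};

static int apicid_to_node[MAX_LOCAL_APIC];
static unsigned char nodes_parsed[MAX_NODES];
static struct numa_meminfo_model meminfo;

/* Hypothetical helper: the three clean-up steps in one place. */
static void numa_reset_state(void)
{
	int i;

	for (i = 0; i < MAX_LOCAL_APIC; i++)
		apicid_to_node[i] = NUMA_NO_NODE;
	memset(nodes_parsed, 0, sizeof(nodes_parsed));
	memset(&meminfo, 0, sizeof(meminfo));
}

/*
 * Try each detection method in order. State is reset before every attempt,
 * so a method that fails cannot leave partial results for the next one.
 * Returns the index of the method that succeeded, or -1.
 */
static int numa_init_with_fallback(int (*const methods[])(void), int count)
{
	int i;

	for (i = 0; i < count; i++) {
		numa_reset_state();
		if (methods[i]() == 0)
			return i;
	}
	return -1;
}

/* Example method that fails after writing partial state. */
static int failing_method(void)
{
	apicid_to_node[1] = 3;
	nodes_parsed[3] = 1;
	meminfo.nr_blks = 2;
	return -1;
}

/* Example last-resort method: everything on node 0. */
static int dummy_method(void)
{
	apicid_to_node[0] = 0;
	nodes_parsed[0] = 1;
	meminfo.nr_blks = 1;
	return 0;
}
```

Calling `numa_init_with_fallback()` with `failing_method` followed by
`dummy_method` ends with only the dummy method's results in the tables.
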


Thanks. :)


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Martin Bligh
> Do you mean we can remove numaq x86 32bit code now?

Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
Was useful in the early days of getting NUMA up and running on Linux,
but is now too old to be a museum piece, really.

M.


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Yinghai Lu
On Mon, Feb 25, 2013 at 7:21 PM, Martin Bligh  wrote:
 4, it does not CC to TJ and other numa guys...
>>>
>>> attached workaround the problem for now.
>>> but it will assume NUMAQ would not have SRAT table.
>>
>>  Martin, can you confirm that numaq does not have srat?
>
> No, it's pre-SRAT. I forget the exact name of the table, but no SRAT until 
> x440.
>
> OTOH, you should probably feel free to break it by now, I can't
> imagine they are any use to man nor beast any more.

Do you mean we can remove numaq x86 32bit code now?

Thanks

Yinghai


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Martin Bligh
>>> 4, it does not CC to TJ and other numa guys...
>>
>> attached workaround the problem for now.
>> but it will assume NUMAQ would not have SRAT table.
>
>  Martin, can you confirm that numaq does not have srat?

No, it's pre-SRAT. I forget the exact name of the table, but no SRAT until x440.

OTOH, you should probably feel free to break it by now, I can't
imagine they are any use to man nor beast any more.

M.


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Yinghai Lu
[ Add new address with Martin]

On Mon, Feb 25, 2013 at 4:35 PM, Yinghai Lu  wrote:
> On Mon, Feb 25, 2013 at 2:50 PM, Yinghai Lu  wrote:
>> On Mon, Feb 25, 2013 at 1:27 PM, Don Morris  wrote:
>>> On 02/25/2013 10:32 AM, Tim Gardner wrote:
 On 02/25/2013 08:02 AM, Tim Gardner wrote:
> Is this an expected warning ? I'll boot a vanilla kernel just to be sure.
>
> rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:
>

 Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
 is having an impact:
>>>
>>> Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
>>> still Sandy Bridge, though I don't think that matters).
>>>
>>> Bisection leads to:
>>> # bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
>>> parse SRAT before memblock is ready
>>>
>>> Nothing terribly obvious leaps out as to *why* that reshuffling messes
>>> up the cpu<-->node bindings, but I wanted to put this out there while
>>> I poke around further. [Note that the SRAT: PXM -> APIC -> Node print
>>> outs during boot are the same either way -- if you look at the APIC
>>> numbers of the processors (from /proc/cpuinfo), the processors should
>>> be assigned to the correct node, but they aren't.] cc'ing Tang Chen
>>> in case this is obvious to him or he's already fixed it somewhere not
>>> on Linus's tree yet.
>>>
>>> Don Morris
>>>

 [0.170435] [ cut here ]
 [0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
 topology_sane.isra.2+0x71/0x84()
 [0.170452] Hardware name: S2600CP
 [0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
 node! [node: 1 != 0]. Ignoring dependency.
 [0.156000] smpboot: Booting Node   1, Processors  #1
 [0.170455] Modules linked in:
 [0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
 [0.170461] Call Trace:
 [0.170466]  [] warn_slowpath_common+0x7f/0xc0
 [0.170473]  [] warn_slowpath_fmt+0x46/0x50
 [0.170477]  [] topology_sane.isra.2+0x71/0x84
 [0.170482]  [] set_cpu_sibling_map+0x23f/0x436
 [0.170487]  [] start_secondary+0x137/0x201
 [0.170502] ---[ end trace 09222f596307ca1d ]---
>>
>> that commit is totally broken, and it should be reverted.
>>
>> 1. numa_init is called several times, NOT just for srat. so those
>>     nodes_clear(numa_nodes_parsed)
>>     memset(&numa_meminfo, 0, sizeof(numa_meminfo))
>> can not be just removed.
>> please consider sequence is: numaq, srat, amd, dummy.
>> You need to make fall back path working!
>>
>> 2. simply split acpi_numa_init to early_parse_srat.
>> a. that early_parse_srat is NOT called for ia64, so you break ia64.
>> b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
>>  set_apicid_to_node(i, NUMA_NO_NODE)
>> still left in numa_init. So it will just clear result from early_parse_srat.
>> it should be moved before that
>>
>> 3. that patch TITLE is total misleading, there is NO x86 in the title,
>> but it changes
>> to x86 code.
>>
>> 4, it does not CC to TJ and other numa guys...
>
> attached workaround the problem for now.
> but it will assume NUMAQ would not have SRAT table.
>

 Martin, can you confirm that numaq does not have srat?

Thanks

Yinghai


x.patch
Description: Binary data


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Tang Chen


[0.170435] [ cut here ]
[0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
topology_sane.isra.2+0x71/0x84()
[0.170452] Hardware name: S2600CP
[0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
node! [node: 1 != 0]. Ignoring dependency.
[0.156000] smpboot: Booting Node   1, Processors  #1
[0.170455] Modules linked in:
[0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
[0.170461] Call Trace:
[0.170466]  [] warn_slowpath_common+0x7f/0xc0
[0.170473]  [] warn_slowpath_fmt+0x46/0x50
[0.170477]  [] topology_sane.isra.2+0x71/0x84
[0.170482]  [] set_cpu_sibling_map+0x23f/0x436
[0.170487]  [] start_secondary+0x137/0x201
[0.170502] ---[ end trace 09222f596307ca1d ]---


that commit is totally broken, and it should be reverted.

1. numa_init is called several times, NOT just for srat. so those
    nodes_clear(numa_nodes_parsed)
    memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed.
please consider sequence is: numaq, srat, amd, dummy.
You need to make fall back path working!

2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
        set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that

3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes
to x86 code.

4, it does not CC to TJ and other numa guys...
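
The ordering problem in point 2b can be modelled in a few lines. This is a
sketch under assumptions: the table size, the APIC-to-node layout and all
`model_*` names are hypothetical, and the two wrapper functions only mirror
the order of operations described in the review.

```c
#define MAX_LOCAL_APIC 4
#define NUMA_NO_NODE   (-1)

/* Model of the APIC-ID-to-node table; the size is an assumption. */
static int apicid_to_node[MAX_LOCAL_APIC];

/* Stand-in for the early SRAT parse: APIC IDs 0-1 on node 0, 2-3 on node 1. */
static void model_early_parse_srat(void)
{
	int i;

	for (i = 0; i < MAX_LOCAL_APIC; i++)
		apicid_to_node[i] = i / 2;
}

/* Stand-in for the reset loop that stayed in numa_init(). */
static void model_clear_apicid_map(void)
{
	int i;

	for (i = 0; i < MAX_LOCAL_APIC; i++)
		apicid_to_node[i] = NUMA_NO_NODE;
}

/* Order described as broken: the reset runs after the parse and wipes it. */
static void broken_order(void)
{
	model_early_parse_srat();
	model_clear_apicid_map();
}

/* Order asked for in the review: reset first, then parse. */
static void fixed_order(void)
{
	model_clear_apicid_map();
	model_early_parse_srat();
}
```

After `broken_order()` every entry is `NUMA_NO_NODE`, so later CPU-to-node
lookups have nothing to go on. After `fixed_order()` the parsed mapping
survives.
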


Hi Yinghai, Don,

OK, I see this. I'll fix it soon. :)

Thanks. :)


Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Yinghai Lu
On Mon, Feb 25, 2013 at 2:50 PM, Yinghai Lu  wrote:
> On Mon, Feb 25, 2013 at 1:27 PM, Don Morris  wrote:
>> On 02/25/2013 10:32 AM, Tim Gardner wrote:
>>> On 02/25/2013 08:02 AM, Tim Gardner wrote:
 Is this an expected warning ? I'll boot a vanilla kernel just to be sure.

 rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:

>>>
>>> Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
>>> is having an impact:
>>
>> Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
>> still Sandy Bridge, though I don't think that matters).
>>
>> Bisection leads to:
>> # bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
>> parse SRAT before memblock is ready
>>
>> Nothing terribly obvious leaps out as to *why* that reshuffling messes
>> up the cpu<-->node bindings, but I wanted to put this out there while
>> I poke around further. [Note that the SRAT: PXM -> APIC -> Node print
>> outs during boot are the same either way -- if you look at the APIC
>> numbers of the processors (from /proc/cpuinfo), the processors should
>> be assigned to the correct node, but they aren't.] cc'ing Tang Chen
>> in case this is obvious to him or he's already fixed it somewhere not
>> on Linus's tree yet.
>>
>> Don Morris
>>
>>>
>>> [0.170435] [ cut here ]
>>> [0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
>>> topology_sane.isra.2+0x71/0x84()
>>> [0.170452] Hardware name: S2600CP
>>> [0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
>>> node! [node: 1 != 0]. Ignoring dependency.
>>> [0.156000] smpboot: Booting Node   1, Processors  #1
>>> [0.170455] Modules linked in:
>>> [0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
>>> [0.170461] Call Trace:
>>> [0.170466]  [] warn_slowpath_common+0x7f/0xc0
>>> [0.170473]  [] warn_slowpath_fmt+0x46/0x50
>>> [0.170477]  [] topology_sane.isra.2+0x71/0x84
>>> [0.170482]  [] set_cpu_sibling_map+0x23f/0x436
>>> [0.170487]  [] start_secondary+0x137/0x201
>>> [0.170502] ---[ end trace 09222f596307ca1d ]---
>
> that commit is totally broken, and it should be reverted.
>
> 1. numa_init is called several times, NOT just for srat. so those
>     nodes_clear(numa_nodes_parsed)
>     memset(&numa_meminfo, 0, sizeof(numa_meminfo))
> can not be just removed.
> please consider sequence is: numaq, srat, amd, dummy.
> You need to make fall back path working!
>
> 2. simply split acpi_numa_init to early_parse_srat.
> a. that early_parse_srat is NOT called for ia64, so you break ia64.
> b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
>  set_apicid_to_node(i, NUMA_NO_NODE)
> still left in numa_init. So it will just clear result from early_parse_srat.
> it should be moved before that
>
> 3. that patch TITLE is total misleading, there is NO x86 in the title,
> but it changes
> to x86 code.
>
> 4, it does not CC to TJ and other numa guys...

attached workaround the problem for now.
but it will assume NUMAQ would not have SRAT table.

Martin, can you confirm that numaq does not have srat?

Yinghai


x.patch
Description: Binary data

