Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-15 Thread Michal Hocko
On Tue 14-03-17 14:20:14, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 13:28:25 +0100
> Michal Hocko  wrote:
> 
> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> > > On Thu, 9 Mar 2017 13:54:00 +0100
> > > Michal Hocko  wrote:
[...]
> > > > The kernel is supposed to provide a proper API and that is sysfs
> > > > currently. I am not entirely happy about it either but pulling a lot of
> > > > code into the kernel is not the right thing to do. Especially when
> > > > different usecases require different treatment.  
> > >
> > > If it could be done from kernel side alone, it looks like a better way
> > > to me not to involve userspace at all. And for ACPI based x86/ARM it's
> > > possible to implement without adding a lot of kernel code.  
> > 
> > But this is not how we do the kernel development. We provide the API so
> > that userspace can implement the appropriate policy on top. We do not
> > add random knobs to implement the same thing in the kernel. Different
> > users might want to implement different onlining strategies and that is
> > hardly describable by a single global knob. Just look at the s390
> > example provided earlier. Please try to think out of your usecase scope.
>
> And could you think outside of legacy sysfs based onlining usecase scope?

Well, I always prefer a more generic solution which supports more
usecases and I am trying really hard to understand usecases you are
coming up with. So far I have heard that the current sysfs behavior is
broken (which is true!) and some very vague arguments about why we need
to online as quickly as possible to the point that userspace handling is
an absolute no go.

To be honest, I still consider the latter a non-issue. If the only thing
you care about is the memory footprint of the first phase then I believe
this is fixable. Memblock and section descriptors should be the only
necessary thing to allocate and that is not much.

As an aside, the more I think about the way the original authors
separated the physical hotadd from onlining the more I appreciate that
decision, because the way memory can be onlined is definitely not
carved in stone and evolves with usecases. I believe nobody expected
that memory could be onlined as movable back then and I am pretty sure
new ways will emerge over time.
 
> I don't think that comparing s390 with x86 is correct, as the platforms
> and hardware implementations of memory hotplug are different, with
> correspondingly different requirements; hence
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> was introduced to allow a platform to specify the behavior.

There are different usecases which are arch agnostic. E.g. decide the
movability based on some criterion (e.g. specific node, physical address
range and what not). Global auto onlining cannot handle those for obvious
reasons, and a config option will not achieve that for the same
reason.
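
For illustration, such a policy is a few lines of shell over the existing
sysfs interface; this is only a sketch (the node number is an arbitrary
example, and the sysfs paths are the standard memory-block layout):

  #!/bin/sh
  # Example policy: online every still-offline memory block that belongs
  # to node 1 as movable; other blocks are left to whatever other policy.
  NODE=1
  for state in /sys/devices/system/node/node${NODE}/memory*/state; do
      [ -e "$state" ] || continue
      if [ "$(cat "$state")" = "offline" ]; then
          echo online_movable > "$state"
      fi
  done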

> For x86/ARM (+ACPI) it's possible to implement hotplug in a race-free
> way inside the kernel without userspace intervention, onlining memory
> using a hardware-vendor-defined policy (the ACPI SRAT/Memory device
> describes memory sufficiently to do it) so the user won't have to do it
> manually; the config option is a convenient way to enable the new
> feature for platforms that can support it.

Sigh. Can you see the actual difference between the global kernel policy
and the policy coming from the specific hardware (ACPI etc...)? I am not
opposing auto onlining based on the ACPI attributes. But what we have
now is a policy _in_the_kernel_. This is almost always a bad idea and
I do not see any strong argument why it would be any different in this
particular case. Actually your current default in Fedora makes it harder
for anybody to use movable zones/nodes.

> It's good to maintain a uniform API to userspace as long as the API does
> the job, but being stuck with the legacy way isn't good when there is
> a way (even though it's limited to a subset of platforms) to improve
> things by removing the need for the API, making the overall system
> less complex and race-free (more reliable).

Then convince your virtualization platform to provide the necessary data
for memory auto onlining via ACPI etc...

> > > That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> > > so we could continue improving kernel-only auto-onlining
> > > and fixing current memory hot(un)plug issues without affecting
> > > other platforms/users that are not interested in it.
> > 
> > I really do not see any reason to keep the config option. Setting
> > this to enabled is the _wrong_ thing to do in a general purpose
> > (distribution) kernel, and a kernel for a specific usecase can achieve
> > the same thing via the boot command line.
>
> I have to disagree with you that setting the policy 'not online by default'
> in the kernel is more valid than the opposite policy 'online by default'.
> It may work for your usecases, but that doesn't mean it suits the
> needs of others.

Well, as described above there are good reasons to not 

Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-14 Thread Igor Mammedov
On Mon, 13 Mar 2017 13:28:25 +0100
Michal Hocko  wrote:

> On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> > On Thu, 9 Mar 2017 13:54:00 +0100
> > Michal Hocko  wrote:
> > 
> > [...]  
> > > > It's major regression if you remove auto online in kernels that
> > > > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > > > while breaking useful functionality doesn't make sense.
> > > > 
> > > > I would ACK config option removal if auto online keeps working
> > > > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > > > and keep kernel CLI option to override default.
> > > > 
> > > > That doesn't mean that others will agree with flipping default,
> > > > that's why config option has been added.
> > > > 
> > > > Now to sum up what's been discussed on this thread, there were 2
> > > > different issues discussed:
> > > >   1) memory hotplug: remove in kernel auto online for all
> > > >  except of hyperv/xen
> > > > 
> > > >- suggested RFC is not acceptable from virt point of view
> > > >  as it regresses guests on top of x86 kvm/vmware which
> > > >  both use ACPI based memory hotplug.
> > > > 
> > > >- udev/userspace solution doesn't work in practice as it's
> > > >  too slow and unreliable when system is under load which
> > > >  is quite common in virt usecase. That's why auto online
> > > >  has been introduced in the first place.
> > > 
> > > Please try to be more specific why "too slow" is a problem. Also how
> > > much slower are we talking about?  
> >
> > In virt case on host with lots VMs, userspace handler
> > processing could be scheduled late enough to trigger a race
> > between (guest memory going away/OOM handler) and memory
> > coming online.  
> 
> Either you are mixing two things together or this doesn't really make
> much sense. So is this a balloning based on memory hotplug (aka active
> memory hotadd initiated between guest and host automatically) or a guest
> asking for additional memory by other means (pay more for memory etc.)?
> Because if this is an administrative operation then I seriously question
> this reasoning.
It doesn't have to be a user-initiated action; think of pay-as-you-go
phone plans. The same use case applies to cloud environments, where
hotplug, typically a user-initiated action on baremetal, could easily
be automated to hotplug on demand.
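
For the QEMU/KVM case such an on-demand hot-add is just two monitor
commands; a rough sketch via libvirt's HMP passthrough (domain name,
ids and size are placeholders, and the guest must have been started
with free DIMM slots and maxmem headroom):

  # hot-add a 1G DIMM to a running guest
  virsh qemu-monitor-command guest1 --hmp 'object_add memory-backend-ram,id=mem1,size=1G'
  virsh qemu-monitor-command guest1 --hmp 'device_add pc-dimm,id=dimm1,memdev=mem1'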


> [...]
> > > >- currently if one wants to use online_movable,
> > > >  one has to either
> > > >* disable auto online in kernel OR
> > > 
> > > which might not just work because an unmovable allocation could have
> > > made the memblock pinned.  
> >
> > With memhp_default_state=offline on kernel CLI there won't be any
> > unmovable allocation as hotplugged memory won't be onlined and
> > user can online it manually. So it works for non default usecase
> > of playing with memory hot-unplug.  
> 
> I was talking about the case when the auto_online was true, of course,
> e.g. depending on the config option which you've said is enabled in
> Fedora kernels.
>
> [...] 
> > > >  I'm in favor of implementing that in kernel as it keeps
> > > >  kernel internals inside kernel and doesn't need
> > > >  kernel API to be involved (memory blocks in sysfs,
> > > >  online_kernel, online_movable)
> > > >  There would be no need in userspace which would have to
> > > >  deal with kernel zoo and maintain that as well.
> > > 
> > > The kernel is supposed to provide a proper API and that is sysfs
> > > currently. I am not entirely happy about it either but pulling a lot of
> > > code into the kernel is not the right thing to do. Especially when
> > > different usecases require different treatment.  
> >
> > If it could be done from kernel side alone, it looks like a better way
> > to me not to involve userspace at all. And for ACPI based x86/ARM it's
> > possible to implement without adding a lot of kernel code.  
> 
> But this is not how we do the kernel development. We provide the API so
> that userspace can implement the appropriate policy on top. We do not
> add random knobs to implement the same thing in the kernel. Different
> users might want to implement different onlining strategies and that is
> hardly describable by a single global knob. Just look at the s390
> example provided earlier. Please try to think out of your usecase scope.
And could you think outside of legacy sysfs based onlining usecase scope?

I don't think that comparing s390 with x86 is correct, as the platforms
and hardware implementations of memory hotplug are different, with
correspondingly different requirements; hence
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
was introduced to allow a platform to specify the behavior.

For x86/ARM (+ACPI) it's possible to implement hotplug in a race-free
way inside the kernel without userspace intervention, onlining memory

Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote:
>> >
>> > What is the API those guests ask for the memory? And who is actually
>> > responsible to ask for that memory? Is it a kernel or userspace
>> > solution?
>> 
>> Whatever, this can even be a system administrator running
>> 'free'.
>
> I am pretty sure that 'free' will not give you additional memory but
> let's be serious here... 
> If this is solely about monitoring from
> userspace and requesting more memory from the userspace then I would
> consider arguing about timely hotplug operation as void because there is
> absolutely no guarantee to do the request itself in a timely fashion.
>
>> Hyper-V driver sends si_mem_available() and
>> vm_memory_committed() metrics to the host every second and this can be
>> later queried by any tool (e.g. powershell script).
>
> And how exactly is this related to the acpi hotplug which you were
> arguing needs the timely handling as well?
>

What I meant to say is that there is no single 'right' way to get memory
usage from a VM, make a decision somewhere (in the hypervisor, on some
other host or even in someone's head) and issue a command to add more
memory. I don't know what particular tools people use with ESX/KVM VMs
but I think that multiple options are available.

>> >> With udev-style memory onlining they should be aware of page
>> >> tables and other in-kernel structures which require allocation so they
>> >> need to add memory slowly and gradually or they risk running into OOM
>> >> (at least getting some processes killed and these processes may be
>> >> important). With in-kernel memory hotplug everything happens
>> >> synchronously and no 'slowly and gradually' algorithm is required in
>> >> all tools which may trigger memory hotplug.
>> >
>> > What prevents those APIs being used reasonably and only asks so much
>> > memory as they can afford? I mean 1.5% available memory necessary for
>> > the hotplug is not all that much. Or more precisely what prevents to ask
>> > for this additional memory in a synchronous way?
>> 
>> The knowledge about the fact that we need to add memory slowly and
>> wait till it gets onlined is not obvious.
>
> yes it is and we cannot afford to give a better experience with the
> implementation that requires to have memory to online a memory.

Actually, we need memory to add memory, not to online it. And as I said
before, this is a real issue which requires addressing: it should always
be possible to add more memory (and to online already added memory if
these events are separated).

>
>> AFAIR when you hotplug memory
>> to Windows VMs there is no such thing as 'onlining', and no brain is
>> required, a simple script 'low memory -> add mory memory' always
>> works. Asking all these script writers to think twice before issuing a
>> memory add command memory sounds like too much (to me).
>
> Pardon me, but not requiring a brain while doing something on Windows
> VMs is not really an argument...

Why? Windows (or any other OS) is just an example that things can be
done in a different way; otherwise we'll end up with arguments like "it
was always like that in Linux so it's good".

-- 
  Vitaly



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Michal Hocko
On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
> 
> > On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
> >> Michal Hocko  writes:
> >> 
> >> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> >> >> > > 
> >> >> > >- suggested RFC is not acceptable from virt point of view
> >> >> > >  as it regresses guests on top of x86 kvm/vmware which
> >> >> > >  both use ACPI based memory hotplug.
> >> >> > > 
> >> >> > >- udev/userspace solution doesn't work in practice as it's
> >> >> > >  too slow and unreliable when system is under load which
> >> >> > >  is quite common in virt usecase. That's why auto online
> >> >> > >  has been introduced in the first place.  
> >> >> > 
> >> >> > Please try to be more specific why "too slow" is a problem. Also how
> >> >> > much slower are we talking about?
> >> >>
> >> >> In virt case on host with lots VMs, userspace handler
> >> >> processing could be scheduled late enough to trigger a race
> >> >> between (guest memory going away/OOM handler) and memory
> >> >> coming online.
> >> >
> >> > Either you are mixing two things together or this doesn't really make
> >> > much sense. So is this a balloning based on memory hotplug (aka active
> >> > memory hotadd initiated between guest and host automatically) or a guest
> >> > asking for additional memory by other means (pay more for memory etc.)?
> >> > Because if this is an administrative operation then I seriously question
> >> > this reasoning.
> >> 
> >> I'm probably repeating myself but it seems this point was lost:
> >> 
> >> This is not really a 'ballooning', it is just a pure memory
> >> hotplug. People may have any tools monitoring their VM memory usage and
> >> when a VM is running low on memory they may want to hotplug more memory
> >> to it.
> >
> > What is the API those guests ask for the memory? And who is actually
> > responsible to ask for that memory? Is it a kernel or userspace
> > solution?
> 
> Whatever, this can even be a system administrator running
> 'free'.

I am pretty sure that 'free' will not give you additional memory but
let's be serious here... If this is solely about monitoring from
userspace and requesting more memory from userspace, then I would
consider the argument about a timely hotplug operation void, because there
is absolutely no guarantee that the request itself will be made in a
timely fashion.

> Hyper-V driver sends si_mem_available() and
> vm_memory_committed() metrics to the host every second and this can be
> later queried by any tool (e.g. powershell script).

And how exactly is this related to the acpi hotplug which you were
arguing needs the timely handling as well?
 
> >> With udev-style memory onlining they should be aware of page
> >> tables and other in-kernel structures which require allocation so they
> >> need to add memory slowly and gradually or they risk running into OOM
> >> (at least getting some processes killed and these processes may be
> >> important). With in-kernel memory hotplug everything happens
> >> synchronously and no 'slowly and gradually' algorithm is required in
> >> all tools which may trigger memory hotplug.
> >
> > What prevents those APIs being used reasonably and only asks so much
> > memory as they can afford? I mean 1.5% available memory necessary for
> > the hotplug is not all that much. Or more precisely what prevents to ask
> > for this additional memory in a synchronous way?
> 
> The knowledge about the fact that we need to add memory slowly and
> wait till it gets onlined is not obvious.

Yes it is, and we cannot afford to give a better experience with an
implementation that requires memory in order to online memory.

> AFAIR when you hotplug memory
> to Windows VMs there is no such thing as 'onlining', and no brain is
> required, a simple script 'low memory -> add mory memory' always
> works. Asking all these script writers to think twice before issuing a
> memory add command memory sounds like too much (to me).

Pardon me, but not requiring a brain while doing something on Windows
VMs is not really an argument...
-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
>> Michal Hocko  writes:
>> 
>> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
>> >> > > 
>> >> > >- suggested RFC is not acceptable from virt point of view
>> >> > >  as it regresses guests on top of x86 kvm/vmware which
>> >> > >  both use ACPI based memory hotplug.
>> >> > > 
>> >> > >- udev/userspace solution doesn't work in practice as it's
>> >> > >  too slow and unreliable when system is under load which
>> >> > >  is quite common in virt usecase. That's why auto online
>> >> > >  has been introduced in the first place.  
>> >> > 
>> >> > Please try to be more specific why "too slow" is a problem. Also how
>> >> > much slower are we talking about?
>> >>
>> >> In virt case on host with lots VMs, userspace handler
>> >> processing could be scheduled late enough to trigger a race
>> >> between (guest memory going away/OOM handler) and memory
>> >> coming online.
>> >
>> > Either you are mixing two things together or this doesn't really make
>> > much sense. So is this a balloning based on memory hotplug (aka active
>> > memory hotadd initiated between guest and host automatically) or a guest
>> > asking for additional memory by other means (pay more for memory etc.)?
>> > Because if this is an administrative operation then I seriously question
>> > this reasoning.
>> 
>> I'm probably repeating myself but it seems this point was lost:
>> 
>> This is not really a 'ballooning', it is just a pure memory
>> hotplug. People may have any tools monitoring their VM memory usage and
>> when a VM is running low on memory they may want to hotplug more memory
>> to it.
>
> What is the API those guests ask for the memory? And who is actually
> responsible to ask for that memory? Is it a kernel or userspace
> solution?

Whatever, this can even be a system administrator running
'free'. Hyper-V driver sends si_mem_available() and
vm_memory_committed() metrics to the host every second and this can be
later queried by any tool (e.g. powershell script).

>
>> With udev-style memory onlining they should be aware of page
>> tables and other in-kernel structures which require allocation so they
>> need to add memory slowly and gradually or they risk running into OOM
>> (at least getting some processes killed and these processes may be
>> important). With in-kernel memory hotplug everything happens
>> synchronously and no 'slowly and gradually' algorithm is required in
>> all tools which may trigger memory hotplug.
>
> What prevents those APIs being used reasonably and only asks so much
> memory as they can afford? I mean 1.5% available memory necessary for
> the hotplug is not all that much. Or more precisely what prevents to ask
> for this additional memory in a synchronous way?

The knowledge that we need to add memory slowly and
wait till it gets onlined is not obvious. AFAIR when you hotplug memory
to Windows VMs there is no such thing as 'onlining', and no brain is
required; a simple script 'low memory -> add more memory' always
works. Asking all these script writers to think twice before issuing a
memory add command sounds like too much (to me).
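
A sketch of what such a naive controller looks like on the host side
(domain name, threshold and DIMM size are invented for the example, and
virsh dommemstat only reports guest stats if the balloon driver exposes
them). The point is that it issues plain hot-add commands and relies on
the guest to cope:

  #!/bin/sh
  # naive "low memory -> add more memory" loop for one guest
  DOM=guest1
  THRESHOLD_KIB=$((256 * 1024))     # react when less than 256 MiB is unused
  N=0
  while sleep 10; do
      unused=$(virsh dommemstat "$DOM" | awk '/^unused/ {print $2}')
      [ -n "$unused" ] || continue
      if [ "$unused" -lt "$THRESHOLD_KIB" ]; then
          N=$((N + 1))
          virsh qemu-monitor-command "$DOM" --hmp "object_add memory-backend-ram,id=mem$N,size=1G"
          virsh qemu-monitor-command "$DOM" --hmp "device_add pc-dimm,id=dimm$N,memdev=mem$N"
      fi
  done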

-- 
  Vitaly



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Michal Hocko
On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
> 
> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> >> > > 
> >> > >- suggested RFC is not acceptable from virt point of view
> >> > >  as it regresses guests on top of x86 kvm/vmware which
> >> > >  both use ACPI based memory hotplug.
> >> > > 
> >> > >- udev/userspace solution doesn't work in practice as it's
> >> > >  too slow and unreliable when system is under load which
> >> > >  is quite common in virt usecase. That's why auto online
> >> > >  has been introduced in the first place.  
> >> > 
> >> > Please try to be more specific why "too slow" is a problem. Also how
> >> > much slower are we talking about?
> >>
> >> In virt case on host with lots VMs, userspace handler
> >> processing could be scheduled late enough to trigger a race
> >> between (guest memory going away/OOM handler) and memory
> >> coming online.
> >
> > Either you are mixing two things together or this doesn't really make
> > much sense. So is this a balloning based on memory hotplug (aka active
> > memory hotadd initiated between guest and host automatically) or a guest
> > asking for additional memory by other means (pay more for memory etc.)?
> > Because if this is an administrative operation then I seriously question
> > this reasoning.
> 
> I'm probably repeating myself but it seems this point was lost:
> 
> This is not really a 'ballooning', it is just a pure memory
> hotplug. People may have any tools monitoring their VM memory usage and
> when a VM is running low on memory they may want to hotplug more memory
> to it.

What is the API those guests ask for the memory? And who is actually
responsible to ask for that memory? Is it a kernel or userspace
solution?

> With udev-style memory onlining they should be aware of page
> tables and other in-kernel structures which require allocation so they
> need to add memory slowly and gradually or they risk running into OOM
> (at least getting some processes killed and these processes may be
> important). With in-kernel memory hotplug everything happens
> synchronously and no 'slowly and gradually' algorithm is required in
> all tools which may trigger memory hotplug.

What prevents those APIs from being used reasonably, asking only for as
much memory as they can afford? I mean, the ~1.5% of available memory
necessary for the hotplug is not all that much. Or more precisely, what
prevents asking for this additional memory in a synchronous way?
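
For concreteness, the ~1.5% matches the struct page overhead, assuming
the usual 64-byte struct page and 4 KiB pages (the numbers below are that
assumption, not something measured in this thread):

  # fraction of onlined memory eaten by struct pages: 64 / 4096
  echo "scale=6; 64 / 4096 * 100" | bc                    # => 1.562500 (percent)
  # struct page memory needed to hot-add 8 GiB, in MiB
  echo $(( (8 * 1024 * 1024 / 4) * 64 / 1024 / 1024 ))    # => 128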

And just to prevent further confusion: I can see why an _in-kernel_
hotplug based 'ballooning' (or whatever you call an on-demand memory
hotplug based on memory pressure) wants to be synchronous, and that is
why my patch changed those callers to online the memory explicitly. I am
questioning whether memory hotplug requested by an admin/userspace
component needs any special kernel assistance, because it is only a
shortcut which can be implemented from userspace. I hope I've made
myself clear finally.
-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
>> > > 
>> > >- suggested RFC is not acceptable from virt point of view
>> > >  as it regresses guests on top of x86 kvm/vmware which
>> > >  both use ACPI based memory hotplug.
>> > > 
>> > >- udev/userspace solution doesn't work in practice as it's
>> > >  too slow and unreliable when system is under load which
>> > >  is quite common in virt usecase. That's why auto online
>> > >  has been introduced in the first place.  
>> > 
>> > Please try to be more specific why "too slow" is a problem. Also how
>> > much slower are we talking about?
>>
>> In virt case on host with lots VMs, userspace handler
>> processing could be scheduled late enough to trigger a race
>> between (guest memory going away/OOM handler) and memory
>> coming online.
>
> Either you are mixing two things together or this doesn't really make
> much sense. So is this a balloning based on memory hotplug (aka active
> memory hotadd initiated between guest and host automatically) or a guest
> asking for additional memory by other means (pay more for memory etc.)?
> Because if this is an administrative operation then I seriously question
> this reasoning.

I'm probably repeating myself but it seems this point was lost:

This is not really a 'ballooning', it is just a pure memory
hotplug. People may have any tools monitoring their VM memory usage and
when a VM is running low on memory they may want to hotplug more memory
to it. With udev-style memory onlining they should be aware of page
tables and other in-kernel structures which require allocation so they
need to add memory slowly and gradually or they risk running into OOM
(at least getting some processes killed and these processes may be
important). With in-kernel memory hotplug everything happens
synchronously and no 'slowly and gradually' algorithm is required in
all tools which may trigger memory hotplug.

It's not about slowness, it's about being synchronous or
asynchronous. This is not related to the virtualization technology used;
the use-case is the same for all of those which support memory hotplug.

-- 
  Vitaly



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Michal Hocko
On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> On Thu, 9 Mar 2017 13:54:00 +0100
> Michal Hocko  wrote:
> 
> [...]
> > > It's major regression if you remove auto online in kernels that
> > > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > > while breaking useful functionality doesn't make sense.
> > > 
> > > I would ACK config option removal if auto online keeps working
> > > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > > and keep kernel CLI option to override default.
> > > 
> > > That doesn't mean that others will agree with flipping default,
> > > that's why config option has been added.
> > > 
> > > Now to sum up what's been discussed on this thread, there were 2
> > > different issues discussed:
> > >   1) memory hotplug: remove in kernel auto online for all
> > >  except of hyperv/xen
> > > 
> > >- suggested RFC is not acceptable from virt point of view
> > >  as it regresses guests on top of x86 kvm/vmware which
> > >  both use ACPI based memory hotplug.
> > > 
> > >- udev/userspace solution doesn't work in practice as it's
> > >  too slow and unreliable when system is under load which
> > >  is quite common in virt usecase. That's why auto online
> > >  has been introduced in the first place.  
> > 
> > Please try to be more specific why "too slow" is a problem. Also how
> > much slower are we talking about?
>
> In virt case on host with lots VMs, userspace handler
> processing could be scheduled late enough to trigger a race
> between (guest memory going away/OOM handler) and memory
> coming online.

Either you are mixing two things together or this doesn't really make
much sense. So is this ballooning based on memory hotplug (aka active
memory hotadd initiated between guest and host automatically) or a guest
asking for additional memory by other means (pay more for memory etc.)?
Because if this is an administrative operation then I seriously question
this reasoning.

[...]
> > >- currently if one wants to use online_movable,
> > >  one has to either
> > >* disable auto online in kernel OR  
> > 
> > which might not just work because an unmovable allocation could have
> > made the memblock pinned.
>
> With memhp_default_state=offline on kernel CLI there won't be any
> unmovable allocation as hotplugged memory won't be onlined and
> user can online it manually. So it works for non default usecase
> of playing with memory hot-unplug.

I was talking about the case when the auto_online was true, of course,
e.g. depending on the config option which you've said is enabled in
Fedora kernels.
  
[...] 
> > >  I'm in favor of implementing that in kernel as it keeps
> > >  kernel internals inside kernel and doesn't need
> > >  kernel API to be involved (memory blocks in sysfs,
> > >  online_kernel, online_movable)
> > >  There would be no need in userspace which would have to
> > >  deal with kernel zoo and maintain that as well.  
> > 
> > The kernel is supposed to provide a proper API and that is sysfs
> > currently. I am not entirely happy about it either but pulling a lot of
> > code into the kernel is not the right thing to do. Especially when
> > different usecases require different treatment.
>
> If it could be done from kernel side alone, it looks like a better way
> to me not to involve userspace at all. And for ACPI based x86/ARM it's
> possible to implement without adding a lot of kernel code.

But this is not how we do the kernel development. We provide the API so
that userspace can implement the appropriate policy on top. We do not
add random knobs to implement the same thing in the kernel. Different
users might want to implement different onlining strategies and that is
hardly describable by a single global knob. Just look at the s390
example provided earlier. Please try to think out of your usecase scope.

> That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> so we could continue improving kernel-only auto-onlining
> and fixing current memory hot(un)plug issues without affecting
> other platforms/users that are not interested in it.

I really do not see any reason to keep the config option. Setting
this to enabled is the _wrong_ thing to do in a general purpose
(distribution) kernel, and a kernel for a specific usecase can achieve
the same thing via the boot command line.
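
Concretely, the same default is available without the config option;
a sketch of both the boot-time and the runtime variant (the sysfs file
is the auto_online_blocks knob this thread is about):

  # boot-time default, on the kernel command line:
  #     memhp_default_state=online
  # runtime equivalent:
  echo online  > /sys/devices/system/memory/auto_online_blocks
  # and back to "offline by default":
  echo offline > /sys/devices/system/memory/auto_online_blocks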

> (PS: I don't care much about sysfs knob for setting auto-onlining,
> as kernel CLI override with memhp_default_state seems
> sufficient to me)

That is good to hear! I would be OK with keeping the kernel command line
option until we resolve all the current issues with the hotplug.

-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Igor Mammedov
On Thu, 9 Mar 2017 13:54:00 +0100
Michal Hocko  wrote:

[...]
> > It's major regression if you remove auto online in kernels that
> > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > while breaking useful functionality doesn't make sense.
> > 
> > I would ACK config option removal if auto online keeps working
> > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > and keep kernel CLI option to override default.
> > 
> > That doesn't mean that others will agree with flipping default,
> > that's why config option has been added.
> > 
> > Now to sum up what's been discussed on this thread, there were 2
> > different issues discussed:
> >   1) memory hotplug: remove in kernel auto online for all
> >  except of hyperv/xen
> > 
> >- suggested RFC is not acceptable from virt point of view
> >  as it regresses guests on top of x86 kvm/vmware which
> >  both use ACPI based memory hotplug.
> > 
> >- udev/userspace solution doesn't work in practice as it's
> >  too slow and unreliable when system is under load which
> >  is quite common in virt usecase. That's why auto online
> >  has been introduced in the first place.  
> 
> Please try to be more specific why "too slow" is a problem. Also how
> much slower are we talking about?
In the virt case, on a host with lots of VMs, the userspace handler's
processing could be scheduled late enough to trigger a race between
guest memory going away (the OOM handler kicking in) and the new
memory coming online.

>  
> >   2) memory unplug: online memory as movable
> > 
> >- doesn't work currently with udev rule due to kernel
> >  issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
> 
> These should be fixed
>  
> >- could be fixed both for in kernel auto online and udev
> >  with following patch:
> >  https://bugzilla.redhat.com/attachment.cgi?id=1146332
> >  but fixing it this way exposes zone disbalance issues,
> >  which are not present in current kernel as blocks are
> >  onlined in Zone Normal. So this is area to work and
> >  improve on.
> > 
> >- currently if one wants to use online_movable,
> >  one has to either
> >* disable auto online in kernel OR  
> 
> which might not just work because an unmovable allocation could have
> made the memblock pinned.
With memhp_default_state=offline on the kernel CLI there won't be any
unmovable allocation, as hotplugged memory won't be onlined and the
user can online it manually. So it works for the non-default usecase
of playing with memory hot-unplug.
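
For reference, the manual flow described above is roughly (block numbers
32-39 are the ones from the reproducer quoted elsewhere in this thread;
onlining in reverse order is what keeps each block adjacent to
ZONE_MOVABLE so the zone-shift check passes):

  # booted with memhp_default_state=offline, DIMM hot-added, then:
  for i in $(seq 39 -1 32); do
      echo online_movable > /sys/devices/system/memory/memory$i/state
  done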
 
> >* remove udev rule that distro ships
> >  AND write custom daemon that will be able to online
> >  block in right zone/order. So currently whole
> >  online_movable thing isn't working by default
> >  regardless of who onlines memory.  
> 
> my experience with onlining full nodes as movable shows this works just
> fine (with all the limitations of the movable zones but that is a
> separate thing). I haven't played with configurations where movable
> zones are sharing the node with other zones.
I don't have access to such a baremetal configuration to play
with anymore.


> >  I'm in favor of implementing that in kernel as it keeps
> >  kernel internals inside kernel and doesn't need
> >  kernel API to be involved (memory blocks in sysfs,
> >  online_kernel, online_movable)
> >  There would be no need in userspace which would have to
> >  deal with kernel zoo and maintain that as well.  
> 
> The kernel is supposed to provide a proper API and that is sysfs
> currently. I am not entirely happy about it either but pulling a lot of
> code into the kernel is not the right thing to do. Especially when
> different usecases require different treatment.
If it could be done from the kernel side alone, it looks like a better way
to me not to involve userspace at all. And for ACPI-based x86/ARM it's
possible to implement without adding a lot of kernel code.
That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
so we could continue improving kernel-only auto-onlining
and fixing current memory hot(un)plug issues without affecting
other platforms/users that are not interested in it.
(PS: I don't care much about the sysfs knob for setting auto-onlining,
as the kernel CLI override with memhp_default_state seems
sufficient to me)



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-10 Thread Daniel Kiper
Hey,

On Mon, Mar 06, 2017 at 03:54:17PM +0100, Michal Hocko wrote:

[...]

> So let's discuss the current memory hotplug shortcomings and get rid of
> the crud which developed on top. I will start by splitting up the patch
> into 3 parts. Do the auto online thing from the HyperV and xen balloning
> drivers and dropping the config option and finally drop the sysfs knob.
> The last patch might be NAKed and I can live with that as long as the
> reasoning is proper and there is a general consensus on that.

If it is not too late please CC me on the patches and relevant threads.

Daniel



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-09 Thread Michal Hocko
On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> On Mon, 6 Mar 2017 15:54:17 +0100
> Michal Hocko  wrote:
> 
> > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
[...]
> > > in current mainline kernel it triggers following code path:
> > > 
> > > online_pages()
> > >   ...
> > >    if (online_type == MMOP_ONLINE_KERNEL) {
> > >        if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
> > >            return -EINVAL;
> > 
> > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> pretty much, reproducer is above so try and see for yourself

I will play with this...
 
[...]
> > > get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e,
> > > MAINTAINERS should be fixed if linux-api were to be CCed.  
> > 
> > user visible APIs _should_ be discussed at this mailing list regardless
> > what get_maintainer.pl says. This is not about who is the maintainer but
> > about getting as wide audience for things that would have to be
> > maintained basically for ever.
>
> How would random contributor know which list to CC?

This should have been brought up during the review process which was
less than sufficient in this case.

> > > > So unless this causes a major regression which would be hard to fix I
> > > > will submit the patch for inclusion.  
> > > it will be a major regression due to lack of daemon that
> > > could online fast and can't be killed on OOM. So this
> > > clean up patch does break used feature without providing
> > > a viable alternative.  
> > 
> > So let's discuss the current memory hotplug shortcomings and get rid of
> > the crud which developed on top. I will start by splitting up the patch
> > into 3 parts. Do the auto online thing from the HyperV and xen balloning
> > drivers and dropping the config option and finally drop the sysfs knob.
> > The last patch might be NAKed and I can live with that as long as the
> > reasoning is proper and there is a general consensus on that.
> PS: CC me on that patches too
> 
> It's major regression if you remove auto online in kernels that
> run on top of x86 kvm/vmware hypervisors, making API cleanups
> while breaking useful functionality doesn't make sense.
> 
> I would ACK config option removal if auto online keeps working
> for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> and keep kernel CLI option to override default.
> 
> That doesn't mean that others will agree with flipping default,
> that's why config option has been added.
> 
> Now to sum up what's been discussed on this thread, there were 2
> different issues discussed:
>   1) memory hotplug: remove in kernel auto online for all
>  except of hyperv/xen
> 
>- suggested RFC is not acceptable from virt point of view
>  as it regresses guests on top of x86 kvm/vmware which
>  both use ACPI based memory hotplug.
> 
>- udev/userspace solution doesn't work in practice as it's
>  too slow and unreliable when system is under load which
>  is quite common in virt usecase. That's why auto online
>  has been introduced in the first place.

Please try to be more specific why "too slow" is a problem. Also how
much slower are we talking about?
 
>   2) memory unplug: online memory as movable
> 
>- doesn't work currently with udev rule due to kernel
>  issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

These should be fixed
 
>- could be fixed both for in kernel auto online and udev
>  with following patch:
>  https://bugzilla.redhat.com/attachment.cgi?id=1146332
>  but fixing it this way exposes zone disbalance issues,
>  which are not present in current kernel as blocks are
>  onlined in Zone Normal. So this is area to work and
>  improve on.
> 
>- currently if one wants to use online_movable,
>  one has to either
>* disable auto online in kernel OR

which might simply not work because an unmovable allocation could have
pinned the memblock.

>* remove udev rule that distro ships
>  AND write custom daemon that will be able to online
>  block in right zone/order. So currently whole
>  online_movable thing isn't working by default
>  regardless of who onlines memory.

My experience with onlining full nodes as movable shows this works just
fine (with all the limitations of the movable zones but that is a
separate thing). I haven't played with configurations where movable
zones are sharing the node with other zones.

>  I'm in favor of implementing that in kernel as it keeps
>  kernel internals inside kernel and doesn't need
>  kernel API to be involved (memory blocks in sysfs,
>  online_kernel, online_movable)
>  There would be no need in userspace which would have to
>  deal with kernel zoo and maintain that as well.

Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-07 Thread Igor Mammedov
On Mon, 6 Mar 2017 15:54:17 +0100
Michal Hocko  wrote:

> On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> > On Fri, 3 Mar 2017 09:27:23 +0100
> > Michal Hocko  wrote:
> >   
> > > On Thu 02-03-17 18:03:15, Igor Mammedov wrote:  
> > > > On Thu, 2 Mar 2017 15:28:16 +0100
> > > > Michal Hocko  wrote:
> > > > 
> > > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]

> > > > > memblocks. If that doesn't work I would call it a bug.
> > > > It's rather an implementation constrain than a bug
> > > > for details and workaround patch see
> > > >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
> > > 
> > > "You are not authorized to access bug #1314306"  
> > Sorry,
> > I've made it public, related comments and patch should be accessible now
> > (code snippets in BZ are based on older kernel but logic is still the same 
> > upstream)
> >
> > > could you paste the reasoning here please?  
> > sure here is reproducer:
> > start VM with CLI:
> >   qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
> >   -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
> >   /path/to/guest_image
> > 
> > then in guest dimm1 blocks are from 32-39
> > 
> >   echo online_movable > /sys/devices/system/memory/memory32/state
> > -bash: echo: write error: Invalid argument
> > 
> > in current mainline kernel it triggers following code path:
> > 
> > online_pages()
> >   ...
> >    if (online_type == MMOP_ONLINE_KERNEL) {
> >        if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
> >            return -EINVAL;
> 
> Are you sure? I would expect MMOP_ONLINE_MOVABLE here
Pretty much; the reproducer is above, so try and see for yourself.

[...]
> [...]
> > > > > > Which means simple udev rule isn't usable since it gets event from
> > > > > > the first to the last hotplugged block order. So now we would have
> > > > > > to write a daemon that would
> > > > > >  - watch for all blocks in hotplugged memory appear (how would it 
> > > > > > know)
> > > > > >  - online them in right order (order might also be different 
> > > > > > depending
> > > > > >on kernel version)
> > > > > >-- it becomes even more complicated in NUMA case when there are
> > > > > >   multiple zones and kernel would have to provide user-space
> > > > > >   with information about zone maps
> > > > > > 
> > > > > > In short current experience shows that userspace approach
> > > > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > > > >fast and/or under memory pressure) when udev (or something else
> > > > > >might be killed)  
> > > > > 
> > > > > yeah and that is why the patch does the onlining from the kernel.
> > > > onlining in this patch is limited to hyperv and patch breaks
> > > > auto-online on x86 kvm/vmware/baremetal as they reuse the same
> > > > hotplug path.
> > > 
> > > Those can use the udev or do you see any reason why they couldn't?  
> >
> > Reasons are above, under  and >> quotations, patch breaks
> > what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some
> > user-space process could be killed if hotplugged memory isn't onlined
> > fast enough leading to service termination and/or memory not
> > being onlined at all (if udev is killed)  
> 
> OK, so from the discussion so far I have learned that this would be
> problem _only_ if we are trying to hotplug a _lot_ of memory at once
> (~1.5% of the online memory is needed).  I am kind of skeptical this is
> a reasonable usecase. Who is going to hotadd 8G to 256M machine (which
> would eat half of the available memory which is still quite far from
> OOM)? Even if the memory balloning uses hotplug then such a grow sounds
> a bit excessive.
The slow and killable udev issue doesn't really depend on the
amount of hotplugged memory since it's onlined in blocks
(128M on x86_64). Considering that it's currently onlined
as ZONE_NORMAL, the kernel doesn't have any issues adding more
follow-up blocks of memory.
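
The block granularity is visible in sysfs, e.g. (the value shown is the
usual x86_64 one; the file reports hex bytes):

  cat /sys/devices/system/memory/block_size_bytes
  8000000        # 0x8000000 = 128M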

> > Currently udev rule is not usable and one needs a daemon
> > which would correctly do onlining and keep zone balance
> > even for simple case usecase of 1 normal and 1 movable zone.
> > And it gets more complicated in case of multiple numa nodes
> > with multiple zones.  
> 
> That sounds to be more related to the current implementation than
> anything else and as such is not a reason to invent specific user
> visible api. Btw. you are talking about movable zones byt the auto
> onlining doesn't allow to auto online movable memory. So either I miss
> your point or I am utterly confused.
In the current state neither does the udev rule, as memory is onlined
as NORMAL (on x86_64 at least), which is the same as auto online
does now.

We are discussing 2 different issues here and the thread got pretty
hard to follow. I'll try to sum up the results 

Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-06 Thread Michal Hocko
On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> On Fri, 3 Mar 2017 09:27:23 +0100
> Michal Hocko  wrote:
> 
> > On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > > On Thu, 2 Mar 2017 15:28:16 +0100
> > > Michal Hocko  wrote:
> > >   
> > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > > > [...]  
> > > > > When trying to support memory unplug on guest side in RHEL7,
> > > > > experience shows otherwise. Simplistic udev rule which onlines
> > > > > added block doesn't work in case one wants to online it as movable.
> > > > > 
> > > > > Hotplugged blocks in current kernel should be onlined in reverse
> > > > > order to online blocks as movable depending on adjacent blocks zone.  
> > > > >   
> > > > 
> > > > Could you be more specific please? Setting online_movable from the udev
> > > > rule should just work regardless of the ordering or the state of other
> > > > memblocks. If that doesn't work I would call it a bug.  
> > > It's rather an implementation constrain than a bug
> > > for details and workaround patch see
> > >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
> > 
> > "You are not authorized to access bug #1314306"
> Sorry,
> I've made it public, related comments and patch should be accessible now
> (code snippets in BZ are based on older kernel but logic is still the same 
> upstream)
>  
> > could you paste the reasoning here please?
> sure here is reproducer:
> start VM with CLI:
>   qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
>   -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
>   /path/to/guest_image
> 
> then in guest dimm1 blocks are from 32-39
> 
>   echo online_movable > /sys/devices/system/memory/memory32/state
> -bash: echo: write error: Invalid argument
> 
> in current mainline kernel it triggers following code path:
> 
> online_pages()
>   ...
>    if (online_type == MMOP_ONLINE_KERNEL) {
>        if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
>            return -EINVAL;

Are you sure? I would expect MMOP_ONLINE_MOVABLE here

>   zone_can_shift()
> ...
>     if (idx < target) {
>         /* pages must be at end of current zone */
>         if (pfn + nr_pages != zone_end_pfn(zone))
>             return false;
> 
> since we are trying to online as movable not the last section in
> ZONE_NORMAL.
> 
> Here is what makes hotplugged memory end up in ZONE_NORMAL:
>  acpi_memory_enable_device() -> add_memory -> add_memory_resource ->
>-> arch/x86/mm/init_64.c  
> 
>  /*
>   * Memory is added always to NORMAL zone. This means you will never get
>   * additional DMA/DMA32 memory.
>   */
>  int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  {
> ...
> struct zone *zone = pgdat->node_zones +
> zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
> 
> i.e. all hot-plugged memory modules always go to ZONE_NORMAL
> and only the first/last block in zone is allowed to be moved
> to another zone. Patch [1] tries to fix issue by assigning
> removable memory resource to movable zone so hotplugged+removable
> blocks look like:
>   movable normal, movable, movable
> instead of current:
>   normal, normal, normal movable

Hmm, this code is confusing and clear as mud. I have to stare at it some
more, but AFAIK zones shouldn't have problems with holes, so the only
thing we have to guarantee is that different zones do not overlap. So
this smells like a bug rather than an inherent implementation
limitation.

[...]
> > > > > Which means simple udev rule isn't usable since it gets event from
> > > > > the first to the last hotplugged block order. So now we would have
> > > > > to write a daemon that would
> > > > >  - watch for all blocks in hotplugged memory appear (how would it 
> > > > > know)
> > > > >  - online them in right order (order might also be different depending
> > > > >on kernel version)
> > > > >-- it becomes even more complicated in NUMA case when there are
> > > > >   multiple zones and kernel would have to provide user-space
> > > > >   with information about zone maps
> > > > > 
> > > > > In short current experience shows that userspace approach
> > > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > > >fast and/or under memory pressure) when udev (or something else
> > > > >might be killed)
> > > > 
> > > > yeah and that is why the patch does the onlining from the kernel.  
> > > onlining in this patch is limited to hyperv and patch breaks
> > > auto-online on x86 kvm/vmware/baremetal as they reuse the same
> > > hotplug path.  
> > 
> > Those can use the udev or do you see any reason why they couldn't?

Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-03 Thread Igor Mammedov
On Fri, 3 Mar 2017 09:27:23 +0100
Michal Hocko  wrote:

> On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > On Thu, 2 Mar 2017 15:28:16 +0100
> > Michal Hocko  wrote:
> >   
> > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > > [...]  
> > > > When trying to support memory unplug on guest side in RHEL7,
> > > > experience shows otherwise. Simplistic udev rule which onlines
> > > > added block doesn't work in case one wants to online it as movable.
> > > > 
> > > > Hotplugged blocks in current kernel should be onlined in reverse
> > > > order to online blocks as movable depending on adjacent blocks zone.
> > > 
> > > Could you be more specific please? Setting online_movable from the udev
> > > rule should just work regardless of the ordering or the state of other
> > > memblocks. If that doesn't work I would call it a bug.  
> > It's rather an implementation constrain than a bug
> > for details and workaround patch see
> >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
> 
> "You are not authorized to access bug #1314306"
Sorry,
I've made it public, related comments and patch should be accessible now
(code snippets in BZ are based on an older kernel but the logic is still the same
upstream)
 
> could you paste the reasoning here please?
sure here is reproducer:
start VM with CLI:
  qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
  -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
  /path/to/guest_image

then in guest dimm1 blocks are from 32-39

  echo online_movable > /sys/devices/system/memory/memory32/state
-bash: echo: write error: Invalid argument

in current mainline kernel it triggers following code path:

online_pages()
  ...
   if (online_type == MMOP_ONLINE_KERNEL) {
       if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
           return -EINVAL;

  zone_can_shift()
    ...
    if (idx < target) {
        /* pages must be at end of current zone */
        if (pfn + nr_pages != zone_end_pfn(zone))
            return false;

since we are trying to online as movable a section that is not the
last section in ZONE_NORMAL.

Here is what makes hotplugged memory end up in ZONE_NORMAL:
 acpi_memory_enable_device() -> add_memory -> add_memory_resource ->
   -> arch/x86/mm/init_64.c  

 /*
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
...
struct zone *zone = pgdat->node_zones +
zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);

i.e. all hot-plugged memory modules always go to ZONE_NORMAL
and only the first/last block in a zone is allowed to be moved
to another zone. Patch [1] tries to fix the issue by assigning the
removable memory resource to the movable zone so hotplugged+removable
blocks look like:
  movable normal, movable, movable
instead of current:
  normal, normal, normal movable

but then, with this fixed as suggested, auto online by default
should work just fine in a kernel with normal and movable zones
without any need for user-space.
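
A quick way to see this constraint on a running guest is the per-block
valid_zones attribute (present on kernels of this vintage); with the
current arch_add_memory() behaviour only the block at the end of
ZONE_NORMAL still offers Movable, which is why memory32 above rejects
online_movable:

  for i in $(seq 32 39); do
      printf 'memory%s: ' "$i"
      cat /sys/devices/system/memory/memory$i/valid_zones
  done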

> > patch attached there is limited by another memory hotplug
> > issue, which is NORMAL/MOVABLE zone balance, if kernel runs
> > on configuration where the most of memory is hot-removable
> > kernel might experience lack of memory in zone NORMAL.  
> 
> yes and that is an inherent problem of movable memory.
> 
> > > > Which means simple udev rule isn't usable since it gets event from
> > > > the first to the last hotplugged block order. So now we would have
> > > > to write a daemon that would
> > > >  - watch for all blocks in hotplugged memory appear (how would it know)
> > > >  - online them in right order (order might also be different depending
> > > >on kernel version)
> > > >-- it becomes even more complicated in NUMA case when there are
> > > >   multiple zones and kernel would have to provide user-space
> > > >   with information about zone maps
> > > > 
> > > > In short current experience shows that userspace approach
> > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > >fast and/or under memory pressure) when udev (or something else
> > > >might be killed)
> > > 
> > > yeah and that is why the patch does the onlining from the kernel.  
> > onlining in this patch is limited to hyperv and patch breaks
> > auto-online on x86 kvm/vmware/baremetal as they reuse the same
> > hotplug path.  
> 
> Those can use the udev or do you see any reason why they couldn't?
Reasons are above, under  and >> quotations; the patch breaks
what Vitaly's fixed (including the kvm/vmware usecases), i.e. udev/some
user-space process could be killed if hotplugged memory isn't onlined
fast enough, leading to service termination and/or memory not being
onlined at all (if udev is killed).

Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-03 Thread Michal Hocko
On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> On Thu, 2 Mar 2017 15:28:16 +0100
> Michal Hocko  wrote:
> 
> > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > [...]
> > > When trying to support memory unplug on guest side in RHEL7,
> > > experience shows otherwise. Simplistic udev rule which onlines
> > > added block doesn't work in case one wants to online it as movable.
> > > 
> > > Hotplugged blocks in current kernel should be onlined in reverse
> > > order to online blocks as movable depending on adjacent blocks zone.  
> > 
> > Could you be more specific please? Setting online_movable from the udev
> > rule should just work regardless of the ordering or the state of other
> > memblocks. If that doesn't work I would call it a bug.
> It's rather an implementation constrain than a bug
> for details and workaround patch see
>  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

"You are not authorized to access bug #1314306"

could you paste the reasoning here please?

> patch attached there is limited by another memory hotplug
> issue, which is NORMAL/MOVABLE zone balance, if kernel runs
> on configuration where the most of memory is hot-removable
> kernel might experience lack of memory in zone NORMAL.

yes and that is an inherent problem of movable memory.

> > > Which means simple udev rule isn't usable since it gets event from
> > > the first to the last hotplugged block order. So now we would have
> > > to write a daemon that would
> > >  - watch for all blocks in hotplugged memory appear (how would it know)
> > >  - online them in right order (order might also be different depending
> > >on kernel version)
> > >-- it becomes even more complicated in NUMA case when there are
> > >   multiple zones and kernel would have to provide user-space
> > >   with information about zone maps
> > > 
> > > In short current experience shows that userspace approach
> > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > >fast and/or under memory pressure) when udev (or something else
> > >might be killed)  
> > 
> > yeah and that is why the patch does the onlining from the kernel.
> onlining in this patch is limited to hyperv and patch breaks
> auto-online on x86 kvm/vmware/baremetal as they reuse the same
> hotplug path.

Those can use the udev or do you see any reason why they couldn't?

> > > > Can you imagine any situation when somebody actually might want to have
> > > > this knob enabled? From what I understand it doesn't seem to be the
> > > > case.  
> > > For x86:
> > >  * this config option is enabled by default in recent Fedora,  
> > 
> > How do you want to support usecases which really want to online memory
> > as movable? Do you expect those users to disable the option because
> > unless I am missing something the in kernel auto onlining only supports
> > regular onlining.
>
> current auto onlining config option does what it's been designed for,
> i.e. it onlines hotplugged memory.
> It's possible for non average Fedora user to override default
> (commit 86dd995d6) if she/he needs non default behavior
> (i.e. user knows how to online manually and/or can write
> a daemon that would handle all of nuances of kernel in use).
> 
> For the rest when Fedora is used in cloud and user increases memory
> via management interface of whatever cloud she/he uses, it just works.
> 
> So it's choice of distribution to pick its own default that makes
> majority of user-base happy and this patch removes it without taking
> that in consideration.

You can still have a udev rule to achieve the same thing for
non-ballooning based hotplug.

> How to online memory is different issue not related to this patch,
> current default onlining as ZONE_NORMAL works well for scaling
> up VMs.
> 
> Memory unplug is rather new and it doesn't work reliably so far,
> moving onlining to user-space won't really help. Further work
> needs to be done so that it would work reliably.

The main problem I have with this is that it is a limited, usecase-driven
configuration knob which doesn't work properly for other usecases (namely
movable onlining once your distribution chooses to set the config option
to auto online). There is a userspace solution for this, so this shouldn't
have been merged in the first place! It sneaked past the proper review
process (linux-api wasn't CC'd to get broader attention), which is really
sad.

So unless this causes a major regression which would be hard to fix I
will submit the patch for inclusion.
-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-02 Thread Igor Mammedov
On Thu, 2 Mar 2017 15:28:16 +0100
Michal Hocko  wrote:

> On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> [...]
> > When trying to support memory unplug on guest side in RHEL7,
> > experience shows otherwise. Simplistic udev rule which onlines
> > added block doesn't work in case one wants to online it as movable.
> > 
> > Hotplugged blocks in current kernel should be onlined in reverse
> > order to online blocks as movable depending on adjacent blocks zone.  
> 
> Could you be more specific please? Setting online_movable from the udev
> rule should just work regardless of the ordering or the state of other
> memblocks. If that doesn't work I would call it a bug.
It's rather an implementation constraint than a bug;
for details and a workaround patch see
 [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
The patch attached there is limited by another memory hotplug
issue, which is the NORMAL/MOVABLE zone balance: if the kernel runs
on a configuration where most of the memory is hot-removable, the
kernel might experience a lack of memory in zone NORMAL.

> 
> > Which means simple udev rule isn't usable since it gets event from
> > the first to the last hotplugged block order. So now we would have
> > to write a daemon that would
> >  - watch for all blocks in hotplugged memory appear (how would it know)
> >  - online them in right order (order might also be different depending
> >on kernel version)
> >-- it becomes even more complicated in NUMA case when there are
> >   multiple zones and kernel would have to provide user-space
> >   with information about zone maps
> > 
> > In short current experience shows that userspace approach
> >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> >fast and/or under memory pressure) when udev (or something else
> >might be killed)  
> 
> yeah and that is why the patch does the onlining from the kernel.
The onlining in this patch is limited to hyperv, and the patch breaks
auto-online on x86 kvm/vmware/baremetal as they reuse the same
hotplug path.

> > > Can you imagine any situation when somebody actually might want to have
> > > this knob enabled? From what I understand it doesn't seem to be the
> > > case.  
> > For x86:
> >  * this config option is enabled by default in recent Fedora,  
> 
> How do you want to support usecases which really want to online memory
> as movable? Do you expect those users to disable the option because
> > unless I am missing something the in kernel auto onlining only supports
> regular onlining.
The current auto-onlining config option does what it's been designed for,
i.e. it onlines hotplugged memory.
It's possible for a non-average Fedora user to override the default
(commit 86dd995d6) if she/he needs non-default behavior
(i.e. the user knows how to online manually and/or can write
a daemon that would handle all the nuances of the kernel in use).

For the rest, when Fedora is used in the cloud and the user increases memory
via the management interface of whatever cloud she/he uses, it just works.

So it's the distribution's choice to pick the default that makes the
majority of its user base happy, and this patch removes that choice
without taking it into consideration.

How to online memory is a different issue not related to this patch;
the current default of onlining as ZONE_NORMAL works well for scaling
up VMs.

Memory unplug is rather new and it doesn't work reliably so far;
moving onlining to user-space won't really help. Further work
needs to be done so that it works reliably.

Now, about the question of onlining removable memory as movable:
the x86 kernel is able to get information from the ACPI subsystem about
whether hot-added memory is removable, and can online it as movable
without any intervention from user-space, where this is hard to do,
as patch [1] shows.
The problem is still being researched, and when we figure out how to fix the
hot-remove issues we might enable auto-onlining by default for x86.
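
To make that idea concrete, a rough sketch (this is not the actual patch [1];
the acpi_memory_is_hot_removable() helper and the extra add_memory() argument
are hypothetical here, while the MMOP_* values are the ones online_pages()
already understands):

	/*
	 * Hypothetical sketch: if ACPI reports the hot-added range as
	 * removable, request movable onlining, otherwise keep the
	 * default zone.
	 */
	bool removable = acpi_memory_is_hot_removable(mem_device);

	result = add_memory(node, info->start_addr, info->length,
			    removable ? MMOP_ONLINE_MOVABLE : MMOP_ONLINE_KEEP);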



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-02 Thread Michal Hocko
On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]
> When trying to support memory unplug on guest side in RHEL7,
> experience shows otherwise. Simplistic udev rule which onlines
> added block doesn't work in case one wants to online it as movable.
> 
> Hotplugged blocks in current kernel should be onlined in reverse
> order to online blocks as movable depending on adjacent blocks zone.

Could you be more specific please? Setting online_movable from the udev
rule should just work regardless of the ordering or the state of other
memblocks. If that doesn't work I would call it a bug.

> Which means simple udev rule isn't usable since it gets event from
> the first to the last hotplugged block order. So now we would have
> to write a daemon that would
>  - watch for all blocks in hotplugged memory appear (how would it know)
>  - online them in right order (order might also be different depending
>on kernel version)
>-- it becomes even more complicated in NUMA case when there are
>   multiple zones and kernel would have to provide user-space
>   with information about zone maps
> 
> In short current experience shows that userspace approach
>  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
>fast and/or under memory pressure) when udev (or something else
>might be killed)

yeah and that is why the patch does the onlining from the kernel.
 
> > Can you imagine any situation when somebody actually might want to have
> > this knob enabled? From what I understand it doesn't seem to be the
> > case.
> For x86:
>  * this config option is enabled by default in recent Fedora,

How do you want to support usecases which really want to online memory
as movable? Do you expect those users to disable the option? Because,
unless I am missing something, the in-kernel auto onlining only supports
regular onlining.
-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-02 Thread Igor Mammedov
On Mon 27-02-17 16:43:04, Michal Hocko wrote:
> On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:  
> > > A couple of other thoughts:
> > > 1) Having all newly added memory online ASAP is probably what people
> > > want for all virtual machines.  
> > 
> > This is not true for s390. On s390 we have "standby" memory that a guest
> > sees and potentially may use if it sets it online. Every guest that sets
> > memory offline contributes to the hypervisor's standby memory pool, while
> > onlining standby memory takes memory away from the standby pool.
> > 
> > The use-case is that a system administrator in advance knows the maximum
> > size a guest will ever have and also defines how much memory should be used
> > at boot time. The difference is standby memory.
> > 
> > Auto-onlining of standby memory is the last thing we want.
I don't know much about anything other than x86, so all comments
below are from that point of view;
architectures that don't need auto online can keep the current default.

> > > Unfortunately, we have additional complexity with memory zones
> > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > > required. Especially, when further unplug is expected.  
> > 
> > This also is a reason why auto-onlining doesn't seem be the best way.  

When trying to support memory unplug on the guest side in RHEL7,
experience shows otherwise. A simplistic udev rule which onlines every
added block doesn't work in case one wants to online it as movable.

In the current kernel, hotplugged blocks have to be onlined in reverse
order to be onlined as movable, because whether a block may go movable
depends on the adjacent blocks' zone.
Which means a simple udev rule isn't usable, since it gets the events in
first-to-last hotplugged block order. So now we would have
to write a daemon that would
 - watch for all blocks in hotplugged memory to appear (how would it know?)
 - online them in the right order (the order might also be different
   depending on the kernel version)
   -- it becomes even more complicated in the NUMA case, when there are
      multiple zones and the kernel would have to provide user-space
      with information about zone maps

In short, current experience shows that the userspace approach
 - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
   fast and/or under memory pressure, when udev or whatever else does
   the onlining might be killed)
 - doesn't reduce overall system complexity; it only gets worse,
   as the user-space handler needs to know a lot about kernel internals
   and implementation details/kernel versions to work properly

It might not be easy, but doing the onlining in the kernel, on the other
hand, is:
 - faster
 - more reliable (can't be killed under memory pressure)
 - the kernel has access to all the info needed for onlining and knows how
   its internals work, so it can implement auto-online correctly
 - since there is no need to maintain an ABI for user-space
   (zones layout/ordering/maybe something else), the kernel is
   free to change its internal implementation without breaking userspace
   (currently hotplug+unplug doesn't work reliably and we might
   need something more flexible than zones)
That's the direction of the research in progress, i.e. making the kernel
implementation better instead of putting the responsibility on
user-space to deal with kernel shortcomings.

> Can you imagine any situation when somebody actually might want to have
> this knob enabled? From what I understand it doesn't seem to be the
> case.
For x86:
 * this config option is enabled by default in recent Fedora,
 * RHEL6 has shipped similar downstream patches doing the same thing for years
 * RHEL7 has a udev rule (because there wasn't a kernel-side solution at fork
   time) that auto-onlines memory unconditionally; Vitaly might backport it
   later when he has time.
It's not the Linux kernel, but an auto online policy is also used by Windows,
both in bare-metal and guest configurations.

That somewhat shows that the current upstream defaults on x86
might not be what end-users wish for.

When auto_online_blocks was introduced, Vitaly was conservative
and left the current upstream defaults where they were, lest it
break someone else's setup, while allowing downstreams to set their
own auto-online policy; eventually we might switch the default
upstream too.



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-28 Thread Heiko Carstens
On Mon, Feb 27, 2017 at 04:43:04PM +0100, Michal Hocko wrote:
> On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> > > A couple of other thoughts:
> > > 1) Having all newly added memory online ASAP is probably what people
> > > want for all virtual machines.
> > 
> > This is not true for s390. On s390 we have "standby" memory that a guest
> > sees and potentially may use if it sets it online. Every guest that sets
> > memory offline contributes to the hypervisor's standby memory pool, while
> > onlining standby memory takes memory away from the standby pool.
> > 
> > The use-case is that a system administrator in advance knows the maximum
> > size a guest will ever have and also defines how much memory should be used
> > at boot time. The difference is standby memory.
> > 
> > Auto-onlining of standby memory is the last thing we want.
> > 
> > > Unfortunately, we have additional complexity with memory zones
> > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > > required. Especially, when further unplug is expected.
> > 
> > This also is a reason why auto-onlining doesn't seem be the best way.
> 
> Can you imagine any situation when somebody actually might want to have
> this knob enabled? From what I understand it doesn't seem to be the
> case.

I can only speak for s390, and at least here I think auto-online is always
wrong, especially if you consider the added complexity that you may want to
online memory sometimes to ZONE_NORMAL and sometimes to ZONE_MOVABLE.




Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Michal Hocko
On Mon 27-02-17 11:28:52, Reza Arbab wrote:
> On Mon, Feb 27, 2017 at 10:28:17AM +0100, Michal Hocko wrote:
> >diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> >index 134a2f69c21a..a72f7f64ee26 100644
> >--- a/include/linux/memory_hotplug.h
> >+++ b/include/linux/memory_hotplug.h
> >@@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page);
> >
> >extern int try_online_node(int nid);
> >
> >-extern bool memhp_auto_online;
> >-
> >#ifdef CONFIG_MEMORY_HOTREMOVE
> >extern bool is_pageblock_removable_nolock(struct page *page);
> >extern int arch_remove_memory(u64 start, u64 size);
> >@@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
> >
> >extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> > void *arg, int (*func)(struct memory_block *, void *));
> >-extern int add_memory(int nid, u64 start, u64 size);
> >+extern int add_memory(int nid, u64 start, u64 size, bool online);
> >extern int add_memory_resource(int nid, struct resource *resource, bool online);
> >extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
> > bool for_device);
> 
> It would be nice if instead of a 'bool online' argument, add_memory() and
> add_memory_resource() took an 'int online_type', ala online_pages().
> 
> That way we could specify offline, online, online+movable, etc.

Sure that would require more changes though and as such it is out of
scope of this patch. But you are right, this is a logical follow up
step.
-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Reza Arbab
On Mon, Feb 27, 2017 at 10:28:17AM +0100, Michal Hocko wrote: 

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f69c21a..a72f7f64ee26 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page);

extern int try_online_node(int nid);

-extern bool memhp_auto_online;
-
#ifdef CONFIG_MEMORY_HOTREMOVE
extern bool is_pageblock_removable_nolock(struct page *page);
extern int arch_remove_memory(u64 start, u64 size);
@@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}

extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *));
-extern int add_memory(int nid, u64 start, u64 size);
+extern int add_memory(int nid, u64 start, u64 size, bool online);
extern int add_memory_resource(int nid, struct resource *resource, bool online);
extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
bool for_device);


It would be nice if instead of a 'bool online' argument, add_memory() 
and add_memory_resource() took an 'int online_type', ala online_pages().


That way we could specify offline, online, online+movable, etc.
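
For reference, online_pages() already takes such a type; the values in this
kernel are (roughly) the following, so passing one of them through
add_memory()/add_memory_resource() would cover all of those cases:

/* from include/linux/memory_hotplug.h (approximate, for reference) */
enum {
	MMOP_OFFLINE = -1,	/* offline the block */
	MMOP_ONLINE_KEEP,	/* online, keep the default zone */
	MMOP_ONLINE_KERNEL,	/* online to ZONE_NORMAL */
	MMOP_ONLINE_MOVABLE,	/* online to ZONE_MOVABLE */
};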

--
Reza Arbab




Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Michal Hocko
On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> > A couple of other thoughts:
> > 1) Having all newly added memory online ASAP is probably what people
> > want for all virtual machines.
> 
> This is not true for s390. On s390 we have "standby" memory that a guest
> sees and potentially may use if it sets it online. Every guest that sets
> memory offline contributes to the hypervisor's standby memory pool, while
> onlining standby memory takes memory away from the standby pool.
> 
> The use-case is that a system administrator in advance knows the maximum
> size a guest will ever have and also defines how much memory should be used
> at boot time. The difference is standby memory.
> 
> Auto-onlining of standby memory is the last thing we want.
> 
> > Unfortunately, we have additional complexity with memory zones
> > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > required. Especially, when further unplug is expected.
> 
> This also is a reason why auto-onlining doesn't seem be the best way.

Can you imagine any situation when somebody actually might want to have
this knob enabled? From what I understand it doesn't seem to be the
case.

-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Mon 27-02-17 11:49:43, Vitaly Kuznetsov wrote:
>> Michal Hocko  writes:
>> 
>> > On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
>> > [...]
>> >> I don't have anything new to add to the discussion happened last week
>> >> but I'd like to summarize my arguments against this change:
>> >> 
>> >> 1) This patch doesn't solve any issue. Configuration option is not an
>> >> issue by itself, it is an option for distros to decide what they want to
>> >> ship: udev rule with known issues (legacy mode) or enable the new
>> >> option. Distro makers and users building their kernels should be able to
>> >> answer this simple question "do you want to automatically online all
>> >> newly added memory or not".
>> >
>> > OK, so could you be more specific? Distributions have no clue about
>> > which HW their kernel runs on so how can they possibly make a sensible
>> > decision here?
>> 
>> They at least have an idea if they ship udev rule or not. I can also
>> imagine different choices for non-x86 architectures but I don't know
>> enough about them to have an opinion.
>
> I really do not follow. If they know whether they ship the udev rule
> then why do they need a kernel help at all? Anyway this global policy
> actually breaks some usecases. Say you would have a default set to
> online. What should user do if _some_ nodes should be online_movable?
> Or, say that HyperV or other hotplug based ballooning implementation
> really wants to online the movable memory in order to have a reliable
> hotremove. Now you have a global policy which goes against it.
>

While I think that hotremove is a special case which really requires
manual intervention (at least to decide which memory goes NORMAL and
which MOVABLE), MEMORY_HOTPLUG_DEFAULT_ONLINE is probably not for it.

[snip]

>
>> The difference with real hardware is how the operation is performed:
>> with real hardware you need to take a DIMM, go to your server room, open
>> the box, insert DIMM, go back to your seat. Asking to do some manual
>> action to actually enable memory is kinda OK. The beauty of hypervisors
>> is that everything happens automatically (e.g. when the VM is running
>> out of memory).
>
> I do not see your point. Either you have some (semi)automatic way to
> balance memory in guest based on the memory pressure (let's call it
> ballooning) or this is an administration operation (say you buy more
> DIMMs or pay more to your virtualization provider) and then it is up to
> the guest owner to tell what to do about that memory. In other words you
> really do not want to wait in the first case as you are under memory
> pressure which is _actively_ managed or this is much more relaxed
> environment.

I don't see a contradiction between what I say and what you say here :-)
Yes, there are cases when we're not in a hurry and there are cases when
we can't wait.

>
>> >> 3) Kernel command line is not a viable choice, it is rather a debug
>> >> method.
>> >
>> > Why?
>> >
>> 
>> Because we usually have just a few things there (root=, console=) and
>> the rest is used when something goes wrong or for 'special' cases, not
>> for the majority of users.
>
> auto online or even memory hotplug seems something that requires
> a special HW/configuration already so I fail to see your point. It is
> normal to put kernel parameters to override the default. And AFAIU
> default offline is a sensible default for the standard memory hotplug.
>

It depends on how we define 'standard'. The point I'm trying to make is
that it's really common for VMs to use this technique, while in the hardware
(x86) world it is a rare occasion. The 'sensible default' may differ.

> [...]
>
>> >> 2) Adding new memory can (in some extreme cases) still fail as we need
>> >> some *other* memory before we're able to online the newly added
>> >> block. This is an issue to be solved and it is doable (IMO) with some
>> >> pre-allocation.
>> >
>> > you cannot preallocate for all the possible memory that can be added.
>> 
>> For all, no, but for 1 next block - yes, and then I'll preallocate for
>> the next one.
>
> You are still thinking in the scope of your particular use case and I
> believe the whole thing is shaped around that very same thing and that
> is why it should have been rejected in the first place. Especially when
> that use case can be handled without user visible configuration knob.

I think my use case is broad enough. At least it applies to all
virtualization technologies and not only to Hyper-V. But yes, I agree
that adding a parameter to add_memory() solves my particular use case as
well.

-- 
  Vitaly



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Michal Hocko
On Mon 27-02-17 11:49:43, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
> 
> > On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
> > [...]
> >> I don't have anything new to add to the discussion happened last week
> >> but I'd like to summarize my arguments against this change:
> >> 
> >> 1) This patch doesn't solve any issue. Configuration option is not an
> >> issue by itself, it is an option for distros to decide what they want to
> >> ship: udev rule with known issues (legacy mode) or enable the new
> >> option. Distro makers and users building their kernels should be able to
> >> answer this simple question "do you want to automatically online all
> >> newly added memory or not".
> >
> > OK, so could you be more specific? Distributions have no clue about
> > which HW their kernel runs on so how can they possibly make a sensible
> > decision here?
> 
> They at least have an idea if they ship udev rule or not. I can also
> imagine different choices for non-x86 architectures but I don't know
> enough about them to have an opinion.

I really do not follow. If they know whether they ship the udev rule
then why do they need a kernel help at all? Anyway this global policy
actually breaks some usecases. Say you would have a default set to
online. What should user do if _some_ nodes should be online_movable?
Or, say that HyperV or other hotplug based ballooning implementation
really wants to online the movable memory in order to have a reliable
hotremove. Now you have a global policy which goes against it.

> >> There are distros already which ship kernels
> >> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
> >> far as I remember, maybe someone else).
> >> 
> >> 2) This patch creates an imbalance between Xen/Hyper-V on one side and
> >> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this
> >> memory won't get onlined. I don't understand how this problem is
> >> supposed to be solved by distros. They'll *have to* continue shipping
> >> a udev rule which has and always will have issues.
> >
> > They have notifications for udev to online that memory and AFAICU
> > neither KVM nor VMware are using memory hotplug for ballooning - unlike
> > HyperV and Xen.
> >
> 
> No, Hyper-V doesn't use memory hotplug for ballooning purposes. It is
> just a memory hotplug. The fact that the code is located in hv_balloon
> is just a coincidence.

OK, I might be wrong here but 1cac8cd4d146 ("Drivers: hv: balloon:
Implement hot-add functionality") suggests otherwise.

> The difference with real hardware is how the operation is performed:
> with real hardware you need to take a DIMM, go to your server room, open
> the box, insert DIMM, go back to your seat. Asking to do some manual
> action to actually enable memory is kinda OK. The beauty of hypervisors
> is that everything happens automatically (e.g. when the VM is running
> out of memory).

I do not see your point. Either you have some (semi)automatic way to
balance memory in guest based on the memory pressure (let's call it
ballooning) or this is an administration operation (say you buy more
DIMMs or pay more to your virtualization provider) and then it is up to
the guest owner to tell what to do about that memory. In other words you
really do not want to wait in the first case as you are under memory
pressure which is _actively_ managed or this is much more relaxed
environment.
 
> >> 3) Kernel command line is not a viable choice, it is rather a debug
> >> method.
> >
> > Why?
> >
> 
> Because we usually have just a few things there (root=, console=) and
> the rest is used when something goes wrong or for 'special' cases, not
> for the majority of users.

auto online or even memory hotplug seems like something that requires
a special HW/configuration already, so I fail to see your point. It is
normal to put kernel parameters in place to override the default. And AFAIU
offline is a sensible default for the standard memory hotplug.

[...]

> >> 2) Adding new memory can (in some extreme cases) still fail as we need
> >> some *other* memory before we're able to online the newly added
> >> block. This is an issue to be solved and it is doable (IMO) with some
> >> pre-allocation.
> >
> > you cannot preallocate for all the possible memory that can be added.
> 
> For all, no, but for 1 next block - yes, and then I'll preallocate for
> the next one.

You are still thinking in the scope of your particular use case and I
believe the whole thing is shaped around that very same thing and that
is why it should have been rejected in the first place. Especially when
that use case can be handled without user visible configuration knob.

-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Vitaly Kuznetsov
Heiko Carstens  writes:

> On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
>> A couple of other thoughts:
>> 1) Having all newly added memory online ASAP is probably what people
>> want for all virtual machines.

Sorry, I obviously missed 'x86' in the above statement.

>
> This is not true for s390. On s390 we have "standby" memory that a guest
> sees and potentially may use if it sets it online. Every guest that sets
> memory offline contributes to the hypervisor's standby memory pool, while
> onlining standby memory takes memory away from the standby pool.
>
> The use-case is that a system administrator in advance knows the maximum
> size a guest will ever have and also defines how much memory should be used
> at boot time. The difference is standby memory.
>
> Auto-onlining of standby memory is the last thing we want.
>

This is actually a very good example of why we need the config
option. s390 kernels will have it disabled.

-- 
  Vitaly



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Heiko Carstens
On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> A couple of other thoughts:
> 1) Having all newly added memory online ASAP is probably what people
> want for all virtual machines.

This is not true for s390. On s390 we have "standby" memory that a guest
sees and potentially may use if it sets it online. Every guest that sets
memory offline contributes to the hypervisor's standby memory pool, while
onlining standby memory takes memory away from the standby pool.

The use-case is that a system administrator in advance knows the maximum
size a guest will ever have and also defines how much memory should be used
at boot time. The difference is standby memory.

Auto-onlining of standby memory is the last thing we want.

> Unfortunately, we have additional complexity with memory zones
> (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> required. Especially, when further unplug is expected.

This also is a reason why auto-onlining doesn't seem to be the best way.




Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
> [...]
>> I don't have anything new to add to the discussion happened last week
>> but I'd like to summarize my arguments against this change:
>> 
>> 1) This patch doesn't solve any issue. Configuration option is not an
>> issue by itself, it is an option for distros to decide what they want to
>> ship: udev rule with known issues (legacy mode) or enable the new
>> option. Distro makers and users building their kernels should be able to
>> answer this simple question "do you want to automatically online all
>> newly added memory or not".
>
> OK, so could you be more specific? Distributions have no clue about
> which HW their kernel runs on so how can they possibly make a sensible
> decision here?

They at least have an idea if they ship udev rule or not. I can also
imagine different choices for non-x86 architectures but I don't know
enough about them to have an opinion.

>
>> There are distros already which ship kernels
>> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
>> far as I remember, maybe someone else).
>> 
>> 2) This patch creates an imbalance between Xen/Hyper-V on one side and
>> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this
>> memory won't get onlined. I don't understand how this problem is
>> supposed to be solved by distros. They'll *have to* continue shipping
>> a udev rule which has and always will have issues.
>
> They have notifications for udev to online that memory and AFAICU
> neither KVM nor VMware are using memory hotplug for ballooning - unlike
> HyperV and Xen.
>

No, Hyper-V doesn't use memory hotplug for ballooning purposes. It is
just a memory hotplug. The fact that the code is located in hv_balloon
is just a coincidence.

The difference with real hardware is how the operation is performed:
with real hardware you need to take a DIMM, go to your server room, open
the box, insert DIMM, go back to your seat. Asking to do some manual
action to actually enable memory is kinda OK. The beauty of hypervisors
is that everything happens automatically (e.g. when the VM is running
out of memory).

>> 3) Kernel command line is not a viable choice, it is rather a debug
>> method.
>
> Why?
>

Because we usually have just a few things there (root=, console=) and
the rest is used when something goes wrong or for 'special' cases, not
for the majority of users.

>> Having all newly added memory online as soon as possible is a
>> major use-case not something a couple of users wants (and this is
>> proved by major distros shipping the unconditional 'offline->online'
>> rule with udev).
>
> I would argue because this really depends on the usecase. a) somebody
> might want to online memory as movable and that really depends on which
> node we are talking about because not all of them can be movable

This is possible, and that's why I introduced the kernel command line option
back then. To simplify, I argue that the major use-case is 'online ASAP,
never offline', and for other use-cases we have options, both for distros
(config) and for users (command line).


> b) it
> is easier to handle potential errors from userspace than the kernel.
>

Yes, probably, but memory hotplug has been around for quite some time and I
didn't see anything but the dumb udev rule (offline->online) without any
handling. And I think that we should rather focus on fixing potential
issues and making failures less probable (e.g. it's really hard to come
up with something different from 'failed->retry').

>> A couple of other thoughts:
>> 1) Having all newly added memory online ASAP is probably what people
>> want for all virtual machines. Unfortunately, we have additional
>> complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some
>> cases manual intervention is required. Especially, when further unplug
>> is expected.
>
> and that is why we do not want to hardwire this into the kernel and we
> have a notification to handle this in userspace.

Yes and I don't know about any plans to remove this notification. In
case some really complex handling is required just don't turn on the
automatic onlining.

Memory hotplug in real x86 hardware is rare, memory hotplug for VMs is
ubiquitous.

>
>> 2) Adding new memory can (in some extreme cases) still fail as we need
>> some *other* memory before we're able to online the newly added
>> block. This is an issue to be solved and it is doable (IMO) with some
>> pre-allocation.
>
> you cannot preallocate for all the possible memory that can be added.

For all, no, but for 1 next block - yes, and then I'll preallocate for
the next one.

-- 
  Vitaly



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Michal Hocko
On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
[...]
> I don't have anything new to add to the discussion happened last week
> but I'd like to summarize my arguments against this change:
> 
> 1) This patch doesn't solve any issue. Configuration option is not an
> issue by itself, it is an option for distros to decide what they want to
> ship: udev rule with known issues (legacy mode) or enable the new
> option. Distro makers and users building their kernels should be able to
> answer this simple question "do you want to automatically online all
> newly added memory or not".

OK, so could you be more specific? Distributions have no clue about
which HW their kernel runs on so how can they possibly make a sensible
decision here?

> There are distros already which ship kernels
> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
> far as I remember, maybe someone else).
> 
> 2) This patch creates an imbalance between Xen/Hyper-V on one side and
> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this
> memory won't get onlined. I don't understand how this problem is
> supposed to be solved by distros. They'll *have to* continue shipping
> a udev rule which has and always will have issues.

They have notifications for udev to online that memory and AFAICU
neither KVM nor VMware are using memory hotplug for ballooning - unlike
HyperV and Xen.

> 3) Kernel command line is not a viable choice, it is rather a debug
> method.

Why?

> Having all newly added memory online as soon as possible is a
> major use-case not something a couple of users wants (and this is
> proved by major distros shipping the unconditional 'offline->online'
> rule with udev).

I would argue because this really depends on the usecase. a) somebody
might want to online memory as movable and that really depends on which
node we are talking about because not all of them can be movable b) it
is easier to handle potential errors from userspace than the kernel.

> A couple of other thoughts:
> 1) Having all newly added memory online ASAP is probably what people
> want for all virtual machines. Unfortunately, we have additional
> complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some
> cases manual intervention is required. Especially, when further unplug
> is expected.

and that is why we do not want to hardwire this into the kernel and we
have a notification to handle this in userspace.

> 2) Adding new memory can (in some extreme cases) still fail as we need
> some *other* memory before we're able to online the newly added
> block. This is an issue to be solved and it is doable (IMO) with some
> pre-allocation.

you cannot preallocate for all the possible memory that can be added.
-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> From: Michal Hocko 
>
> This knob has been added by 31bc3858ea3e ("memory-hotplug: add automatic
> onlining policy for the newly added memory") mainly to cover memory
> hotplug based ballooning solutions currently implemented for HyperV
> and Xen. Both of them want to online the memory as soon after
> registering as possible otherwise they can register too much memory
> which cannot be used and trigger the oom killer (we need ~1.5% of the
> registered memory so a large increase can consume all the available
> memory). hv_mem_hot_add even waits for the userspace to online the
> memory if the auto onlining is disabled to mitigate that problem.
>
> Adding yet another knob and a config option just doesn't make much sense
> IMHO. How is a random user supposed to know when to enable this option?
> Ballooning drivers know much better that they want to do an immediate
> online rather than waiting for the userspace to do that. If the memory
> is onlined for a different purpose then we already have a notification
> for the userspace and udev can handle the onlining. So the knob as well
> as the config option for the default behavior just doesn't make any
> sense. Let's remove them and allow user of add_memory to request the
> online status explicitly. Not only it makes more sense it also removes a
> lot of clutter.
>
> Signed-off-by: Michal Hocko 
> ---
>
> Hi,
> I am sending this as an RFC because this is a user visible change. Maybe
> we won't be able to remove the sysfs knob which would be sad, especially
> when it has been added without a wider discussion and IMHO it is just
> wrong. Is there any reason why a kernel command line parameter wouldn't
> work just fine?
>
> Even in that case I believe that we should remove
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE knob. It just adds to an already
> messy config space. Does anybody depend on the policy during the early
> boot before the userspace can set the sysfs knob? Or why those users cannot
> simply use the kernel command line parameter.
>
> I also believe that the wait-for-userspace in hyperV should just die. It
> should do the unconditional onlining. Same as Xen. I do not see any
> reason why those should depend on the userspace. This should be just
> fixed regardless of the sysfs/config part. I can separate this out of course.
>
> Thoughts/Concerns?

I don't have anything new to add to the discussion happened last week
but I'd like to summarize my arguments against this change:

1) This patch doesn't solve any issue. Configuration option is not an
issue by itself, it is an option for distros to decide what they want to
ship: udev rule with known issues (legacy mode) or enable the new
option. Distro makers and users building their kernels should be able to
answer this simple question "do you want to automatically online all
newly added memory or not". There are distros already which ship kernels
with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
far as I remember, maybe someone else).

2) This patch creates an imbalance between Xen/Hyper-V on one side and
KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this
memory won't get onlined. I don't understand how this problem is
supposed to be solved by distros. They'll *have to* continue shipping
a udev rule which has and always will have issues.

3) Kernel command line is not a viable choice, it is rather a debug
method. Having all newly added memory online as soon as possible is a
major use-case not something a couple of users wants (and this is
proved by major distros shipping the unconditional 'offline->online'
rule with udev).
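
(For concreteness, the rule in question is essentially the one-liner from the
kernel's memory hotplug documentation, something along the lines of

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

i.e. every block that appears is onlined to its default zone, with no way to
express online_movable per block.)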

A couple of other thoughts:
1) Having all newly added memory online ASAP is probably what people
want for all virtual machines. Unfortunately, we have additional
complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some
cases manual intervention is required. Especially, when further unplug
is expected.

2) Adding new memory can (in some extreme cases) still fail as we need
some *other* memory before we're able to online the newly added
block. This is an issue to be solved and it is doable (IMO) with some
pre-allocation.

I'd also like to note that this patch doesn't re-introduce the issue I
was fixing with in-kernel memory onlining, as all memory added through
the Hyper-V driver will be auto-onlined unconditionally. What I disagree
with here is taking away choice without fixing any real-world issues.
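
(For reference, the knob being discussed is
/sys/devices/system/memory/auto_online_blocks -- see the
show_/store_auto_online_blocks handlers removed by the patch -- which can be
flipped at runtime with e.g.

  echo online > /sys/devices/system/memory/auto_online_blocks

or defaulted at build time via CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE.)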

[snip]

-- 
  Vitaly



[Xen-devel] [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-02-27 Thread Michal Hocko
From: Michal Hocko 

This knob has been added by 31bc3858ea3e ("memory-hotplug: add automatic
onlining policy for the newly added memory") mainly to cover memory
hotplug based ballooning solutions currently implemented for HyperV
and Xen. Both of them want to online the memory as soon after
registering as possible otherwise they can register too much memory
which cannot be used and trigger the oom killer (we need ~1.5% of the
registered memory so a large increase can consume all the available
memory). hv_mem_hot_add even waits for the userspace to online the
memory if the auto onlining is disabled to mitigate that problem.

Adding yet another knob and a config option just doesn't make much sense
IMHO. How is a random user supposed to know when to enable this option?
Ballooning drivers know much better that they want to do an immediate
online rather than waiting for the userspace to do that. If the memory
is onlined for a different purpose then we already have a notification
for the userspace and udev can handle the onlining. So the knob as well
as the config option for the default behavior just doesn't make any
sense. Let's remove them and allow the user of add_memory to request the
online status explicitly. Not only does it make more sense, it also removes
a lot of clutter.

Signed-off-by: Michal Hocko 
---

Hi,
I am sending this as an RFC because this is a user visible change. Maybe
we won't be able to remove the sysfs knob which would be sad, especially
when it has been added without a wider discussion and IMHO it is just
wrong. Is there any reason why a kernel command line parameter wouldn't
work just fine?

Even in that case I believe that we should remove
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE knob. It just adds to an already
messy config space. Does anybody depend on the policy during the early
boot, before userspace can set the sysfs knob? And why can those users not
simply use the kernel command line parameter?
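
(That parameter already exists as memhp_default_state=, so booting with e.g.

  memhp_default_state=online

gives the auto-online behavior without carrying the extra Kconfig knob.)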

I also believe that the wait-for-userspace in hyperV should just die. It
should do the unconditional onlining. Same as Xen. I do not see any
reason why those should depend on the userspace. This should be just
fixed regardless of the sysfs/config part. I can separate this out of course.

Thoughts/Concerns?

 drivers/acpi/acpi_memhotplug.c |  2 +-
 drivers/base/memory.c  | 33 +
 drivers/hv/hv_balloon.c| 26 +-
 drivers/s390/char/sclp_cmd.c   |  2 +-
 drivers/xen/balloon.c  |  2 +-
 include/linux/memory_hotplug.h |  4 +---
 mm/Kconfig | 16 
 mm/memory_hotplug.c| 22 ++
 8 files changed, 8 insertions(+), 99 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309c..2b1c35fb36d1 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = add_memory(node, info->start_addr, info->length);
+   result = add_memory(node, info->start_addr, info->length, false);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index fa26ffd25fa6..476c2c02f938 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -446,37 +446,6 @@ print_block_size(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR(block_size_bytes, 0444, print_block_size, NULL);
 
 /*
- * Memory auto online policy.
- */
-
-static ssize_t
-show_auto_online_blocks(struct device *dev, struct device_attribute *attr,
-   char *buf)
-{
-   if (memhp_auto_online)
-   return sprintf(buf, "online\n");
-   else
-   return sprintf(buf, "offline\n");
-}
-
-static ssize_t
-store_auto_online_blocks(struct device *dev, struct device_attribute *attr,
-const char *buf, size_t count)
-{
-   if (sysfs_streq(buf, "online"))
-   memhp_auto_online = true;
-   else if (sysfs_streq(buf, "offline"))
-   memhp_auto_online = false;
-   else
-   return -EINVAL;
-
-   return count;
-}
-
-static DEVICE_ATTR(auto_online_blocks, 0644, show_auto_online_blocks,
-  store_auto_online_blocks);
-
-/*
  * Some architectures will have custom drivers to do this, and
  * will not need to do it from userspace.  The fake hot-add code
  * as well as ppc64 will do all of their discovery in userspace
@@ -500,7 +469,7 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
 
nid = memory_add_physaddr_to_nid(phys_addr);
ret = add_memory(nid, phys_addr,
-MIN_MEMORY_BLOCK_SIZE *