Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-22 Thread peter enderborg
On 11/03/2017 07:35 AM, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
> signals to the kernel that epoll_wait() and friends may not complete,
> and the kernel may send SIGKILL if resources get tight.
>
> See my systemd patch: https://github.com/shawnl/systemd/tree/prctl
>
> Android uses this memory model for all programs, and having it in the
> kernel will enable integration with the page cache (not in this
> series).
>
> 16 bytes per process is kinda spendy, but I want to keep
> lru behavior, which mem_score_adj does not allow. When a supervisor,
> like Android's user input is keeping track this can be done in user-space.
> It could be pulled out of task_struct if an additional cross-indexing
> red-black tree is added to support pid-based lookup.
What Android version is using systemd?
In Android there is an onTrimMemory callback, sent from the ActivityManager, that
you can listen for and use to make a clean exit.


Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-20 Thread Michal Hocko
On Mon 20-11-17 20:48:10, Shawn Landden wrote:
> On Mon, Nov 20, 2017 at 12:35 AM, Michal Hocko  wrote:
> > On Fri 17-11-17 20:45:03, Shawn Landden wrote:
> >> On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko  wrote:
> >>
> >> > On Thu 02-11-17 23:35:44, Shawn Landden wrote:
> >> > > It is common for services to be stateless around their main event loop.
> >> > > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
> >> > > signals to the kernel that epoll_wait() and friends may not complete,
> >> > > and the kernel may send SIGKILL if resources get tight.
> >> > >
> >> > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl
> >> > >
> >> > > Android uses this memory model for all programs, and having it in the
> >> > > kernel will enable integration with the page cache (not in this
> >> > > series).
> >> > >
> >> > > 16 bytes per process is kinda spendy, but I want to keep
> >> > > lru behavior, which mem_score_adj does not allow. When a supervisor,
> >> > > like Android's user input is keeping track this can be done in
> >> > user-space.
> >> > > It could be pulled out of task_struct if an additional cross-indexing
> >> > > red-black tree is added to support pid-based lookup.
> >> >
> >> > This is still an abuse and the patch is wrong. We really do have an API
> >> > to use; I fail to see why you do not use it.
> >> >
> >> When I looked at wait_queue_head_t it was 20 bytes.
> >
> > I do not understand. What I meant to say is that we do have a proper
> > user api to hint OOM killer decisions.
> This is a FIFO queue, rather than a heuristic, which is all you get
> with the current API.

Yes, I can read the code. All I am saying is that we already have an API
to achieve what you want, or at least something very similar.

Let me be explicit.
Nacked-by: Michal Hocko 
until it is sufficiently explained that the oom_score_adj is not
suitable and there are no other means to achieve what you need.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-20 Thread Shawn Landden
On Mon, Nov 20, 2017 at 12:35 AM, Michal Hocko  wrote:
> On Fri 17-11-17 20:45:03, Shawn Landden wrote:
>> On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko  wrote:
>>
>> > On Thu 02-11-17 23:35:44, Shawn Landden wrote:
>> > > It is common for services to be stateless around their main event loop.
>> > > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
>> > > signals to the kernel that epoll_wait() and friends may not complete,
>> > > and the kernel may send SIGKILL if resources get tight.
>> > >
>> > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl
>> > >
>> > > Android uses this memory model for all programs, and having it in the
>> > > kernel will enable integration with the page cache (not in this
>> > > series).
>> > >
>> > > 16 bytes per process is kinda spendy, but I want to keep
>> > > lru behavior, which mem_score_adj does not allow. When a supervisor,
>> > > like Android's user input is keeping track this can be done in
>> > user-space.
>> > > It could be pulled out of task_struct if an additional cross-indexing
>> > > red-black tree is added to support pid-based lookup.
>> >
>> > This is still an abuse and the patch is wrong. We really do have an API
>> > to use; I fail to see why you do not use it.
>> >
>> When I looked at wait_queue_head_t it was 20 bytes.
>
> I do not understand. What I meant to say is that we do have a proper
> user api to hint OOM killer decisions.
This is a FIFO queue, rather than a heuristic, which is all you get
with the current API.
> --
> Michal Hocko
> SUSE Labs


Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-20 Thread Michal Hocko
On Fri 17-11-17 20:45:03, Shawn Landden wrote:
> On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko  wrote:
> 
> > On Thu 02-11-17 23:35:44, Shawn Landden wrote:
> > > It is common for services to be stateless around their main event loop.
> > > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
> > > signals to the kernel that epoll_wait() and friends may not complete,
> > > and the kernel may send SIGKILL if resources get tight.
> > >
> > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl
> > >
> > > Android uses this memory model for all programs, and having it in the
> > > kernel will enable integration with the page cache (not in this
> > > series).
> > >
> > > 16 bytes per process is kinda spendy, but I want to keep
> > > lru behavior, which mem_score_adj does not allow. When a supervisor,
> > > like Android's user input is keeping track this can be done in
> > user-space.
> > > It could be pulled out of task_struct if an additional cross-indexing
> > > red-black tree is added to support pid-based lookup.
> >
> > This is still an abuse and the patch is wrong. We really do have an API
> > to use; I fail to see why you do not use it.
> >
> When I looked at wait_queue_head_t it was 20 bytes.

I do not understand. What I meant to say is that we do have a proper
user api to hint OOM killer decisions.
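
For reference, the knob in question is the per-process
/proc/<pid>/oom_score_adj file (range -1000 to 1000, where 1000 makes the
task the preferred OOM victim). A minimal user-space sketch of a service
opting in through it, with error handling kept to a minimum:

/* Mark the calling process as the preferred OOM victim using the
 * existing /proc/<pid>/oom_score_adj interface. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int make_me_oom_preferred(void)
{
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "1000\n", 5);
	close(fd);
	return n == 5 ? 0 : -1;
}

int main(void)
{
	if (make_me_oom_preferred() != 0)
		perror("oom_score_adj");

	/* ... the stateless idle loop would go here ... */
	pause();
	return 0;
}
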
-- 
Michal Hocko
SUSE Labs


Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-18 Thread Matthew Wilcox
On Fri, Nov 17, 2017 at 08:45:03PM -0800, Shawn Landden wrote:
> On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko  wrote:
> > On Thu 02-11-17 23:35:44, Shawn Landden wrote:
> > > 16 bytes per process is kinda spendy, but I want to keep
> > > lru behavior, which mem_score_adj does not allow. When a supervisor,
> > > like Android's user input is keeping track this can be done in
> > user-space.
> > > It could be pulled out of task_struct if an additional cross-indexing
> > > red-black tree is added to support pid-based lookup.
> >
> > This is still an abuse and the patch is wrong. We really do have an API
> > to use; I fail to see why you do not use it.
> >
> When I looked at wait_queue_head_t it was 20 bytes.

24 bytes actually; the compiler will add 4 bytes of padding between
the spinlock and the list_head.  But there's one for the entire system.
Then you add a 40-byte structure (wait_queue_entry) on the stack for each
sleeping process.  There's no per-process cost.
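
To make the arithmetic concrete, here is a small user-space mock-up of the
two layouts (assuming a 64-bit build with a 4-byte spinlock_t and no lock
debugging; the names mirror the kernel structures, but this is not kernel
code):

#include <stdio.h>

struct mock_list_head {				/* struct list_head */
	struct mock_list_head *next, *prev;	/* 16 bytes */
};

struct mock_wait_queue_head {
	unsigned int lock;			/* spinlock_t: 4 bytes */
	/* 4 bytes of padding here so the list_head is 8-byte aligned */
	struct mock_list_head head;		/* 16 bytes */
};						/* 24 bytes, one per wait queue */

struct mock_wait_queue_entry {
	unsigned int flags;			/* 4 bytes + 4 bytes of padding */
	void *private;				/* 8 bytes */
	int (*func)(void *);			/* wait_func_t: 8 bytes */
	struct mock_list_head entry;		/* 16 bytes */
};						/* 40 bytes, on each sleeper's stack */

int main(void)
{
	printf("wait_queue_head:  %zu bytes\n", sizeof(struct mock_wait_queue_head));
	printf("wait_queue_entry: %zu bytes\n", sizeof(struct mock_wait_queue_entry));
	return 0;
}

On x86-64 this should print 24 and 40, matching the numbers above.
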


Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-18 Thread Shawn Landden
On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko  wrote:
> On Thu 02-11-17 23:35:44, Shawn Landden wrote:
>> 16 bytes per process is kinda spendy, but I want to keep
>> lru behavior, which mem_score_adj does not allow. When a supervisor,
>> like Android's user input is keeping track this can be done in user-space.
>> It could be pulled out of task_struct if an additional cross-indexing
>> red-black tree is added to support pid-based lookup.
>
> This is still an abuse and the patch is wrong. We really do have an API
> to use; I fail to see why you do not use it.
When I looked at wait_queue_head_t it was 20 bytes.


Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-15 Thread Pavel Machek
Hi!

> It is common for services to be stateless around their main event loop.
> If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
> signals to the kernel that epoll_wait() and friends may not complete,
> and the kernel may send SIGKILL if resources get tight.
> 
> See my systemd patch: https://github.com/shawnl/systemd/tree/prctl
> 
> Android uses this memory model for all programs, and having it in the
> kernel will enable integration with the page cache (not in this
> series).
> 
> 16 bytes per process is kinda spendy, but I want to keep
> lru behavior, which mem_score_adj does not allow. When a supervisor,
> like Android's user input is keeping track this can be done in user-space.
> It could be pulled out of task_struct if an additional cross-indexing
> red-black tree is added to support pid-based lookup.

Having an Android-like system for low-memory killing might be
interesting... but rather than throwing around patches, maybe there
should first be a discussion on lkml about what the interface should
look like?

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-03 Thread Michal Hocko
On Thu 02-11-17 23:35:44, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
> signals to the kernel that epoll_wait() and friends may not complete,
> and the kernel may send SIGKILL if resources get tight.
> 
> See my systemd patch: https://github.com/shawnl/systemd/tree/prctl
> 
> Android uses this memory model for all programs, and having it in the
> kernel will enable integration with the page cache (not in this
> series).
> 
> 16 bytes per process is kinda spendy, but I want to keep
> lru behavior, which mem_score_adj does not allow. When a supervisor,
> like Android's user input is keeping track this can be done in user-space.
> It could be pulled out of task_struct if an additional cross-indexing
> red-black tree is added to support pid-based lookup.

This is still an abuse and the patch is wrong. We really do have an API
to use; I fail to see why you do not use it.

[...]
> @@ -1018,6 +1060,24 @@ bool out_of_memory(struct oom_control *oc)
>   return true;
>   }
>  
> + /*
> +  * Check death row for current memcg or global.
> +  */
> + l = oom_target_get_queue(current);
> + if (!list_empty(l)) {
> + struct task_struct *ts = list_first_entry(l,
> + struct task_struct, se.oom_target_queue);
> +
> + pr_debug("Killing pid %u from EPOLL_KILLME death row.",
> +  ts->pid);
> +
> + /* We use SIGKILL instead of the oom killer
> +  * so as to cleanly interrupt ep_poll()
> +  */
> + send_sig(SIGKILL, ts, 1);
> + return true;
> + }

Still not NUMA aware, and completely backwards. If this is a memcg OOM
then it is the _memcg_ that should be evaluated, not current. The OOM
might happen up the hierarchy due to a hard limit.

But still, you should be very clear about _why_ the existing oom tuning is
not appropriate; then we can think of a way to handle it better. Cramming
the oom selection in this way is simply not acceptable.
-- 
Michal Hocko
SUSE Labs


[RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops

2017-11-03 Thread Shawn Landden
It is common for services to be stateless around their main event loop.
If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
signals to the kernel that epoll_wait() and friends may not complete,
and the kernel may send SIGKILL if resources get tight.
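
To make the intended calling convention concrete, here is a minimal
user-space sketch. PR_SET_IDLE and PR_IDLE_MODE_KILLME are the constants
this series proposes; they are not in mainline, so the numeric values
below are placeholders for illustration only:

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/prctl.h>

#ifndef PR_SET_IDLE
#define PR_SET_IDLE		48	/* placeholder value */
#define PR_IDLE_MODE_KILLME	1	/* placeholder value */
#endif

int main(void)
{
	int epfd = epoll_create1(0);
	struct epoll_event ev;

	/* Tell the kernel this process is stateless while it waits:
	 * epoll_wait() may never return, because the process may be
	 * SIGKILLed instead when memory gets tight. */
	if (prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME, 0, 0, 0) < 0)
		perror("prctl(PR_SET_IDLE)");

	for (;;) {
		if (epoll_wait(epfd, &ev, 1, -1) > 0) {
			/* handle the event, then go idle again */
		}
	}
}
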

See my systemd patch: https://github.com/shawnl/systemd/tree/prctl

Android uses this memory model for all programs, and having it in the
kernel will enable integration with the page cache (not in this
series).

16 bytes per process is kinda spendy, but I want to keep
LRU behavior, which mem_score_adj does not allow. When a supervisor,
like Android's user-input tracking, is keeping track, this can be done in user-space.
It could be pulled out of task_struct if an additional cross-indexing
red-black tree is added to support pid-based lookup.
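
The fs/proc/array.c hunk below also exposes the flag as an "Idle:" line in
/proc/<pid>/status. A quick sketch of reading it back (illustrative only;
the field exists only with this series applied):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "Idle:", 5) == 0)
			fputs(line, stdout);	/* e.g. "Idle:\t1" */
	}
	fclose(f);
	return 0;
}
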

v2
switch to prctl, memcg support
---
 fs/eventpoll.c | 17 +
 fs/proc/array.c|  7 ++
 include/linux/memcontrol.h |  3 +++
 include/linux/oom.h|  4 
 include/linux/sched.h  |  4 
 include/uapi/linux/prctl.h |  4 
 kernel/cgroup/cgroup.c | 12 ++
 kernel/exit.c  |  2 ++
 kernel/sys.c   |  9 +++
 mm/memcontrol.c|  4 
 mm/oom_kill.c  | 60 ++
 11 files changed, 126 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2fabd19cdeea..04011fca038b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * LOCKING:
@@ -1762,6 +1763,14 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
	wait_queue_entry_t wait;
	ktime_t expires, *to = NULL;

+	if (current->oom_target) {
+		spin_lock(oom_target_get_spinlock(current));
+		list_add(&current->se.oom_target_queue,
+			 oom_target_get_queue(current));
+		current->se.oom_target_on_queue = 1;
+		spin_unlock(oom_target_get_spinlock(current));
+	}
+
if (timeout > 0) {
struct timespec64 end_time = ep_set_mstimeout(timeout);
 
@@ -1783,6 +1792,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
	if (!ep_events_available(ep))
		ep_busy_loop(ep, timed_out);

+
	spin_lock_irqsave(&ep->lock, flags);
 
if (!ep_events_available(ep)) {
@@ -1850,6 +1860,13 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
		goto fetch_events;

+	if (current->oom_target) {
+		spin_lock(oom_target_get_spinlock(current));
+		list_del(&current->se.oom_target_queue);
+		current->se.oom_target_on_queue = 0;
+		spin_unlock(oom_target_get_spinlock(current));
+	}
+
return res;
 }
 
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 77a8eacbe032..cab009727a7f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -349,6 +349,12 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
seq_putc(m, '\n');
 }
 
+static inline void task_idle(struct seq_file *m, struct task_struct *p)
+{
+   seq_put_decimal_ull(m, "Idle:\t", p->oom_target);
+   seq_putc(m, '\n');
+}
+
 static inline void task_context_switch_counts(struct seq_file *m,
struct task_struct *p)
 {
@@ -380,6 +386,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_sig(m, task);
task_cap(m, task);
task_seccomp(m, task);
+   task_idle(m, task);
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..40a2db8ae522 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -261,6 +261,9 @@ struct mem_cgroup {
struct list_head event_list;
spinlock_t event_list_lock;
 
+	struct list_head	oom_target_queue;
+   spinlock_t  oom_target_spinlock;
+
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..a5d16eb05297 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -101,6 +101,10 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern void exit_oom_target(void);
+struct list_head *oom_target_get_queue(struct task_struct *ts);
+spinlock_t *oom_target_get_spinlock(struct task_struct *ts);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a7df4e558c..2b110c4d7357 100644
--- 
