Re: [PATCH 1/2] sched: deferred set priority (dprio) -- rebased for the tip

2014-09-26 Thread Sergey Oboguev
On Fri, Sep 26, 2014 at 6:23 AM, Peter Zijlstra  wrote:

>> This is a replica of "[PATCH 1/2] dprio" (posted yesterday for 3.16.3)
>> rebased now for the current tip (3.17.0-rc6).
>>
> 2 lines of changelog for a ~1300 line patch, you must be kidding, right?
>
> What problem is it solving and why do I care?


Hi Peter,

It is described in message [PATCH 0/2]
https://lkml.org/lkml/2014/9/24/1030
and in the links from it,
as well as in the man page (patch message [PATCH 2/2])
https://lkml.org/lkml/2014/9/24/1043

- Sergey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] sched: deferred set priority (dprio) -- rebased for the tip

2014-09-25 Thread Sergey Oboguev
This is a replica of "[PATCH 1/2] dprio" (posted yesterday for 3.16.3)
rebased now for the current tip (3.17.0-rc6).

Signed-off-by: Sergey Oboguev 

---
 Documentation/sysctl/kernel.txt |  14 +
 fs/exec.c   |   8 +
 include/linux/dprio.h   | 129 +
 include/linux/init_task.h   |  17 ++
 include/linux/sched.h   |  19 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/capability.h |   5 +-
 include/uapi/linux/dprio_api.h  | 137 +
 include/uapi/linux/prctl.h  |   2 +
 init/Kconfig|   2 +
 kernel/Kconfig.dprio|  68 +
 kernel/exit.c   |   6 +
 kernel/fork.c   |  88 +-
 kernel/sched/Makefile   |   1 +
 kernel/sched/core.c | 195 -
 kernel/sched/dprio.c| 617 
 kernel/sys.c|   6 +
 kernel/sysctl.c |  12 +
 18 files changed, 1315 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index f79eb96..7b379cd 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -30,6 +30,7 @@ show up in /proc/sys/kernel:
 - core_uses_pid
 - ctrl-alt-del
 - dmesg_restrict
+- dprio_privileged
 - domainname
 - hostname
 - hotplug
@@ -267,6 +268,19 @@ default value of dmesg_restrict.

 ==

+dprio_privileged:
+
+This toggle indicates whether unprivileged users are prevented
+from using dprio(2) to execute deferred set priority requests.
+When dprio_privileged is set to (0), there are no restrictions.
+When dprio_privileged is set to (1), users must have CAP_DPRIO
+to use dprio(2), i.e. prctl(PR_SET_DEFERRED_SETPRIO).
+
+The kernel config option CONFIG_DEFERRED_SETPRIO_PRIVILEGED sets
+the default value of dprio_privileged.
+
+==
+
 domainname & hostname:

 These files can be used to set the NIS/YP domainname and the
diff --git a/fs/exec.c b/fs/exec.c
index a2b42a9..439bc42 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -1430,6 +1431,7 @@ static int do_execve_common(struct filename *filename,
 struct file *file;
 struct files_struct *displaced;
 int retval;
+struct dprio_saved_context dprio_context;

 if (IS_ERR(filename))
 return PTR_ERR(filename);
@@ -1480,6 +1482,9 @@ static int do_execve_common(struct filename *filename,
 if (retval)
 goto out_unmark;

+dprio_handle_request();
+dprio_save_reset_context(&dprio_context);
+
 bprm->argc = count(argv, MAX_ARG_STRINGS);
 if ((retval = bprm->argc) < 0)
 goto out;
@@ -1518,6 +1523,7 @@ static int do_execve_common(struct filename *filename,
 putname(filename);
 if (displaced)
 put_files_struct(displaced);
+dprio_free_context(&dprio_context);
 return retval;

 out:
@@ -1526,6 +1532,8 @@ out:
 mmput(bprm->mm);
 }

+dprio_restore_context(&dprio_context);
+
 out_unmark:
 current->fs->in_exec = 0;
 current->in_execve = 0;
diff --git a/include/linux/dprio.h b/include/linux/dprio.h
new file mode 100644
index 000..1119c00
--- /dev/null
+++ b/include/linux/dprio.h
@@ -0,0 +1,129 @@
+/*
+ * include/linux/dprio.h
+ *
+ * Deferred set priority.
+ *
+ * Started by (C) 2014 Sergey Oboguev 
+ *
+ * This code is licenced under the GPL version 2 or later.
+ * For details see linux-kernel-base/COPYING.
+ */
+
+#ifndef _LINUX_DPRIO_H
+#define _LINUX_DPRIO_H
+
+#include 
+#include 
+
+#ifdef CONFIG_DEFERRED_SETPRIO
+
+/*
+ * @mask contains bit-flags indicating which policies have been pre-approved.
+ * Other fields are valid only if the corresponding bit is set in the @mask.
+ */
+static __always_inline void __dprio_info_assumptions(void)
+{
+/* SCHED_xxx is used as a bit index in @mask */
+BUILD_BUG_ON(SCHED_NORMAL > 31);
+BUILD_BUG_ON(SCHED_FIFO > 31);
+BUILD_BUG_ON(SCHED_RR > 31);
+BUILD_BUG_ON(SCHED_BATCH > 31);
+BUILD_BUG_ON(SCHED_IDLE > 31);
+}
+struct dprio_info {
+unsigned mask;
+s32 normal_sched_nice;
+s32 batch_sched_nice;
+u32 fifo_sched_priority;
+u32 rr_sched_priority;
+bool capable_sys_nice;
+};
+
+/*
+ * Called by dup_task_struct to reset non-inherited fields
+ */
+static __always_inline void set_task_in_dprio(struct task_struct *tsk,
+  bool in_dprio)
+{
+#ifdef CONFIG_DEBUG_DEFERRED_SETPRIO
+tsk->in_dprio = in_dprio;
+#endif
+}
+
+static inline void dprio_dup_task_struct(struct task_struct *tsk)
+{
+/* reset deferred setprio fields not inherited from the parent */
+tsk->dprio_ku_area_pp = NULL;
+tsk->dprio_info = NULL;
+set_task_in_dprio(tsk, false);
+}
+
+void dprio_d

[PATCH 0/2] sched: deferred set priority (dprio)

2014-09-24 Thread Sergey Oboguev
This is the resulting version of the patch, based on a previous RFC.

This patch is intended to improve the support for fine-grain parallel
applications that may sometimes need to change the priority of their threads at
a very high rate, hundreds or even thousands of times per scheduling timeslice.

These are typically applications that have to execute short or very short
lock-holding critical or otherwise time-urgent sections of code at a very high
frequency and need to protect these sections with "set priority" system calls,
one "set priority" call to elevate current thread priority before entering the
critical or time-urgent section, followed by another call to downgrade thread
priority at the completion of the section. Due to the high frequency of
entering and leaving critical or time-urgent sections, the cost of these "set
priority" system calls may rise to a noticeable fraction of an application's
overall expended CPU time. The proposed "deferred set priority" facility
largely eliminates the cost of these system calls.

Instead of executing a system call to elevate its thread priority, an
application simply writes its desired priority level to a designated memory
location in userspace. When the kernel attempts to preempt the thread, it
first checks the content of this location, and if the application has posted
a request to change its priority in the designated memory area, the kernel
will execute this request and alter the priority of the thread being
preempted before performing a rescheduling, and then make a scheduling
decision based on the new thread priority level, thus implementing the
priority protection of the critical or time-urgent section desired by the
application. In the predominant number of cases, however, an application will
complete the critical section before the end of the current timeslice and
cancel or alter the request held in the userspace area. Thus the vast
majority of an application's change priority requests will be handled and
mutually cancelled or coalesced within userspace, at a very low overhead and
without incurring the cost of a system call, while maintaining safe
preemption control. The cost of an actual kernel-level "set priority"
operation is incurred only if an application is actually being preempted
while inside the critical section, i.e. typically at most once per
scheduling timeslice instead of hundreds or thousands of "set priority"
system calls in the same timeslice.

One of the intended purposes of this facility (though not its sole purpose) is
to provide a lightweight mechanism for priority protection of lock-holding
critical sections that would be an adequate match for lightweight locking
primitives such as futexes, with both featuring a fast path completing within
userspace.

A more detailed description can be found in:
https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt
and also in the accompanying man page in the subsequent message.

Message 1/2 contains the patch to the kernel tree (3.16.3).
Message 2/2 contains the patch for man pages tree.

User-level library implementing userspace-side boilerplate code:
https://github.com/oboguev/dprio/tree/master/src/userlib

Test set:
https://github.com/oboguev/dprio/tree/master/src/test

Previous RFC discussion:
https://lkml.org/lkml/2014/7/25/600

Brief summary/conclusions of the discussion:
https://lkml.org/lkml/2014/8/13/744

The patch is enabled with CONFIG_DEFERRED_SETPRIO.
There are also a few other config settings: a setting for dprio debug code,
a setting that controls whether the facility is available by default for all
users or limited to tasks with CAP_DPRIO, and a setting to improve the
determinism of the rescheduling latency when a dprio request is pending
under low-memory conditions. Please see dprio.txt and the man page for details,
as well as the write-up in kernel/Kconfig.dprio.

The changes compared to the RFC version are:
- Replace authorization list with CAP_DPRIO and sysctl kernel.dprio_privileged.
- Move dprio_ku_area_pp inside task_struct so it is likely to share the same
  cache line with other locations accessed during __schedule().

Signed-off-by: Sergey Oboguev 

 Documentation/sysctl/kernel.txt |  14 +
 fs/exec.c   |   8 +
 include/linux/dprio.h   | 129 +
 include/linux/init_task.h   |  17 ++
 include/linux/sched.h   |  19 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/capability.h |   5 +-
 include/uapi/linux/dprio_api.h  | 137 +
 include/uapi/linux/prctl.h  |   2 +
 init/Kconfig|   2 +
 kernel/Kconfig.dprio|  68 +
 kernel/exit.c   |   6 +
 kernel/fork.c   |  88 +-
 kernel/sched/Makefile   |   1 +
 kernel/sched/core.c | 195 -
 kernel/sched/dprio.c| 617 
 kernel/sys.c   

[PATCH 2/2] sched: deferred set priority (dprio)

2014-09-24 Thread Sergey Oboguev
Message 1/2 contains the patch to the kernel tree (3.16.3).
Message 2/2 contains the patch for man pages tree.

Signed-off-by: Sergey Oboguev 

---
 man2/dprio.2 | 784 +++
 man2/prctl.2 |   5 +
 2 files changed, 789 insertions(+)

diff --git a/man2/dprio.2 b/man2/dprio.2
new file mode 100644
index 000..9d457ec
--- /dev/null
+++ b/man2/dprio.2
@@ -0,0 +1,784 @@
+.\" Copyright (C) 2003 Free Software Foundation, Inc.
+.\" This file is distributed according to the GNU General Public License.
+.\" See the file COPYING in the top level source directory for details.
+.\"
+.\" Written by Sergey Oboguev.
+.TH DPRIO 2 2014-09-05 "Linux" "Linux Programmer's Manual"
+.SH NAME
+dprio \- Deferred Set Task Priority
+.SH "DESCRIPTION"
+
+.BR dprio
+is a facility to reduce the computational
+cost of frequently-repeated
+.BR sched_setattr (2)
+system calls, intended for use by applications
+that frequently execute this call.
+
+Applications relying on fine-grain parallelism
+may sometimes need to change their threads'
+priority at a very high rate, hundreds or even
+thousands of times per typical scheduling
+timeslice. These are typically applications that
+have to execute short or very short lock-holding
+critical or otherwise time-urgent sections of
+code at a very high frequency and need to protect
+these sections with "set priority" system calls,
+one "set priority" call to elevate current thread
+priority before entering the critical or
+time-urgent section, followed by another call to
+downgrade thread priority at the completion of
+the section.
+
+Due to the high frequency of entering and leaving
+critical or time-urgent sections, the cost of
+these "set priority" system calls may rise to a
+noticeable fraction of an application's overall
+expended CPU time.
+
+.BR dprio
+largely eliminates the overhead of
+these system calls.
+
+Instead of executing a system call to elevate its
+thread priority, an application simply writes its
+desired priority level to a designated memory
+area in userspace. When the kernel attempts
+to preempt the thread, it first checks the
+content of this area, and if the application
+has posted a request to change its priority
+in the designated memory area, the
+kernel will execute this request and alter the
+priority of the thread being preempted before
+performing a rescheduling, and then make
+scheduling decisions based on the new thread
+priority level, thus implementing the priority
+protection of the critical or time-urgent section
+desired by the application.
+
+In the predominant number of cases, however, an
+application will complete the critical section
+before the end of the current timeslice and
+cancel or alter the request held in the userspace
+area. Thus the vast majority of an application's
+change priority requests will be handled and
+mutually cancelled or coalesced within
+userspace, at a very low overhead and without
+incurring the cost of a system call, while
+maintaining safe preemption control. The cost of
+an actual kernel-level "set priority" operation
+is incurred only if an application is actually
+being preempted while inside the critical
+section, i.e. typically at most once per
+scheduling timeslice instead of hundreds or
+thousands of "set priority" system calls in the
+same timeslice.
+
+Application developers normally would not
+interface to
+.BR dprio
+directly and would rather use
+a user-level library simplifying the use of the
+.BR dprio
+facility and shielding an application
+developer from having to bother about low-level
+details and reimplementing the boilerplate code
+likely to be common and repetitive for all
+applications using the deferred set priority
+mechanism.
+
+The description below of the
+.BR dprio
+low-level interface is
+thus of interest mostly to library and kernel
+developers.
+
+.BR dprio
+facility consists of two parts.
+
+One is the interface exposed via
+.BR prctl (2)
+with the option
+.BR PR_SET_DEFERRED_SETPRIO.
+It is used to set up the communication protocol
+between the kernel and an area in an
+application's memory space, as designated by the
+application.
+It is also used to terminate the protocol.
+
+The other part is the
+.BR dprio
+userspace <-> kernel
+protocol itself, as described further below.
+
+
+.SH DPRIO PRCTL INTERFACE
+
+.BR prctl (2)
+called with option
+.BR PR_SET_DEFERRED_SETPRIO
+has two forms: one to
+set up the use of deferred set priority facility
+for the current thread, another to terminate the
+use of
+.BR dprio
+for the thread.
+
+The system call to set up the use of
+.BR dprio
+for the thread takes the form
+
+prctl(PR_SET_DEFERRED_SETPRIO,
+  dprio_ku_area_pp,
+  sched_attrs_pp,
+  sched_attrs_count,
+  0)
+
+The effect of this call is limited to the current
+thread only.
+
+\f

[PATCH 1/2] sched: deferred set priority (dprio)

2014-09-24 Thread Sergey Oboguev
Message 1/2 contains the patch to the kernel tree (3.16.3).
Message 2/2 contains the patch for man pages tree.

Signed-off-by: Sergey Oboguev 

---
 Documentation/sysctl/kernel.txt |  14 +
 fs/exec.c   |   8 +
 include/linux/dprio.h   | 129 +
 include/linux/init_task.h   |  17 ++
 include/linux/sched.h   |  19 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/capability.h |   5 +-
 include/uapi/linux/dprio_api.h  | 137 +
 include/uapi/linux/prctl.h  |   2 +
 init/Kconfig|   2 +
 kernel/Kconfig.dprio|  68 +
 kernel/exit.c   |   6 +
 kernel/fork.c   |  88 +-
 kernel/sched/Makefile   |   1 +
 kernel/sched/core.c | 195 -
 kernel/sched/dprio.c| 617 
 kernel/sys.c|   6 +
 kernel/sysctl.c |  12 +
 18 files changed, 1315 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index c14374e..012cbad 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -30,6 +30,7 @@ show up in /proc/sys/kernel:
 - core_uses_pid
 - ctrl-alt-del
 - dmesg_restrict
+- dprio_privileged
 - domainname
 - hostname
 - hotplug
@@ -267,6 +268,19 @@ default value of dmesg_restrict.

 ==

+dprio_privileged:
+
+This toggle indicates whether unprivileged users are prevented
+from using dprio(2) to execute deferred set priority requests.
+When dprio_privileged is set to (0), there are no restrictions.
+When dprio_privileged is set to (1), users must have CAP_DPRIO
+to use dprio(2), i.e. prctl(PR_SET_DEFERRED_SETPRIO).
+
+The kernel config option CONFIG_DEFERRED_SETPRIO_PRIVILEGED sets
+the default value of dprio_privileged.
+
+==
+
 domainname & hostname:

 These files can be used to set the NIS/YP domainname and the
diff --git a/fs/exec.c b/fs/exec.c
index a3d33fe..49a5547 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -1434,6 +1435,7 @@ static int do_execve_common(struct filename *filename,
 struct file *file;
 struct files_struct *displaced;
 int retval;
+struct dprio_saved_context dprio_context;

 if (IS_ERR(filename))
 return PTR_ERR(filename);
@@ -1484,6 +1486,9 @@ static int do_execve_common(struct filename *filename,
 if (retval)
 goto out_unmark;

+dprio_handle_request();
+dprio_save_reset_context(&dprio_context);
+
 bprm->argc = count(argv, MAX_ARG_STRINGS);
 if ((retval = bprm->argc) < 0)
 goto out;
@@ -1522,6 +1527,7 @@ static int do_execve_common(struct filename *filename,
 putname(filename);
 if (displaced)
 put_files_struct(displaced);
+dprio_free_context(&dprio_context);
 return retval;

 out:
@@ -1530,6 +1536,8 @@ out:
 mmput(bprm->mm);
 }

+dprio_restore_context(&dprio_context);
+
 out_unmark:
 current->fs->in_exec = 0;
 current->in_execve = 0;
diff --git a/include/linux/dprio.h b/include/linux/dprio.h
new file mode 100644
index 000..1119c00
--- /dev/null
+++ b/include/linux/dprio.h
@@ -0,0 +1,129 @@
+/*
+ * include/linux/dprio.h
+ *
+ * Deferred set priority.
+ *
+ * Started by (C) 2014 Sergey Oboguev 
+ *
+ * This code is licenced under the GPL version 2 or later.
+ * For details see linux-kernel-base/COPYING.
+ */
+
+#ifndef _LINUX_DPRIO_H
+#define _LINUX_DPRIO_H
+
+#include 
+#include 
+
+#ifdef CONFIG_DEFERRED_SETPRIO
+
+/*
+ * @mask contains bit-flags indicating which policies have been pre-approved.
+ * Other fields are valid only if the corresponding bit is set in the @mask.
+ */
+static __always_inline void __dprio_info_assumptions(void)
+{
+/* SCHED_xxx is used as a bit index in @mask */
+BUILD_BUG_ON(SCHED_NORMAL > 31);
+BUILD_BUG_ON(SCHED_FIFO > 31);
+BUILD_BUG_ON(SCHED_RR > 31);
+BUILD_BUG_ON(SCHED_BATCH > 31);
+BUILD_BUG_ON(SCHED_IDLE > 31);
+}
+struct dprio_info {
+unsigned mask;
+s32 normal_sched_nice;
+s32 batch_sched_nice;
+u32 fifo_sched_priority;
+u32 rr_sched_priority;
+bool capable_sys_nice;
+};
+
+/*
+ * Called by dup_task_struct to reset non-inherited fields
+ */
+static __always_inline void set_task_in_dprio(struct task_struct *tsk,
+  bool in_dprio)
+{
+#ifdef CONFIG_DEBUG_DEFERRED_SETPRIO
+tsk->in_dprio = in_dprio;
+#endif
+}
+
+static inline void dprio_dup_task_struct(struct task_struct *tsk)
+{
+/* reset deferred setprio fields not inherited from the parent */
+tsk->dprio_ku_area_pp = NULL;
+tsk->dprio_info = NULL;
+set_task_in_dprio(tsk, false);
+}
+
+void dprio_detach(stru

Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-13 Thread Sergey Oboguev
On Sat, Aug 9, 2014 at 6:04 AM, Mike Galbraith  wrote:

> You are not going to convince me that it is cool to assign an imaginary
> priority to a SCHED_FIFO class task, and still call the resulting mutant
> a SCHED_FIFO class task. Those things have defined semantics. It is
> not ok for a SCHED_FIFO task of a lower priority to completely ignore a
> SCHED_FIFO task of a higher priority because it's part of an application
> which has one or more wild cards

I am not quite sure what you were trying to say by this, but I am grateful to
you for expressing your perspective and arguments in this discussion.

For one thing, it stimulated me to perform a systematic review of the solution
space/tree, square inch by square inch, resulting in exhaustive logical
coverage of the solution space and thus an informal logical proof that there
is no uncovered subspace that might hold an as-yet-undiscovered solution.

Just so the results do not remain scattered, I will try to summarize and
categorize the findings about available solutions to the general "critical
section (or critical activity) preemption" problem category, and then draw
a conclusion stemming from this summary.

**

1) Post-preemption solutions.

(Such as priority inheritance or proxy execution, or even humble yield_to).

By the time a post-preemption solution is engaged, a cost has already been
paid, and further cost is paid to execute the solution. Sometimes this
aggregate cost can be small and acceptable within the overall context of an
application, sometimes it can be too high. Sometimes the dependency chain
cannot be expressed, and thus a post-preemption solution is not possible at all
(other than just waiting for the scheduler to eventually resume the blocking
thread and continuing to pay the cost in the meantime).

I won't be further enumerating this subcategory here and will focus instead on
the space of preemption avoidance/deferral solutions.

**

2) Reduction in the incidence of thread preemption.

There are two chief ways to reduce the incidence of thread preemption.

One way is by enlarging the scheduling quantum. Its simplistic use, however,
may well backfire: if a lock holder is preempted while holding a lock (or a
thread executing a non-lock-related critical activity is preempted while
inside this activity), an enlarged scheduling quantum would result in a longer
delay before the thread gets resumed. Therefore this measure, if employed,
should really be coupled with one of the post-preemption solutions.

Another way is to reduce the incidence of thread preemption due to another
thread being woken up. Currently existing schemes of this kind (use of
!WAKEUP_PREEMPTION or SCHED_BATCH) are indiscriminate and victimize the
woken-up thread for the sake of the executing thread regardless of whether the
latter is inside a critical section or not; however, means can be provided to
refine this behavior and victimize the wakee only when the currently executing
thread is actually inside a critical section (and also, when the section is
exited, to switch over to the postponed wakee ASAP).

A variation on this theme is an attempt at mitigation of preemption effects via
LAST_BUDDY.

The extent to which this scheme will be effective is obviously
determined by the following factors.

First, the probability of thread preemption while it is holding a lock, which
in turn is determined by:

(a) Percentage of time a worker thread holds a lock (or executes non-lock
related critical activity). It does not matter too much whether a thread enters
a critical section one time, stays in it for a while and then exits or whether
the thread enters and exits a critical section many times in a row staying
within it each time only a for very short instant. What matters is the
aggregate percentage of execution time a thread spends while inside a critical
section. This yields a rough probability that any random preemption of a
thread (either through natural expiration of the scheduling timeslice or due
to preemption by a wakee) will result in the preemption of a lock holder or
other critical activity.

(b) To an extent, request handler execution time. Very short times (well under
typical scheduling quantum) separated by voluntary yielding (such as when
waiting on the queue to pick up the next processing request) would result in
reduced preemption probability due to end-of-quantum events; however, I won't
consider this special case further here. (It is also not a particularly good
idea to use FIFO ordering for dequeueing of events or requests from a queue by
worker threads, as opposed to LIFO ordering, as it negatively impacts cache
locality and increases context switch probability.)

(c) Reduction in preemption attempt frequency. That's where the impact of the
discussed scheme comes in. With the scheme on, a thread gets chiefly preempted on
quantum expiration, but not intra-quantum in favor of wakee threads. The
product of probabilities (a) and (c) determines how likely the thre

Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-09 Thread Sergey Oboguev
On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith  wrote:

> I see subversion of a perfectly functional and specified mechanism

Just wondering if the following line of thinking would sound just as much like
anathema from your perspective, or perhaps a bit less terrible...

Proceeding from the observations (e.g. https://lkml.org/lkml/2014/8/8/492) that
representative critical section information is not pragmatically expressible at
development time or dynamically collectable by the application at run time, the
option still remains to put the weight of managing such information on the
shoulders of the final link in the chain, the system administrator, providing
him with application-specific guidelines and also with monitoring tools.

It might look approximately like this.

It might be possible to define a scheduling class or some other kind of
scheduling data entity for the tasks utilizing preemption control. The tasks
belonging to this class and having a critical section currently active are
preemptible by RT or DL tasks just like normal threads, however they are
granted a limited and controlled degree of protection against preemption by
normal threads, and also limited ability to urgently preempt normal threads on
a wakeup.

Tasks inside this class may belong to one of several groups somewhat akin to
cgroups (perhaps even implemented as an extension to cgroups).

The properties of a group are:

* Maximum critical section duration (ms). This is not based on the actual
duration of critical sections for the application and may exceed it manyfold.
The purpose is merely to be a safeguard against runaways. If a task stays
inside a critical section longer than the specified time limit, it loses the
protection against preemption and becomes for practical purposes a normal
thread. The group keeps statistics on how often the tasks in the group
overstay in a critical section and exceed the specified limit.

* Percentage of CPU time that members of the group can collectively spend
inside their critical sections over some sampling interval while enjoying the
protection from preemption. This is the critical parameter. If group members
collectively spend a larger share of CPU time in their critical sections,
exceeding the specified limit, they start losing protection from preemption by
normal threads, to keep their protected time within the quota.

For example the administrator may designate that threads in group "MyDB" can
spend no more than 20% of system CPU time combined in the state of being
protected from preemption, while threads in group "MyVideoEncoder" can spend
not more than 10% of system CPU time in preemption-protected state.

If actual aggregate critical-section time spent by threads in all the groups
and also by RT tasks starts pushing some system-wide limit (perhaps
rt_bandwidth), available group percentages are dynamically scaled down, to
reserve some breathing space for normal tasks, and to depress groups in some
proportional way. Scaling down can be either proportional to the group quotas,
or can be controlled by separate scale-down weights.

A monitoring tool can display how often the tasks in the group requesting
protection from preemption are not granted it or lose it because of
overdrafting the group quota. The system administrator may then either choose to
enlarge the group's quota, or leave it be and accept the application running
sub-optimally.

An application can also monitor the statistics on rejection of preemption
protection for its threads (and also actual preemption frequency while inside a
declared critical section state) and, if the rate is high, issue an advisory
message to the administrator.

Furthermore:

Threads within a group need to have relative priorities. There should be a
way for a thread processing a highly critical section to be favored over a
thread processing a medium-significance critical section.

There should also be a way to change a thread's group-relative priority both
from the inside and from the outside of a thread. If thread A queues an
important request for processing by thread B, A should be able to bump B's
group-relative priority.

A thread having a non-zero group-relative priority is considered to be within a
critical section.

If a thread having a non-zero group-relative priority is woken up, it preempts
a normal thread, as long as the group's critical section time usage is within
the group's quota.

The tricky thing is how to correlate priority ranges of different groups. I.e.
suppose there is a thread T1 belonging to group APP1 with group-relative
priority 10 within APP1 and a thread T2 belonging to group APP2 with
group-relative priority 20 within APP2. Which thread is more important and
should run first? Perhaps this can be left to the system administrator, who can
adjust the "base priority" property of a group, thus sliding groups upwards or
downwards relative to each other (or, generally, using some form of intergroup
priority mapping control).

This is not suggest any particul

Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-08 Thread Sergey Oboguev
On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith  wrote:

> task priority cannot be used by any task to describe a critical section.
> I assert that is that there is _zero_ critical section information present.

This appears to be the crux of our disagreement.

This assertion is incorrect. The use of RT to bracket a critical section
obviously _does_ provide the following information:

1) designation of entry point for the start of critical activity

2) designation of exit point, albeit with timing not known in advance at entry
time

3) priority value which embodies a developer's assessment of the importance of
this critical activity relative to other critical activities within the
application or coherent application set, and thus a statement about the cost of
the activity's preemption for this application or application set

What priority protection _does not_ provide is:

1) advance (at entry time) estimate of the duration of the activity

2) the measure of preemption cost in "objective" (uniform) units that would
span across unrelated applications

3) and in units that can be easily stacked against the policy-specified cost of
deferring and penalizing normal tasks for too long (although to an extent, in
the RT use case, this is done by rt_bandwidth)

The implication of (2) and (3) is that one cannot easily have a system
management slider saying "penalize application A for the benefit of application
B (or default undesignated applications) by giving weights so-and-so to their
cost factors"... Well, in a way and to an extent one can do it by remapping
priority ranges for tasks A and B; however, since the priority interface was
not designed for this from the ground up, that would be cumbersome.

The provisioning of this missing information, however, is not realistically
possible outside of the simplest use cases run in a fixed configuration under a
fixed workload, as I elaborated in the previous message. Outside of such
unrealistic "lab" samples, even in the case of the simplest cost function the
estimates of T and X are not pragmatically obtainable. The estimates of Y and
T2 (cut-off point for Y accrual) are likewise hard or even harder to obtain.
Thus even the simplest cost function cannot be pragmatically provided, let
alone any more complex cost function.

The issue is not with inventing some sophisticated cost function and system
interface to plug it in. The issue is that valid input data that the cost
function would need to rely on is not pragmatically/economically obtainable.
(And this issue is not pragmatically solvable through dynamically collecting
such data at run time as well.)

Therefore no pragmatically usable "better" solution in the domain of
preemption-deferral approaches (as opposed to post-preemption approaches)
remains to be discovered.

It stands to reason that the choice in this domain is really, and will
pragmatically remain, between using the long-existing techniques (i.e. priority
protection or schedctl-style preempt delay, the latter having the
limitations I outlined earlier) or not using anything at all.

- Sergey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-06 Thread Sergey Oboguev
On Tue, Aug 5, 2014 at 10:41 PM, Mike Galbraith
 wrote:

>> > SCHED_NORMAL where priority escalation does not work as preemption proofing
>>
>> Remember, DPRIO is not for lock holders only.
>>
>> Using DPRIO within SCHED_NORMAL policy would make sense for an application 
>> that
>> has "soft" time-urgent section where it believes strong protection
>> from preemption
>> is not really necessary, and just a greater claim to CPU time share would do,
>> in cases where the application does not know beforehand if the section will 
>> be
>> short or long, and in majority of cases is short (sub-millisecond), but
>> occasionally can take longer.
>
> Every single time that SCHED_NORMAL task boosts its priority (nice)
> during a preemption, the math has already been done, vruntime has
> already been adjusted.
> Sure, when it gets the CPU back, its usage will
> be weighed differently, it will become more resistant to preemption, but
> in no way immune.  There is nothing remotely deterministic about this,
> making it somewhat of an oxymoron when combined with critical section.

But you overlooked the point I was trying to convey in the paragraph you
are responding to.

Apart from SCHED_NORMAL being a marginal use case, if it is used at all, I do
not see it being used for a lock-holding or similar critical section where an
application wants to avoid preemption.

I can see DPRIO(SCHED_NORMAL) being used in the same cases as an application
would use nice for a temporary section, i.e. when it has a job that needs to be
processed relatively promptly over some time interval but not really
super-urgently and hard guarantees are not needed, i.e. when the application
simply wants to have an improved claim for CPU resources compared to normal
threads over let us say the next half-second or so. It is ok if the application
gets preempted, all it cares about is a longer timeframe ("next half-second")
rather than shorter and immediate timeframe ("next millisecond").

The only reason why anyone would want to use DPRIO instead of regular nice in
this case is that it might be unknown beforehand whether the job will be
short or take a longer time, with the majority of work items being very short
but occasionally taking longer. In this case using DPRIO would cut the
overhead for the majority of section instances. To reiterate, this is a marginal
and most likely rare use case, but given the existence of a uniform interface
I just do not see why it should be blocked on purpose.

> If some kthread prioritizes _itself_ and mucks up application
> performance, file a bug report, that kthread is busted.  Anything a user
> or application does with realtime priorities is on them.

kthreads do not need RT, they just use spinlocks ;-)

On a serious note though, I am certainly not saying that injudicious use of RT
(or even nice) cannot disrupt the system, but is it reason enough to summarily
condemn the judicious use as well?

>> I disagree. The exact problem is that it is not a developer who initiates the
>> preemption, but the kernel or another part of application code that is 
>> unaware
>> of other thread's condition and doing it blindly, lacking the information 
>> about
>> the state of the thread being preempted and the expected cost of its 
>> preemption
>> in this state. DPRIO is a way to communicate this information.

> What DPRIO clearly does NOT do is to describe critical sections to the
> kernel.

First of all let's reflect that your argument is not with DPRIO as such. DPRIO
after all is not a separate scheduling mode, but just a method to reduce the
overhead of regular set_priority calls (i.e. sched_setattr & friends).

Your argument is with the use of elevated priority as such, and you are saying
that using RT priority range (or high nice) does not convey to the kernel the
information about the critical section.

I do not agree with this, not wholly anyway. First of all, it is obvious that
set_priority does convey some information about the section, so perhaps a more
accurate re-formulation of your argument could be that it is imperfect,
insufficient information.

Let's try to imagine then what could make more perfect information. It
obviously should be some cost function describing the cost that would be
incurred if the task gets preempted. Something that would say (if we take the
simplest form) "if you preempt me within the next T microseconds (unless I
cancel or modify this mode), this preemption would incur cost X upfront further
accruing at a rate Y".

One issue I see with this approach is that in real life it might be very hard
for a developer to quantify the values for X, Y and T. Developer can easily
know that he wants to avoid the preemption in a given section, but actually
quantifying the cost of preemption (X, Y) would take a lot of effort
(benchmarking) and furthermore really cannot be assigned statically, as the
cost varies depending on the load pattern and site-specific configuration.
Furthermore, when dealing with multiple 

Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-05 Thread Sergey Oboguev
On Sun, Aug 3, 2014 at 2:56 AM, Mike Galbraith  wrote:

> SCHED_NORMAL where priority escalation does not work as preemption proofing

Remember, DPRIO is not for lock holders only.

Using DPRIO within SCHED_NORMAL policy would make sense for an application that
has "soft" time-urgent section where it believes strong protection
from preemption
is not really necessary, and just a greater claim to CPU time share would do,
in cases where the application does not know beforehand if the section will be
short or long, and in majority of cases is short (sub-millisecond), but
occasionally can take longer.

> what I see is a programmer using a mechanism designed
> to initiate preemption arming his tasks with countermeasures to the very
> thing he initiates.

I disagree. The exact problem is that it is not a developer who initiates the
preemption, but the kernel or another part of application code that is unaware
of other thread's condition and doing it blindly, lacking the information about
the state of the thread being preempted and the expected cost of its preemption
in this state. DPRIO is a way to communicate this information.

> Deferred preempt seems to be what you want, but you
> invented something very different.

"Simple" deferred preempt is one use case.

More complex case is "ranked" deferred preemption, when there are multiple
contending contexts, and there is a need to express relative costs of
victimizing one vs. another.

> I'm just not seeing the beauty in your patch.

Perhaps I might have a dissenting sense of beauty.
But then, it is not only about the beauty (which is subjective anyway), but
even more so about the pudding.

Seriously though, it's really simple: the whole range of available remedies is
divided across post-preemption solutions and preemption-avoidance solutions
(and of course designing app structure for minimizing the contention in the
first place, but for the sake of this discussion we can assume this had been
done to the extent possible). Post-preemption solutions unavoidably incur cost
(and a part of this cost is incurred even before the solution can be engaged).
If this cost can be maintained insignificant for the given use case, great.
However what do you propose to do with those use cases where it cannot? To tell
a developer (or IT manager) "we do not care about your 5% or 20% losses, and if
you do not like it, use another OS that would work better for you"? This would
not sound too productive to me.

This leaves then preemption-avoidance solutions which are about communicating
the cost of the preemption to the kernel in one way or another. DPRIO is one of
such solutions, and I believe is as good or better than the others (I explained
earlier in which cases it is better than schedctl).

(Seemingly transcending this divide may only be the delegation of scheduling
decisions from the kernel to the userspace via mechanisms attempted in a number
of experimental OSes and likely to see the light of day again with the
expected rise of library OSes/unikernels, but at the end of the day this is
still a set of preemption-avoidance vs. post-preemption solutions, just boxed
differently, and utilizing greater application state knowledge bandwidth
available within the single address space.)

I hear your dislike, but I think it is misplaced.

I am unsure about the exact origins of your outlook, but I suspect it might
possibly be arising at least in part from an aspiration for finding a "holy
grail" solution that would cover all the cases (after all, kernel developers
tend to be psychologically perfectionists, at least within this little hobby of
kernel development), and while such an aspiration is admirable per se,
unfortunately this holy grail just does not exist.

Likewise, an aspiration to make sure that everything "mixes and matches" with
absolutely everything else in all cases may be admirable, but in its absolutist
form it is futile, because there are obvious limits to it. Furthermore,
applications that need to rely on preemption avoidance are most often
primary-importance applications for the installation, and as such mixing and
matching with absolutely everything else is not essential and definitely not a
priority compared to letting the primary application run better. Furthermore,
many if not most primary-importance application deployments today are already
single-purposed, and with continued VM sprawl this trend will only grow, and
with it the emphasis on letting a primary application or a practical
well-defined application set run better over absolutist arbitrary mixing and
matching. This is of course not to deny the importance of mix-and-match
environments such as desktop systems or personal devices, but it is important
to recognize the difference in requirements and priorities between the
environments.

- Sergey

Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-05 Thread Sergey Oboguev
On Sun, Aug 3, 2014 at 10:30 AM, Andi Kleen  wrote:
>> (One example might be virtual machine that runs guest operating system that 
>> is
>> not paravirtualized or can be paravirtualized only to a limited extent. The 
>> VM
>> might guess that preemption of VCPU thread that is processing events such as
>> IPI interrupts, clock interrupts or certain device interrupts is likely to
>> cause overall performance degradation due to other VCPUs spinning for an IPI
>> response or spinning waiting for a spinlock, and though the general kinds of
>> these dependencies may be foreseen, actual dependency chains between 
>> VCPUs
>> cannot be established at run time.)
>
> PAUSE loop exiting (PLE) can handle it in limited fashion on VMs
> even without paravirtualization.


The key word is "limited". Sometimes the use of PLE can help, to a limited
extent indeed (see below), and sometimes it can even be counter-productive,
which in turn can also be mitigated (http://lwn.net/Articles/424960/), but
again only in a limited fashion.

This indeed is similar to the spinlock example discussed in the previous
message. Similarly to PI/PE, the use of PLE and other spin-loop detection
techniques is about trying to minimize the cost of after-inopportune-preemption
actions to whatever (unavoidably limited, that's the key
thing) extent they can be minimized, whereas priority protection is about the
avoidance of incurring these costs in the first place.

And then, x86 VMs are just one use case, whereas there are others with no PLE or
similar recourse. DPRIO as a matter of fact happened to be conceived within
the context of a project running a legacy non-x86 OS that does not have PAUSE
instructions in its spin loops, and the loops are scattered throughout the
pre-existing binary code in huge numbers and are not pragmatically
instrumentable. But this just exemplifies a general pattern of having to deal
with concurrency in a component being driven from or driving another component
that is for practical purposes "set in stone" (3rd party, legacy, out of the
scope of the project etc.)

- Sergey


Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-05 Thread Sergey Oboguev
On Sun, Aug 3, 2014 at 1:30 AM, Pavel Machek  wrote:

> it seems to be a security issue to me.
> If root renices the application to high nice value, application should
> not be able to work around it by the DPRIO interface.

There is no such issue.

Since 2.6.12, Linux does allow a task that had been renice'd to increase its
priority back as long as it is within RLIMIT_NICE. See the man page for nice(2)
and the reference there under EPERM to RLIMIT_NICE. Or sys_nice(...) and
can_nice(...) in kernel/sched/core.c.

If the administrator wants to clamp down the task in a way it would be unable
to come back, he should change the task's rlimit for RLIMIT_NICE.

DPRIO does honor the change in RLIMIT_NICE (as well as in RLIMIT_RTPRIO) and
won't let a task cross those limits.

> You mean "we rely on applications handling the situation they can't and will
> not handle"?

Not really.

There are two different cases to be considered.

One is when an application's thread priority is changed from inside the
application, in a coordinated code, and the code is specifically set up to
handle asynchronous thread priority changes.

(For example, if the application is a VM, and a virtual device sends an
interrupt to a VCPU, the device handler may want to bump up the VCPU thread
priority so the interrupt gets processed promptly. When VCPU notices and
dequeues the sent interrupt, it reevaluates the thread priority based on the
totality of synchronous intra-VCPU conditions and currently visible
asynchronous conditions such as a set of pending interrupts and sets new thread
priority accordingly.)

Another case is when an application's thread priority is changed from the
outside of the application in an arbitrary way. There is no radical difference
in this case between DPRIO and regular set_priority.

Such an external priority change can indeed be disruptive for an application,
but it is disruptive for an application that uses regular set_priority as well.
Suppose the thread was running some critical tasks and/or holding some critical
locks, and used regular set_priority to that end, and then was knocked down.
This would be disruptive for an application using regular set_priority just as
it would be for one using DPRIO. The exact mechanics of the disruption would be
somewhat different, but the disruption would be present in both cases.

Likewise, an application has means for recovery both in regular set_priority
and DPRIO cases. In the case of an application using regular set_priority the
recovery will automatically happen on the next set_priority call. In the DPRIO
case it may take a bunch of dprio_set calls, but given that they are meant to be
used in a high-frequency invocation case, the recovery is likely to happen pretty
fast as well, and after a certain number of cycles the "writeback" priority
change cached in the userspace is likely to get "written through" to the
kernel, albeit this process is somewhat "stochastic" and sequences can be
constructed when it won't be for quite a while. If the application wished to
give it some guaranteed predictability, it could use dprio_setnow(prio,
DPRIO_FORCE) in every N-th invocation instead of dprio_set(prio).

Nevertheless this indirection is one reason why I do not think making regular
set_priority a wrapper around DPRIO is a good idea. It would strip the
application developer of direct control, unless the DPRIO_FORCE flag was used,
but then it becomes regular set_priority again.

Another reason is that error reporting in DPRIO case is delayed (e.g. via a
callback in DPRIO library implementation), and that's different from the
semantics of regular set_priority interface.

To summarize, regular set_priority (e.g. sched_setattr) defines an interface
that is immediate (synchronous) and uncached, whereas DPRIO is deferred
(asynchronous) and cached. The semantics are too different for the former to be
wrapped around the latter without distortion.

- Sergey


Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-02 Thread Sergey Oboguev
On Wed, Jul 30, 2014 at 6:02 AM, Pavel Machek  wrote:
> Hi!
>
>> One of the intended purposes of this facility (but its not sole purpose) is 
>> to
>> render a lightweight mechanism for priority protection of lock-holding 
>> critical
>> sections that would be an adequate match for lightweight locking primitives
>> such as futex, with both featuring a fast path completing within the
>> userspace.

> Do we get a manpage describing the interface...?

At this point it is just an RFC, and the only available write-up is an article
(URL is in the original message).

There has been no word from the maintainers yet whether the proposal appears
to be a "go" or "no-go" in general.

Assuming the response was favorable, final polishing on the patch can be done
(such as perhaps replacing or augmenting the authlist with a capability), and a
man page can be added at that point as well.

> Would it make sense to make set_priority a "vsyscall" so it is fast enough,
> and delayed_set_priority does not need to be exposed to userspace?

Regular "set priority" cannot be wrapped around "deferred set priority".

"Deferred set priority" acts only on current thread.

Even more importantly, its use also relies on a thread caching the application's
current knowledge of the thread priority in the userspace, and if the thread
priority had been changed from the outside of the application (or even by
another thread within the same application), this knowledge becomes invalid,
and then the application is responsible for performing whatever recovery action
is appropriate.

Thus DPRIO is not a replacement for fully-functional "set priority",
but rather a
specialized tool for certain use cases.

- Sergey


Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-08-02 Thread Sergey Oboguev
On Mon, Jul 28, 2014 at 12:24 AM, Mike Galbraith
 wrote:
> On Sun, 2014-07-27 at 18:19 -0700, Andi Kleen wrote:
>> Sergey Oboguev  writes:
>>
>> > [This is a repost of the message from few day ago, with patch file
>> > inline instead of being pointed by the URL.]
>>
>> Have you checked out the preemption control that was posted some time
>> ago? It did essentially the same thing, but somewhat simpler than your
>> patch.
>>
>> http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html
>>
>> Yes I agree with you that lock preemption is a serious issue that needs 
>> solving.
>
> Yeah, it's a problem, and well known.
>
> One mitigation mechanism that exists in the stock kernel today is the
> LAST_BUDDY scheduler feature.  That took pgsql benchmarks from "shite"
> to "shiny", and specifically targeted this issue.
>
> Another is SCHED_BATCH, which can solve a lot of the lock problem by
> eliminating wakeup preemption within an application.  One could also
> create an extended batch class which is not only immune from other
> SCHED_BATCH and/or SCHED_IDLE tasks, but all SCHED_NORMAL wakeup
> preemption.  Trouble is that killing wakeup preemption precludes very
> fast very light tasks competing with hogs for CPU time.  If your load
> depends upon these performing well, you have a problem.
>
> Mechanism #3 is use of realtime scheduler classes.  This one isn't
> really a mitigation mechanism, it's more like donning a super suit.
>
> So three mechanisms exist, the third being supremely effective, but high
> frequency usage is expensive, ergo huge patch.
>
> The lock holder preemption problem being identical to the problem RT
> faces with kernel locks...
>
> A lazy preempt implementation ala RT wouldn't have the SCHED_BATCH
> problem, but would have a problem in that should critical sections not
> be as tiny as it should be, every time you dodge preemption you're
> fighting the fair engine, may pay heavily in terms of scheduling
> latency.  Not a big hairy deal, if it hurts, don't do that.  Bigger
> issue is that you have to pop into the kernel on lock acquisition and
> release to avoid jabbering with the kernel via some public phone.
> Popping into the kernel, if say some futex were victimized, also erases
> the "f" in futex, and restricting cost to consumer won't be any easier.
>
> The difference wrt cost acceptability is that the RT issue is not a
> corner case, it's core issue resulting from the nature of the RT beast
> itself, so the feature not being free is less annoying.  A corner case
> fix OTOH should not impact the general case at all.


When reasoning about concurrency management it may be helpful to keep in mind
the fundamental perspective that the problem space and solution space in this
area are fragmented -- just as your message exemplifies as well, but it also
applies across the board to all other solution techniques. There is no
all-unifying solution that works in all use cases and for all purposes.

This applies even to seemingly well-defined problems such as lock holder
preemption avoidance.

One of the divisions on a broad scale is between cases when wait/dependency
chain can be made explicit at run time (e.g. application/system can tell that
thread A is waiting on thread B which is waiting on thread C) and those cases
when the dependency cannot be explicitly expressed and information on the
dependency is lacking.

In the former case some form of priority inheritance or proxy execution might
often (but not always) work well.

In the latter case PI/PE cannot be applied. This case occurs when a component
needs to implement time-urgent section acting on behalf of (e.g. in response to
an event in) another component in the application and the latter is not (or
often cannot practically be) instrumented with the code that expresses its
inner dependencies and thus the information about the dependencies is lacking.

(One example might be virtual machine that runs guest operating system that is
not paravirtualized or can be paravirtualized only to a limited extent. The VM
might guess that preemption of VCPU thread that is processing events such as
IPI interrupts, clock interrupts or certain device interrupts is likely to
cause overall performance degradation due to other VCPUs spinning for an IPI
response or spinning waiting for a spinlock, and though the general kinds of
these dependencies may be foreseen, actual dependency chains between VCPUs
cannot be established at run time.)

In cases when dependency information is lacking, priority protection remains
the only effective recourse.

Furthermore, even in cases when dependency information is available, PI/PE per
se is not always a satisfactory or sufficient solution. Consider for 

Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-07-27 Thread Sergey Oboguev
On Sun, Jul 27, 2014 at 6:19 PM, Andi Kleen  wrote:
>> [This is a repost of the message from few day ago, with patch file
>> inline instead of being pointed by the URL.]
>
> Have you checked out the preemption control that was posted some time
> ago? It did essentially the same thing, but somewhat simpler than your
> patch.
>
> http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html

Yes, I have seen this discussion. The patch suggested by Khalid implements a
solution very much resembling Solaris/AIX schedctl. Schedctl is less generic
and powerful than dprio. I compared dprio vs. schedctl in the write-up
https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt

To quote from there,

[--- Quote ---]

The Solaris schedctl [...]
does not provide a way to associate a priority with the resource
whose lock is being held (or, more generally, with thread application-specific
logical state; see the footnote below). An application is likely to have a
range of locks with different criticality levels and different needs for
holder protection [*]. For some locks, holder preemption may be tolerated
somewhat, while other locks are highly critical, furthermore for some lock
holders preemption by a high-priority thread is acceptable but not a preemption
by a low-priority thread. The Solaris/AIX schedctl does not provide a
capability for priority ranging relative to the context of the whole
application and other processes in the system.

[*] We refer just to locks here for simplicity, but the need of a thread
for preemption control does not reduce to locks held alone, and may
result from other intra-application state conditions, such as executing
a time-urgent fragment of code in response to a high-priority event
(that may potentially be blocking for other threads) or other code
paths that can lead to wait chains unless completed promptly.

Second, in some cases application may need to perform time-urgent processing
without knowing in advance how long it will take. In the majority of cases the
processing may be very short (a fraction of a scheduling timeslice), but
occasionally may take much longer (such as a fraction of a second). Since
schedctl would not be effective in the latter case, an application would have
to resort to system calls for thread priority control in all cases [*], even
in the majority of "short processing" cases, with all the overhead of this
approach.

[*] Or introduce extra complexity, most likely very cumbersome, by trying
to gauge and monitor the accumulated duration of the processing, with
the intention to transition from schedctl to thread priority elevation
once a threshold has been reached.

[--- End of quote ---]

Even so, I felt somewhat puzzled by the response to Khalid's
delay-preempt patch.
While some arguments put forth against it were certainly valid in their own
right, their focus seemed to be that the solution won't interoperate well with
all conceivable setups and application mixes, won't solve all the concurrency
issues, and worst of all won't slice bread either. Whereas my
perception (perhaps incorrectly) was that this patch was not meant to solve a
whole range of problems and to be a feature enabled by default in a generic
system, but rather a specialized feature configurable in special-purpose
systems (e.g. database servers, Khalid was doing it for Oracle, and his JVM
use case I believe is also in this context) dedicated to running a
primary-importance application that utilizes this mechanism and meant to solve
a very particular problem of this specific category of system deployment cases.
It appeared to me that the participants in the delay-preempt patch discussion
might have had different ideas of the implied use scope of the suggested
feature, and that might have influenced the direction of the discussion.

- Sergey


Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-07-27 Thread Sergey Oboguev
On Sat, Jul 26, 2014 at 9:02 PM, Mike Galbraith
 wrote:
> On Sat, 2014-07-26 at 11:30 -0700, Sergey Oboguev wrote:
>> On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith
>>  wrote:
>> > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote:
>> >> [This is a repost of the message from few day ago, with patch file
>> >> inline instead of being pointed by the URL.]
>> >>
>> >> This patch is intended to improve the support for fine-grain parallel
>> >> applications that may sometimes need to change the priority of their 
>> >> threads at
>> >> a very high rate, hundreds or even thousands of times per scheduling 
>> >> timeslice.
>> >>
>> >> These are typically applications that have to execute short or very short
>> >> lock-holding critical or otherwise time-urgent sections of code at a very 
>> >> high
>> >> frequency and need to protect these sections with "set priority" system 
>> >> calls,
>> >> one "set priority" call to elevate current thread priority before 
>> >> entering the
>> >> critical or time-urgent section, followed by another call to downgrade 
>> >> thread
>> >> priority at the completion of the section. Due to the high frequency of
>> >> entering and leaving critical or time-urgent sections, the cost of these 
>> >> "set
>> >> priority" system calls may raise to a noticeable part of an application's
>> >> overall expended CPU time. Proposed "deferred set priority" facility 
>> >> allows to
>> >> largely eliminate the cost of these system calls.
>> >
>> > So you essentially want to ship preempt_disable() off to userspace?
>> >
>>
>> Only to the extent preemption control is already exported to the userspace 
>> and
>> a task is already authorized to control its preemption by its RLIMIT_RTPRIO,
>> RLIMIT_NICE and capable(CAP_SYS_NICE).
>>
>> DPRIO does not amplify a task's capability to elevate its priority and block
>> other tasks, it just reduces the computational cost of frequent
>> sched_setattr(2) calls.

> You are abusing realtime

I am unsure why you would label priority ceiling for locks and priority
protection for other forms of time-urgent sections as an "abuse".

It would appear you start from a presumption that the sole valid purpose for
ranging task priorities should ever be only hard real-time applications such as
plant process control etc., but that's not a valid or provable presumption, but
rather an article of faith -- a faith, as you acknowledge, a lot of developers
do not share, and a rational argument to the contrary of this faith is that
there are no all-fitting satisfactory and practical alternative solutions to
the problems that are being solved with those tools, that's the key reason why
they are used. The issue then distills to a more basic question of whether this
faith should be imposed on the dissenting application developers, and whether
Linux should provide a mechanism or a policy.

As for DPRIO specifically, while it may somewhat encourage the use of priority
ceiling and priority protection, it does not provide an additional basic
mechanism beyond one already exported by the kernel (i.e. "set priority"), it
just makes this pre-existing basic mechanism cheaper to use in certain use
cases.

> if what you want/need is a privileged userspace lock

The problem is not reducible to locks. Applications also have time-urgent
critical sections that arise from wait and interaction chains not expressible
via locking notation.

> you could make a flavor of futex that makes the owner non-preemptible

Lock owner should definitely be preemptible by more time-urgent tasks.

> it's not like multiple users could coexist peacefully anyway

It depends. Common sense suggests not to run an air traffic control system on
the same machine as an airline CRM database system, but perhaps one might
co-host CRM and ERP database instances on the same machine.

Indeed, applications that are installed with rights granting them access to
elevated priority are generally those that are important for the purpose of
the system they are deployed on. The machine they are installed on may either
be dedicated to running this particular application, or it may be used for
running a set of primary-importance applications that can coexist.

As an obvious rule of thumb, applications using elevated priorities for the
sake of deterministic response time should not be combined "on equal footing"
with non-deterministic applications using elevated priorities for the reasons
of better overall system performance.

Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-07-26 Thread Sergey Oboguev
On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith wrote:
> On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote:
>> [This is a repost of the message from a few days ago, with the patch file
>> inline instead of being pointed to by a URL.]
>>
>> This patch is intended to improve the support for fine-grain parallel
>> applications that may sometimes need to change the priority of their
>> threads at a very high rate, hundreds or even thousands of times per
>> scheduling timeslice.
>>
>> These are typically applications that have to execute short or very short
>> lock-holding critical or otherwise time-urgent sections of code at a very
>> high frequency and need to protect these sections with "set priority"
>> system calls, one "set priority" call to elevate current thread priority
>> before entering the critical or time-urgent section, followed by another
>> call to downgrade thread priority at the completion of the section. Due to
>> the high frequency of entering and leaving critical or time-urgent
>> sections, the cost of these "set priority" system calls may rise to a
>> noticeable part of an application's overall expended CPU time. The
>> proposed "deferred set priority" facility allows one to largely eliminate
>> the cost of these system calls.
>
> So you essentially want to ship preempt_disable() off to userspace?
>

Only to the extent preemption control is already exported to the userspace and
a task is already authorized to control its preemption by its RLIMIT_RTPRIO,
RLIMIT_NICE and capable(CAP_SYS_NICE).

DPRIO does not amplify a task's capability to elevate its priority and block
other tasks, it just reduces the computational cost of frequent
sched_setattr(2) calls.
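For comparison, the conventional pattern whose per-call cost DPRIO amortizes
looks roughly like the sketch below. This is illustrative only: the struct
(here locally named k_sched_attr to avoid clashing with newer glibc headers)
mirrors the kernel's sched_attr layout, the raw syscall is used because glibc
historically shipped no sched_setattr() wrapper, and an unprivileged caller
without CAP_SYS_NICE / a suitable RLIMIT_RTPRIO will get EPERM for realtime
policies.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copy of the kernel's struct sched_attr layout. */
struct k_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;        /* for SCHED_OTHER / SCHED_BATCH */
	uint32_t sched_priority;    /* for SCHED_FIFO / SCHED_RR */
	uint64_t sched_runtime;     /* for SCHED_DEADLINE */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

/* Raw syscall wrapper: glibc historically provided none. */
static int sched_setattr_sys(pid_t pid, struct k_sched_attr *attr,
			     unsigned int flags)
{
	return (int) syscall(SYS_sched_setattr, pid, attr, flags);
}

/* Switch the calling thread to the given policy/priority.
 * Returns 0 on success, or errno on failure (typically EPERM when the
 * caller is not authorized to use a realtime policy). */
static int set_thread_sched(uint32_t policy, uint32_t prio)
{
	struct k_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = policy;
	attr.sched_priority = prio;
	return sched_setattr_sys(0, &attr, 0) == 0 ? 0 : errno;
}
```

With this pattern every critical section is bracketed by two such calls,
e.g. set_thread_sched(SCHED_FIFO, prio) on entry and
set_thread_sched(SCHED_OTHER, 0) on exit, so at a high section rate the
syscall cost dominates the cost of the section itself.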

>
> -Mike
>
>> Instead of executing a system call to elevate its thread priority, an
>> application simply writes its desired priority level to a designated memory
>> location in the userspace. When the kernel attempts to preempt the thread...

- Sergey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] sched: deferred set priority (dprio)

2014-07-26 Thread Sergey Oboguev
On Fri, Jul 25, 2014 at 1:12 PM, Andy Lutomirski  wrote:
> On 07/25/2014 12:45 PM, Sergey Oboguev wrote:
>> [This is a repost of the message from a few days ago, with the patch file
>> inline instead of being pointed to by a URL.]
>>
>> This patch is intended to improve the support for fine-grain parallel
>> applications that may sometimes need to change the priority of their
>> threads at a very high rate, hundreds or even thousands of times per
>> scheduling timeslice.
>
> What is the authlist stuff for?  It looks complicated and it seems like
> it should be unnecessary.

The authlist controls who is allowed to use the DPRIO facility.

DPRIO works by making a check at __schedule() time for whether a deferred set
priority request is pending, and if so, then performing an equivalent of
sched_setattr(), minus security checks, before the rescheduling.

This introduces additional latency at task switch time when a deferred set
priority request is pending -- normally a very small latency, but a non-zero
one, and a malicious user can also manoeuvre into increasing it. There are two
parts to this latency. One is a more or less constant part on the order of
1 usec: the basic __sched_setscheduler() code. The other part is due to
possible priority inheritance chain adjustment in rt_mutex_adjust_pi() and
depends on the length of the chain. A malicious user might conceivably
construct a chain long enough for its processing at __schedule() time to cause
an objectionable latency. On systems where this might be a concern, an
administrator may therefore want to restrict the use of DPRIO to legitimate
trusted applications (or rather the users of those).
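The chain-length concern can be pictured with a toy model. The structures and
function below are hypothetical stand-ins, not the kernel's code (the real
work happens in the rt_mutex priority-inheritance machinery); the point is
only that the adjustment work grows linearly with the length of the blocking
chain.

```c
#include <stddef.h>

/* Toy model: each blocked task waits on a lock owned by the next task. */
struct toy_task {
	int prio;                           /* lower number = higher priority */
	struct toy_task *blocked_on_owner;  /* NULL if not blocked */
};

/* Propagate a boosted priority down the chain, counting the steps the
 * kernel would have to take before rescheduling can proceed. */
static int propagate_boost(struct toy_task *t, int new_prio)
{
	int steps = 0;

	for (; t != NULL; t = t->blocked_on_owner) {
		if (t->prio <= new_prio)    /* already at least this urgent */
			break;
		t->prio = new_prio;
		steps++;
	}
	return steps;
}
```

A chain of N waiters costs N adjustment steps on the first boost and nothing
thereafter, which is why an artificially long chain could inflate the
latency of handling a pending dprio request.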

If the common feeling is that such a safeguard is overly paranoid, I would be
happy to drop it, but I feel some security control option may be desirable
for the mentioned reason.

Rethinking it now though, perhaps a simpler alternative could be adding a
capability for DPRIO plus a system-wide setting as to whether this capability
is required for the use of DPRIO or DPRIO is allowed for everyone on the
system. The benefit of this approach might also be that the administrator can
use setcap on trusted executable files installed in the system to grant them
this capability.

> What's with the save/restore code?  What's it saving and restoring?

It is used inside execve.

There are two DPRIO-related elements stored in task_struct (plus one more for
the debug code only).

One is the address of a location in the userspace that is used for
userspace <-> kernel communication.

Another element stored in task_struct is the pointer to per-task DPRIO
information block kept inside the kernel. This block holds pre-authorized
priorities.

(The address of the userspace location is stored in task_struct itself, rather
than in the DPRIO info block, to avoid extra indirection in the frequent path.)

Successful execve must result in shutting down DPRIO for the task, i.e.
resetting these two pointers to NULL (this must be performed before the new
executable image is loaded, otherwise a corruption of the new image's memory
can result) and also deallocating the DPRIO information block. If execve
fails, however, and control returns to the original image, DPRIO settings
should be retained.

Before calling the image loader, execve (i.e. do_execve_common) invokes
dprio_save_reset_context() to save these two pointers in an on-stack backup
structure and to reset their values in task_struct to NULL. If the new image
loads fine, the DPRIO information block is deallocated by dprio_free_context()
and control is passed on to the new image, with the pointers in task_struct
already reset to NULL. If image loading fails, the error recovery path invokes
dprio_restore_context() to restore the pointers from the backup structure back
into task_struct.

- Sergey


[PATCH RFC] sched: deferred set priority (dprio)

2014-07-25 Thread Sergey Oboguev
[This is a repost of the message from a few days ago, with the patch file
inline instead of being pointed to by a URL.]

This patch is intended to improve the support for fine-grain parallel
applications that may sometimes need to change the priority of their threads at
a very high rate, hundreds or even thousands of times per scheduling timeslice.

These are typically applications that have to execute short or very short
lock-holding critical or otherwise time-urgent sections of code at a very high
frequency and need to protect these sections with "set priority" system calls,
one "set priority" call to elevate current thread priority before entering the
critical or time-urgent section, followed by another call to downgrade thread
priority at the completion of the section. Due to the high frequency of
entering and leaving critical or time-urgent sections, the cost of these "set
priority" system calls may rise to a noticeable part of an application's
overall expended CPU time. The proposed "deferred set priority" facility
allows one to largely eliminate the cost of these system calls.

Instead of executing a system call to elevate its thread priority, an
application simply writes its desired priority level to a designated memory
location in the userspace. When the kernel attempts to preempt the thread, it
first checks the content of this location, and if the application's stated
request to change its priority has been posted in the designated memory area,
the kernel will execute this request and alter the priority of the thread being
preempted before performing a rescheduling, and then make a scheduling decision
based on the new thread priority level thus implementing the priority
protection of the critical or time-urgent section desired by the application.
In a predominant number of cases however, an application will complete the
critical section before the end of the current timeslice and cancel or alter
the request held in the userspace area. Thus a vast majority of an
application's change priority requests will be handled and mutually cancelled
or coalesced within the userspace, at a very low overhead and without incurring
the cost of a system call, while maintaining safe preemption control. The cost
of an actual kernel-level "set priority" operation is incurred only if an
application is actually being preempted while inside the critical section, i.e.
typically at most once per scheduling timeslice instead of hundreds or
thousands of "set priority" system calls in the same timeslice.
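The coalescing effect described above can be illustrated with a purely
userspace simulation (all names here are hypothetical; in the real patch the
request word lives at a location registered with the kernel, and the check
happens inside __schedule() rather than in an ordinary function call):

```c
#include <stdint.h>

/* Word the application shares with the "kernel": the priority it wants
 * to hold right now. 0 means no elevation requested. */
static volatile uint32_t dprio_request;

/* Count of deferred requests that actually reached the "kernel". */
static unsigned long kernel_setprio_calls;

/* Application side: posting/cancelling a request is a plain memory write. */
static void enter_urgent_section(uint32_t prio) { dprio_request = prio; }
static void leave_urgent_section(void)          { dprio_request = 0; }

/* "Kernel" side: what would run at preemption time. Only a pending
 * non-zero request costs a real set-priority operation. */
static void simulate_preemption(void)
{
	if (dprio_request != 0)
		kernel_setprio_calls++;  /* stand-in for __sched_setscheduler() */
}

/* Run n critical sections, with a preemption landing once every
 * 'preempt_every' sections; return how many set-priority operations
 * the "kernel" actually had to perform. */
static unsigned long run_sections(unsigned long n, unsigned long preempt_every)
{
	kernel_setprio_calls = 0;
	for (unsigned long i = 1; i <= n; i++) {
		enter_urgent_section(50);
		if (i % preempt_every == 0)
			simulate_preemption();  /* preempted inside the section */
		leave_urgent_section();
	}
	return kernel_setprio_calls;
}
```

With the syscall pattern, 10000 sections would cost 20000 system calls; here
the kernel-side work scales with the number of preemptions that land inside a
section, not with the number of sections.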

One of the intended purposes of this facility (but not its sole purpose) is to
render a lightweight mechanism for priority protection of lock-holding critical
sections that would be an adequate match for lightweight locking primitives
such as futex, with both featuring a fast path completing within the userspace.

More detailed description can be found in:
https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt

The patch is currently based on linux-3.15.2.

User-level library implementing userspace-side boilerplate code:
https://github.com/oboguev/dprio/tree/master/src/userlib

Test set:
https://github.com/oboguev/dprio/tree/master/src/test

The patch is enabled with CONFIG_DEFERRED_SETPRIO.
There are also a few other config settings: a setting for dprio debug code,
a setting that controls the initial value for the authorization list
restricting the use of the facility based on user or group ids, and a setting
to improve the determinism of the rescheduling latency when a dprio request is
pending under low-memory conditions. Please see dprio.txt for details.

Comments would be appreciated.

Thanks,
Sergey

Signed-off-by: Sergey Oboguev 
---
 fs/exec.c  |8 +
 fs/proc/Makefile   |1 +
 fs/proc/authlist.c |  493 +
 include/linux/authlist.h   |  114 +++
 include/linux/dprio.h  |  130 
 include/linux/init_task.h  |   17 +
 include/linux/sched.h  |   15 +
 include/uapi/linux/prctl.h |2 +
 init/Kconfig   |2 +
 kernel/Kconfig.dprio   |   81 +
 kernel/exit.c  |6 +
 kernel/fork.c  |   87 +-
 kernel/sched/Makefile  |1 +
 kernel/sched/core.c|  200 +++-
 kernel/sched/dprio.c   |  734 
 kernel/sys.c   |6 +
 kernel/sysctl.c|   11 +
 17 files changed, 1897 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 238b7aa..9f5b649 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include <linux/dprio.h>

 #include 
 #include 
@@ -1433,6 +1434,7 @@ static int do_execve_common(struct filename *filename,
struct file *file;
struct files_struct *displaced;
int retval;
+   struct dprio_saved_context dprio_context;

if (IS_ERR(filename))
return PTR_ERR(filename);

[PATCH RFC] sched: deferred set priority (dprio)

2014-07-21 Thread Sergey Oboguev
This patch is intended to improve the support for fine-grain parallel
applications that may sometimes need to change the priority of their threads at
a very high rate, hundreds or even thousands of times per scheduling timeslice.

These are typically applications that have to execute short or very short
lock-holding critical or otherwise time-urgent sections of code at a very high
frequency and need to protect these sections with "set priority" system calls,
one "set priority" call to elevate current thread priority before entering the
critical or time-urgent section, followed by another call to downgrade thread
priority at the completion of the section. Due to the high frequency of
entering and leaving critical or time-urgent sections, the cost of these "set
priority" system calls may rise to a noticeable part of an application's
overall expended CPU time. The proposed "deferred set priority" facility
allows one to largely eliminate the cost of these system calls.

Instead of executing a system call to elevate its thread priority, an
application simply writes its desired priority level to a designated memory
location in the userspace. When the kernel attempts to preempt the thread, it
first checks the content of this location, and if the application's stated
request to change its priority has been posted in the designated memory area,
the kernel will execute this request and alter the priority of the thread being
preempted before performing a rescheduling, and then make a scheduling decision
based on the new thread priority level thus implementing the priority
protection of the critical or time-urgent section desired by the application.
In a predominant number of cases however, an application will complete the
critical section before the end of the current timeslice and cancel or alter
the request held in the userspace area. Thus a vast majority of an
application's change priority requests will be handled and mutually cancelled
or coalesced within the userspace, at a very low overhead and without incurring
the cost of a system call, while maintaining safe preemption control. The cost
of an actual kernel-level "set priority" operation is incurred only if an
application is actually being preempted while inside the critical section, i.e.
typically at most once per scheduling timeslice instead of hundreds or
thousands of "set priority" system calls in the same timeslice.

One of the intended purposes of this facility (but not its sole purpose) is to
render a lightweight mechanism for priority protection of lock-holding critical
sections that would be an adequate match for lightweight locking primitives
such as futex, with both featuring a fast path completing within the userspace.

More detailed description can be found in:
https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt

The patch is currently based on 3.15.2.

Patch file:
https://github.com/oboguev/dprio/blob/master/patch/linux-3.15.2-dprio.patch
https://raw.githubusercontent.com/oboguev/dprio/master/patch/linux-3.15.2-dprio.patch

Modified source files:
https://github.com/oboguev/dprio/tree/master/src/linux-3.15.2

User-level library implementing userspace-side boilerplate code:
https://github.com/oboguev/dprio/tree/master/src/userlib

Test set:
https://github.com/oboguev/dprio/tree/master/src/test

The patch is enabled with CONFIG_DEFERRED_SETPRIO.

There is also a config setting for the debug code and a setting that controls
the initial value of the authorization list restricting the use of the
facility based on user or group ids. Please see dprio.txt for details.

Comments would be appreciated.

Thanks,
Sergey

Signed-off-by: Sergey Oboguev 