Re: [PATCH -v2 0/7] New RT Task Balancing -v2

2007-10-23 Thread Ingo Molnar

* Steven Rostedt [EMAIL PROTECTED] wrote:

   Changes since V1:
 Updated to git tree 55b70a0300b873c0ec7ea6e33752af56f41250ce
 
 Various clean ups suggested by Gregory Haskins, Dmitry Adamushko,
 and Peter Zijlstra.

ok, i like this new approach - nice work! I'd suggest we test it in -rt 
for some time and then merge it into v2.6.25?

Ingo


[PATCH -v2 0/7] New RT Task Balancing -v2

2007-10-22 Thread Steven Rostedt
[
  Changes since V1:
Updated to git tree 55b70a0300b873c0ec7ea6e33752af56f41250ce

Various clean ups suggested by Gregory Haskins, Dmitry Adamushko,
and Peter Zijlstra.

Biggest change was recommended by Ingo Molnar. This is the use of cpusets
for keeping track of RT overloaded CPUs.  When CONFIG_CPUSETS is not
defined, we have a single global rt_overload_mask that keeps track
of the runqueues with more than one RT task queued. When CONFIG_CPUSETS
is configured, that bitmask is stored in the cpusets (a rough sketch
of this bookkeeping follows the bracketed note).

Note that in the case of overlapping cpusets it is possible to have
inconsistent data between the bitmask and the actual RT overloaded
runqueues. The worst that can happen is that a task doesn't get moved
quickly over to a CPU it can run on. But this is the price we pay to
keep from dirtying caches on boxes with a large number of CPUs. If this
does happen, it gets cleaned up rather quickly, since there are checks
for RT overload bits being set when they shouldn't be.

For most systems this is not an issue since a single cpuset is used.
]
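
To picture the bookkeeping described in the bracketed note, here is a
minimal sketch of the non-CONFIG_CPUSETS case. The helper names and their
placement are illustrative assumptions, not the actual patch code:

  /* Rough sketch only -- names are illustrative, not from the patch.
   * Without CONFIG_CPUSETS, a single global mask tracks which
   * runqueues have more than one RT task queued. */
  #include <linux/cpumask.h>

  static cpumask_t rt_overload_mask;

  /* The runqueue for 'cpu' now has a second RT task queued. */
  static inline void rt_set_overload(int cpu)
  {
          cpu_set(cpu, rt_overload_mask);    /* atomic set_bit() */
  }

  /* The runqueue for 'cpu' is back to at most one queued RT task. */
  static inline void rt_clear_overload(int cpu)
  {
          cpu_clear(cpu, rt_overload_mask);  /* atomic clear_bit() */
  }

  /* With CONFIG_CPUSETS, the same bitmask would instead live in the
   * cpuset, so only CPUs sharing a cpuset touch the same cache line. */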

Currently in mainline the balancing of multiple RT threads is quite broken.
That is to say, a high priority thread that wakes up on a CPU already
running an even higher priority thread may wait unnecessarily, even though
it could easily run on another CPU that's running a lower priority thread.

Balancing (or migrating) tasks in general is an art. Lots of considerations
must be taken into account: cache lines, NUMA, and more. This is true for
general processes, which expect high throughput and whose migration can be
done in batch. But when it comes to RT tasks, we really need to put them
off to a CPU that they can run on as soon as possible, even if it means a
bit of cache line flushing.

Right now an RT task can wait several milliseconds, and perhaps even
longer, before it gets scheduled to run. The migration thread is simply
not fast enough to take care of RT tasks.

To demonstrate this, I wrote a simple test.
 
  http://rostedt.homelinux.com/rt/rt-migrate-test.c

  (gcc -o rt-migrate-test rt-migrate-test.c -lpthread)

This test expects a parameter to pass in the number of threads to create.
If you add the '-c' option (check), it will terminate if the test fails
one of the iterations. If you use this option, pass in one thread more
than the number of CPUs.

For example, on a 4 way box, I used

  rt-migrate-test -c 5

What this test does is create the number of threads specified (in this
case 5). Each thread is set as an RT FIFO task starting at a specified
prio (default 2), with each thread one priority higher than the last. So
in this example the 5 threads created are at priorities 2, 3, 4, 5, and 6.
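
For reference, putting a thread into SCHED_FIFO at a given priority looks
roughly like this (a sketch only; this is not taken from rt-migrate-test.c,
and the helper name is made up):

  #include <pthread.h>
  #include <sched.h>

  /* Hypothetical helper: make the calling thread SCHED_FIFO at
   * base_prio + id, matching the "each thread one priority higher"
   * setup described above.  Needs root (or CAP_SYS_NICE). */
  static void set_fifo_prio(int base_prio, int id)
  {
          struct sched_param param = { .sched_priority = base_prio + id };

          pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
  }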

The parent thread sets its priority to one higher than the highest of
the children (in this example 7). It uses pthread_barrier_wait to
synchronize the threads.  Then it takes a time stamp and starts all the
threads. Each thread, when woken up, takes a time stamp and compares it
to the parent's to see how long it took to be awoken. It then runs in a
busy loop for an interval (20ms default). The busy loop ends when it
reaches the interval delta from its start time stamp, so if the thread
is preempted, it may not actually run for the full interval. This is
expected behavior of the test.
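
One iteration of the measurement can be pictured like this (again a
sketch under the assumptions above, not the actual test source):

  #include <time.h>

  #define INTERVAL_US 20000LL   /* 20ms default interval */

  /* Microsecond time stamp; using CLOCK_MONOTONIC is an assumption
   * here (may need -lrt on older glibc). */
  static long long now_us(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
  }

  /* One iteration: record wake latency against the parent's stamp,
   * then busy-loop until INTERVAL_US has elapsed since our own start
   * stamp.  A preempted thread loops fewer times in that window. */
  static void run_iteration(long long parent_stamp, long long *wake_delta,
                            long *loops, long long *end_delta)
  {
          long long start = now_us();

          *wake_delta = start - parent_stamp;
          *loops = 0;
          while (now_us() - start < INTERVAL_US)
                  (*loops)++;
          *end_delta = now_us() - parent_stamp;
  }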

The numbers recorded are the delta between the thread's time stamp and
the parent's time stamp, the number of iterations the busy loop ran for,
and the delta between a thread time stamp taken at the end of the loop
and the parent's time stamp.

Sometimes a lower priority task might wake up before a higher priority
one, but this is OK, as long as the higher priority process gets the CPU
when it is awoken.

At the end of the test, the iteration data is printed to stdout. If a
higher priority task had to wait for a lower one to finish running, then
this is considered a failure. Here's an example of the output from
a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.

     1:      36      33   20041      39      33
   len:   20036   20033   40041   20039   20033
 loops:  167789  167693  227167  167829  167814

On iteration 1 (iterations start at 0) the third task started 20ms after
the parent woke it up. We can see here that the first two tasks ran to
completion before the higher priority task was even able to start. That
is a 20ms latency for the higher priority task!!!

So people who think that upping the priority of their audio threads will
get rid of most latencies may be in for a surprise, since some kernel
threads (like the migration thread itself) may cause this latency.

To solve this issue, I've changed the RT task balancing from a passive
method (migration thread) to an active method.  This new method is
to actively push or pull RT tasks when they are woken up or scheduled.

On wake up of a task, if it is an RT task and there's already an RT task
of higher priority running on its runqueue, we initiate a push_rt_tasks
algorithm. This algorithm looks at the highest priority non-running RT
task and tries to find a CPU it can run on.
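
In rough pseudo-C, the push side described above might look like the
following; every helper name here is invented for illustration and is
not from the patch:

  /* Sketch only -- helpers are invented, not from the patch. */
  static void push_rt_tasks(struct rq *rq)
  {
          struct task_struct *p;
          int cpu;

          /* Highest priority RT task that is queued but not running. */
          while ((p = pick_next_highest_rt(rq)) != NULL) {
                  /* A CPU whose current task runs at lower priority
                   * than p, and that p's cpu affinity allows. */
                  cpu = find_lowest_prio_cpu(p);
                  if (cpu < 0)
                          break;    /* nowhere better for it to run */
                  migrate_rt_task(p, rq, cpu);
          }
  }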