On Thu, 2006-06-22 at 13:19 +0700, Alain Fauconnet wrote:
> > Load is merely a count of how many processes are in the "ready to
> > run/waiting for CPU" state when the measurement is taken.  Processes
> > that are waiting on disk I/O are not ready to run (they will have a D in
> > the state field in top).  Processes that are doing a lot of I/O to a
> 
> I beg to differ: AFAIK load average includes processes in the 'D'
> (disk I/O wait) state. I don't quite have the time to dive in the
> kernel sources right now but I'm fairly positive it does. 

You are correct.  kernel/sched.c nr_active... it does count both
"running" and "uninterruptible".
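For anyone who wants to eyeball this on a running box, here's a rough sketch (assuming procps ps and the usual state letters) that tallies processes the same way nr_active does:

```shell
# Tally process states the way kernel/sched.c nr_active does:
# "running" (state R) plus "uninterruptible" (state D).
# The output format here is our own, not ps's.
ps -eo stat= | awk '
    /^R/ { r++ }
    /^D/ { d++ }
    END { printf "running=%d uninterruptible=%d contributing=%d\n",
                 r, d, r + d }
'
```

The instantaneous sum won't match the load average exactly, of course, since that's an exponentially damped average over time.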

But it's worth pointing out that my original description of the "D"
state is not entirely accurate.  This is "uninterruptible
sleep" (TASK_UNINTERRUPTIBLE), which in my experience is not ordinary
disk i/o (like using read(2)) but rather things closer to the core:
waiting for swap, or text segments of the binary/libraries living on an
NFS mount that is not responding.  Technically, these are "disk i/o",
but they are not the kind of disk i/o one would normally write into a
program; for that you'd use read(2) and company, which normally block
and put the process into the S state.

I honestly can't remember the last time I saw a process in the D state
on a system that wasn't having some kind of problem (like some serious
thrashing).  This should show up as a non-zero value in the b column in
vmstat.  Non-zero b (blocked, I assume) values should correlate with
high si and so activity, and maybe high iowait percentage.  This makes
sense -- the kernel is processing disk-related tasks on behalf of a
process (in the D state), so the CPU is doing something (and
contributing to load), it's just not actually running any userspace
processes.  As far as swap goes, state D essentially means "this process
would be running if this machine had more RAM".
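You can watch the same counters vmstat reads without running vmstat at all, assuming a 2.6-era /proc/stat (the procs_running and procs_blocked lines are what feed vmstat's r and b columns):

```shell
# Instantaneous counts behind vmstat's r and b columns.
# procs_blocked is the number of tasks in uninterruptible (D) sleep,
# i.e. the processes this paragraph is talking about.
awk '
    /^procs_running/ { print "r=" $2 }
    /^procs_blocked/ { print "b=" $2 }
' /proc/stat
```

On a healthy, mostly idle box b should be 0 nearly every time you sample it, which matches my experience above.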


> I've found
> some references on the web confirming that opinion, e.g.:
> 
> http://linux-ha.org/DRBD/FAQ:
> 
> ``Load average is defined as average number of processes in the runqueue
> during a given interval. A process is in the run queue, if it is
> 
>     *      not waiting for external events (e.g. select on some fd)
>     *      not waiting on its own (not called "wait" explicitly)
>     *      not stopped
> 
> Note that all processes waiting for disk io are counted as runnable!
> Therefore, if a lot of processes wait for disk io, the "load average"
> goes straight up, though the system actually may be almost idle
> cpu-wise''

Heh, really, the core of the problem in talking about this is that "a
blocked process" means something different based on context, be it
non-blocking I/O, uninterruptible sleep, what have you.  I got caught
up in the terms when trying to make the distinction between userspace
disk activity (solely the bi and bo columns) and kernelspace disk
activity (the b, swap and iowait columns, maybe bi and bo _also_ if
your swap is on a local disk).  The kernel, and thus vmstat, counts a
process as blocked not when it explicitly asked to do something that
would stop it executing (calls like read(2) are expected to block; the
process goes into state S until the operation completes), but rather
when it is not executing because of the overall state of the system
(having to wait for swap).  The former is not fixable (assuming
blocking I/O calls are a "problem") without an application rewrite; the
latter is fixable by changing system hardware, tweaking system
parameters, or making sure your NFS server doesn't go down (among other
ways).  I guess another way to put it is that it somewhat depends on
how the scheduler was (re)entered: as the result of a system call, or
as the result of a page fault.


The next claim on that page is questionable...

        E.g. crash your nfs server, and start 100
        ls /path/to/non-cached/dir/on/nfs/mount-point on a client... you
        get a "load average" of 100+ for as long as the nfs timeout,
        which might be weeks ... though the cpu does nothing.

... as I cannot reproduce this using kernel 2.6.16-1.2096_FC5 on an FC5
box against my 2.4.31-5trsmp trustix-2.2 NFS server.  Load did not
increase on the client at all, nor did the size of the run queue as
reported by vmstat, despite there being 100 ls processes blocked in a
stat64 call (according to strace) on an NFS mount point whose server
was dead.  vmstat reported no i/o activity at all, the iowait
percentage was zero, and the cpu was 100% idle.  b was zero -- of
course, I don't think I could have strace'd these successfully if they
were in uninterruptible sleep.

It may be different with a 2.4 NFS client, where the NFS code puts the
caller into D, or depending on whether you use hard or soft mounts, or
mount with the intr option (I'm using hard mounts and nointr, the
defaults according to nfs(5)).
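For reference, those knobs live in the mount options; a hypothetical fstab line (server name and paths made up) choosing the forgiving behavior instead of those defaults might look like:

```
# soft: give up with an I/O error after retrans retries instead of
# blocking indefinitely; intr: allow signals to interrupt the wait.
# (hard,nointr -- the nfs(5) defaults -- is what I tested with above.)
server:/export  /mnt/nfs  nfs  soft,intr,timeo=100,retrans=3  0  0
```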

That being said, running ls against an unresponsive NFS server may not
be a good way to test this, since the FAQ entry is talking about
processes in the D state, and these ls processes were not in the D
state but in the S state (sleeping) -- which makes sense: the stat64
syscall ends up blocking while waiting for the fs subsystem to perform
the operation.  Until then, the process is not on a run queue and its
sleep is interruptible, so it doesn't count against the load average.
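You can see which flavor of sleep a process is in without strace, assuming /proc is mounted; a quick sketch, using the shell itself as the target process:

```shell
# The State: line in /proc/PID/status distinguishes S (interruptible
# sleep) from D (uninterruptible sleep).  Here we inspect this shell,
# which is sitting in wait() while awk runs, so expect S.
pid=$$
awk '/^State:/ { print $2, $3 }' "/proc/$pid/status"
```

The same check against the pid of one of those stuck ls processes would have shown S, not D, confirming what strace implied.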


> I have for sure seen Linux boxes heavily thrashing due to swapping or
> unreasonable disk I/O, with very little CPU load but load average in
> the 100s.

Yes, heavy swap usage matches the definition of "uninterruptible sleep"
when used to calculate load average.  I think we've confirmed this
now. :)


> [Rest deleted - no comments on these quite informative bits]

Thanks! But I'm not sure how much of this stuff I really know anyway. ;)

... er, sorry for the length.  Sometimes tangents can be interesting.

-- 
Andy Bakun <[EMAIL PROTECTED]>

_______________________________________________
tsl-discuss mailing list
[email protected]
http://lists.trustix.org/mailman/listinfo/tsl-discuss
