On Thu, 2006-06-22 at 13:19 +0700, Alain Fauconnet wrote:
> > Load is merely a count of how many processes are in the "ready to
> > run/waiting for CPU" state when the measurement is taken. Processes
> > that are waiting on disk I/O are not ready to run (they will have a D in
> > the state field in top). Processes that are doing a lot of I/O to a
>
> I beg to differ: AFAIK load average includes processes in the 'D'
> (disk I/O wait) state. I don't quite have the time to dive in the
> kernel sources right now but I'm fairly positive it does.
You are correct.  kernel/sched.c's nr_active() does count both
"running" and "uninterruptible" tasks.

But it's worth pointing out that my original description of the "D"
state is not entirely accurate.  This is "uninterruptible sleep"
(TASK_UNINTERRUPTIBLE), which in my experience is not ordinary disk
i/o (like using read(2)) but rather more core stuff, like waiting for
swap, or text segments in the binary/libraries living on an NFS mount
that is not responding.  Technically, these are "disk i/o", but they
are not the kind of disk i/o one would normally write into a program;
you'd be more likely to use read(2) and company, which normally block
and put the process into the S state.  I honestly can't remember the
last time I saw a process in the D state on a system that wasn't
having some kind of problem (like some serious thrashing).

This should show up as a non-zero value in the b column in vmstat.
Non-zero b (blocked, I assume) values should correlate with high si
and so activity, and maybe a high iowait percentage.  This makes
sense -- the kernel is doing disk-related work on behalf of a process
(in the D state), so the CPU is doing something (and contributing to
load), it's just not actually running any userspace processes.  As far
as swap goes, state D essentially means "this process would be running
if this machine had more RAM".

> I've found some references on the web comforting me in that opinion,
> e.g.:
>
> http://linux-ha.org/DRBD/FAQ:
>
> ``Load average is defined as average number of processes in the
> runqueue during a given interval. A process is in the run queue, if
> it is
>
>   * not waiting for external events (e.g. select on some fd)
>   * not waiting on its own (not called "wait" explicitly)
>   * not stopped
>
> Note that all processes waiting for disk io are counted as runnable!
> Therefore, if a lot of processes wait for disk io, the "load
> average" goes straight up, though the system actually may be almost
> idle cpu-wise''

Heh, really, the core of the problem in trying to talk about this is
that "a blocked process" means something different based on context,
be it non-blocking I/O, uninterruptible sleep, what have you.  I got
caught up in the terms when trying to make the distinction between
userspace disk activity (solely the bi and bo columns) and
kernelspace disk activity (the b, swap and iowait columns, maybe bi
and bo _also_ if your swap is on a local disk).

The kernel, and thus vmstat, counts a process as blocked not when it
explicitly asked to do something that would make it not execute
(things like read(2) are assumed to block; the process goes into
state S until the operation completes), but rather when it is not
executing because of the overall state of the system (having to wait
for swap).  The former is not fixable (assuming blocking io calls are
a "problem") without an application rewrite; the latter is fixable by
changing system hardware, tweaking system parameters, or making sure
your NFS server doesn't go down (among other ways).  I guess another
way to put it might be that it somewhat depends on how the scheduler
was (re)entered: as the result of a system call, or as the result of
a page fault.

The next claim on that page is questionable...

  ``E.g. crash your nfs server, and start 100
      ls /path/to/non-cached/dir/on/nfs/mount-point
  on a client... you get a "load average" of 100+ for as long as the
  nfs timeout, which might be weeks ... though the cpu does nothing.''

... as I can not reproduce this using kernel 2.6.16-1.2096_FC5 on a
FC5 box against my 2.4.31-5trsmp trustix-2.2 NFS server.  Load did not
increase on the client at all, nor did the size of the run queue
increase as reported by vmstat, despite there being 100 ls processes
blocked in a stat64 call (according to strace) on an NFS mount point
whose server was dead.
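(For reference, the raw counters behind all of these numbers can be
read straight out of /proc, without waiting on vmstat's sampling
interval.  Just a quick sketch, assuming a 2.6-era /proc layout --
field positions may differ on other kernels:)

```shell
# Load averages: the first three fields of /proc/loadavg are the 1-,
# 5- and 15-minute averages of tasks that are either runnable or in
# uninterruptible sleep.
cat /proc/loadavg

# vmstat's "r" and "b" columns come from these two counters in
# /proc/stat (instantaneous, not averaged):
awk '/^procs_running/ { r = $2 }
     /^procs_blocked/ { b = $2 }
     END { printf "running=%d blocked=%d\n", r, b }' /proc/stat

# And the D-state (uninterruptible sleep) processes themselves, if
# any; the stat column starts with the state letter.
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
```

procs_blocked is the instantaneous count that vmstat averages into
its b column; a persistently non-zero value there is the thrashing
signature described above.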
vmstat reported no i/o activity at all, percentage wait was also zero
and the cpu was 100% idle.  b was zero -- of course, I don't think I
could have used strace on these successfully if they were in
uninterruptible sleep.

It may be different if one were to use a 2.4 NFS client, where the
NFS code puts the caller into D, or depending on whether you use hard
or soft mounts or mount with the intr option (I'm using hard mounts
and nointr, the defaults according to nfs(5)).

That being said, running ls against an unresponsive NFS server may
not be a good way to test this, since the FAQ entry is talking about
processes in the D state, and these ls processes were not in the D
state, but rather in the S state (sleeping) -- which makes sense: the
stat64 syscall ends up blocking waiting for the fs subsystem to
perform the operation.  Until then, the process is not put into a run
queue; it's in interruptible sleep, so it doesn't count against the
load average.

> I have for sure seen Linux boxes heavily thrashing due to swapping
> or unreasonable disk I/O, with very little CPU load but load
> average in the 100s.

Yes, heavy swap usage matches the definition of "uninterruptible
sleep" as used to calculate load average.  I think we've confirmed
this now. :)

> [Rest deleted - no comments on these quite informative bits]

Thanks!  But I'm not sure how much of this stuff I really know
anyway. ;)

... er, sorry for the length.  Sometimes tangents can be interesting.

-- 
Andy Bakun <[EMAIL PROTECTED]>

_______________________________________________
tsl-discuss mailing list
[email protected]
http://lists.trustix.org/mailman/listinfo/tsl-discuss
