Re: [PATCH] VESA framebuffer w/ MTRR locks 2.4.0 on init

2001-01-05 Thread David Wragg

Chris Kloiber <[EMAIL PROTECTED]> writes:
> last 2 lines in dmesg output:
> mtrr: 0xd8000000,0x2000000 overlaps existing 0xd8000000,0x1000000
> mtrr: 0xd8000000,0x2000000 overlaps existing 0xd8000000,0x1000000

Are you running XFree86-4.0.x?

> cat /proc/mtrr
> reg00: base=0x00000000 (   0MB), size= 256MB: write-back, count=1
> reg01: base=0xd8000000 (3456MB), size=  16MB: write-combining, count=1
> reg05: base=0xd0000000 (3328MB), size=  64MB: write-combining, count=1
>  
> 
> My video card is Voodoo3/3000/AGP and my motherboard is an MSI-6330
> (Athlon Tbird 800)
> I am experiencing text console video corruption. In tdfxfb mode or
> regular vesafb it looks like a horizontal line of color pixels that
> grows, in 'regular' text mode I get flashing characters or the font
> degrades into unreadable mess. X is fine.

What does "lspci -v" give?


David Wragg
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] VESA framebuffer w/MTRR locks 2.4.0 on init

2001-01-05 Thread David Wragg

Alan Cox <[EMAIL PROTECTED]> writes:
> > loop with no exit, as each size mtrr fails.
> >  while (mtrr_add(video_base, temp_size, MTRR_TYPE_WRCOMB,
> > 1)==-EINVAL) {
> >  temp_size >>= 1;
> >  }
> 
> Ok that one is the bug.

Even with the obvious bug fixed, that code is strange.  "temp_size >>=
1" does little to improve the chances of the mtrr_add succeeding.
Something like this would be better:

if (mtrr_add(video_base, temp_size, MTRR_TYPE_WRCOMB, 1) == -EINVAL) {
        /* Find the largest power-of-two */
        while (temp_size & (temp_size - 1))
                temp_size &= (temp_size - 1);

        mtrr_add(video_base, temp_size, MTRR_TYPE_WRCOMB, 1);
}


(But this is just a very crude way to work around the inflexibility of
the MTRRs.  Rather than cluttering up calls to mtrr_add, it would be
better to fix this properly, either by implementing PAT support
(Zoltán Böszörményi said he was working on that), or by having a
user-space helper program to make more intelligent MTRR allocations,
or both.)
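
As a minimal user-space sketch of the rounding step in the snippet above: clearing the lowest set bit repeatedly leaves only the highest set bit, which is the largest power of two not exceeding the size.  The name round_down_pow2 is hypothetical and not part of mtrr.c or vesafb.

```c
#include <stdint.h>

/* Hypothetical helper, not kernel code: returns the largest power of
 * two that is <= size, or 0 if size is 0. */
uint32_t round_down_pow2(uint32_t size)
{
    /* size & (size - 1) clears the lowest set bit; repeat until at
     * most one bit remains. */
    while (size & (size - 1))
        size &= size - 1;
    return size;
}
```

For example, a 3MB size (0x300000) would be rounded down to 2MB (0x200000), while a size that is already a power of two is returned unchanged.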


David Wragg



Re: MTRR type AMD Duron/intel ?

2001-01-15 Thread David Wragg

David Balazic <[EMAIL PROTECTED]> writes:
> A recent 2.4.0 ( not the final , but close  ) kernel prints this :
> 
> mtrr: detected mtrr type: intel
> 
> I have an AMD K7 Duron 700 CPU
> 
> Is this correct ?

Yes.  The K7 supports MTRRs exactly according to the Intel specs, as
opposed to the MTRR-like but somewhat different features that some
other x86 CPUs implement.  So while it may appear odd, it is correct.


David Wragg



x86 local timer interrupts getting lost

2000-11-30 Thread David Wragg

I've noticed that on dual processor machines, the two values in the
LOC: line of /proc/interrupts are not in lockstep -- the difference
between them varies.  Using a perl script (below) to print out the
difference every second, I see it wander around (by much more than the
+/-1 error that /proc/interrupts involves).  Sometimes the difference
will jump, usually when the machine is under heavier interrupt load.
I've seen jumps of more than 10, up and down on the same machine.

The machines I've been testing this on are a dual PPro and a dual
Celeron, both running 2.4.0-test11.

Looking at the APIC documentation, it seems unlikely that the
frequency of the local timer interrupts could be wandering differently
on the two processors, so I suspect that the interrupts are actually
getting lost.  Does anyone know how this could be happening?


The perl script:

perl -e 'while(1) { open(IN, "</proc/interrupts"); while(<IN>) { next unless /LOC:/;
($a, $b, $c) = split; } close(IN); print (($b - $c) . "\n"); sleep(1); }'



David Wragg



Re: Kernel 2.2.17 with RedHat 7 Problem !

2000-10-23 Thread David Wragg

Gregory Maxwell <[EMAIL PROTECTED]> writes:
> If 2.96 is broken, I'd appreciate it if you would describe the breakage. 

As in the RedHat 2.96?  Try compiling the following on RedHat 7.0 x86
with "gcc -O2" and take a look at the generated code.  Nice, isn't it?


#include <stddef.h>
#include <sys/time.h>

void foo(void)
{
        struct itimerval iv;

        iv.it_interval.tv_sec = 0;
        iv.it_interval.tv_usec = 25;
        iv.it_value = iv.it_interval;

        setitimer(ITIMER_REAL, &iv, NULL);
}



Re: Kernel 2.2.17 with RedHat 7 Problem !

2000-10-23 Thread David Wragg

Aaron Sethman <[EMAIL PROTECTED]> writes:
> Try compiling the said code with -fno-strict-aliasing, and your problems
> will be solved.

Yes, but I don't think I should have to give gcc flags to get it to
obey the C standard (my example can easily be turned into a
self-contained strictly conforming program, in order to qualify for
the full weight of the standard).

>  gcc is doing the right thing, just not what you expected.

gcc is not doing the right thing.  My example contains no type punning
or other deviations from ISO C which might warrant
-fno-strict-aliasing.  The program should behave as if the assignments
are evaluated sequentially; with this compiler, it does not.
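
For contrast, here is a rough sketch of what type punning looks like, next to a conforming alternative.  The foo() example does nothing of the first kind.  The function names are hypothetical, and the sketch assumes a 32-bit IEEE-754 float.

```c
#include <stdint.h>
#include <string.h>

/* Type punning: reads a float object through a uint32_t lvalue.  This
 * breaks the ISO C aliasing rules, so it is the kind of code that may
 * need -fno-strict-aliasing. */
uint32_t bits_punned(float f)
{
    return *(uint32_t *)&f;
}

/* Conforming alternative: copy the object representation instead. */
uint32_t bits_copied(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);
    return u;
}
```

Plain assignment between two struct members of the same type, as in foo(), involves no such cast.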

> The kernel already checks to see if gcc can grok -fno-strict-aliasing

Yes, since the kernel needs to say to gcc "I know this code relies on
more than the ISO C semantics, so please be gentle with it".


David



What protects f_pos?

2000-11-04 Thread David Wragg

Since f_pos of struct file is a loff_t, on 32-bit architectures it
needs a lock to make accesses atomic (or some more sophisticated form
of protection).  But looking in 2.4.0-test10, there doesn't seem to be
any such lock.
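
To make the atomicity concern concrete, here is a user-space model (a sketch, not kernel code) of a 64-bit offset stored as two 32-bit words, which is effectively how a 32-bit machine handles it.  The type and function names are hypothetical.

```c
#include <stdint.h>

/* Model of a 64-bit value on a 32-bit machine: two separate words. */
struct split64 {
    uint32_t lo, hi;
};

/* A 64-bit store takes two steps; a reader can run between them. */
void split64_store_lo(struct split64 *p, uint64_t v)
{
    p->lo = (uint32_t)v;
}

void split64_store_hi(struct split64 *p, uint64_t v)
{
    p->hi = (uint32_t)(v >> 32);
}

uint64_t split64_load(const struct split64 *p)
{
    return ((uint64_t)p->hi << 32) | p->lo;
}
```

If the offset moves from 0xffffffff to 0x100000000 and a reader loads it after the low word has been written but before the high word, it sees 0, a value the offset never held.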

The llseek op is called with the Big Kernel Lock, but unlike in 2.2,
the read and write ops are called without any locks held, and so
generic_file_{read|write} make unprotected accesses to f_pos (through
their ppos argument).

Is this something for Ted's todo list?

David



Re: What protects f_pos?

2000-11-11 Thread David Wragg

[EMAIL PROTECTED] writes:
> This looks like it's a bug to me  although if you have multiple
> threads hitting a file descriptor at the same time, you're pretty much
> asking for trouble.

Yes, I haven't been able to come up with an example that might trigger
this that wasn't dubious to begin with.  I'll raise this again at a
convenient time during 2.5.

David




Re: What protects f_pos?

2000-11-13 Thread David Wragg

"David Schwartz" <[EMAIL PROTECTED]> writes:
>   Suppose you had a multithreaded program that used a
> configuration file with a group of fixed-length blocks indicating what
> work it had to do. Each thread read a block from the file and then did
> the work. One might think that there's no need to protect the file
> descriptor with a mutex.

I don't think that this will work, due to a separate non-atomicity
issue with f_pos.  The generic file read and write implementations do
not atomically update f_pos.  They read f_pos to determine the file
offset to use, then manipulate the page cache (possibly sleeping on
I/O), and only then set f_pos to the appropriate updated value.  So
the example you suggest, with two threads, could do something like:

  Thread 1                      Thread 2

  sys_read(fd, buf1, len)

  off = file->f_pos             sys_read(fd, buf2, len)

  read to buf1                  off = file->f_pos

  file->f_pos = off + len       read to buf2

                                file->f_pos = off + len


So both threads read the same block, and f_pos only gets incremented
once.
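
The interleaving can be replayed deterministically with a small user-space model.  This is only a sketch of the sample-then-write-back pattern described above, not the generic_file_read code; all names in it are hypothetical.

```c
/* Model of a file whose offset is sampled at the start of a read and
 * written back only after the data has been copied. */
struct file_model {
    long f_pos;
};

struct read_state {
    long off;   /* offset sampled when the read started */
};

/* Step 1: off = file->f_pos */
void read_begin(struct read_state *r, const struct file_model *f)
{
    r->off = f->f_pos;
}

/* Steps 2 and 3: copy the data (omitted), then file->f_pos = off + len.
 * Returns the offset the data was read from. */
long read_finish(const struct read_state *r, struct file_model *f, long len)
{
    f->f_pos = r->off + len;
    return r->off;
}
```

Calling read_begin for both threads before either read_finish reproduces the outcome above: both reads start at offset 0, and f_pos ends at len rather than 2 * len.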

(Pipes and sockets are a different matter, of course.)

2.2 has the same issue, since although the BKL is held, it will get
dropped if one of the threads sleeps on I/O.  (Earlier Linux versions
might well have the same issue, but I don't have the source around to
check.)

POSIX doesn't seem to bar this behaviour.  From 6.4.1.2:

On a regular file or other file capable of seeking, read() shall
start at a position in the file given by the file offset
associated with fildes.  Before successful return from read(), the
file offset shall be incremented by the number of bytes actually
read.

Which is exactly what Linux does.  I can't find text anywhere else in
POSIX.1 that strengthens that condition for the case of multiple
processes/threads reading from the same file.  I'll try to find out
what the Austin Group has to say about this.


David



Re: 2.2.18pre25, S3, AMD K6-2, and MTRR....

2000-12-10 Thread David Wragg

"Victor J. Orlikowski" <[EMAIL PROTECTED]> writes:
> This is precisely my problem.
> K6-2, model 8, stepping 12.
> Thus far, everything is *fine*, as long as MTRR is not compiled into
> the kernel.
> If MTRR is compiled into the kernel, I get lock-ups in X *only*, and
> the entire machine locks.

Just to check the important facts (correct any of this if it is
wrong): In 2.2.18, the problem appears when you enable MTRR support,
but goes away when you disable MTRR support.  You are using the vesafb
driver.  You are running XFree86-3.x.

Are you passing any vesafb options on the kernel command line?

If not, this is very strange, because the 2.2.18 mtrr.c (or any other)
should not be touching the MTRR registers (or whatever the K6 calls
them) unless you do something to /proc/mtrr.

If I understood why the MTRR driver was doing something on the K6-2,
then model-specific differences might make some sense.  But currently,
I don't see why there would be any difference between "MTRR disabled"
and "MTRR enabled, but not used".


David Wragg



Re: 2.2.18pre25, S3, AMD K6-2, and MTRR....

2000-12-11 Thread David Wragg

Steven Walter <[EMAIL PROTECTED]> writes:
> On Sun, Dec 10, 2000 at 06:20:31PM +0000, David Wragg wrote:
> > If I understood why the MTRR driver was doing something on the K6-2,
> > then model-specific differences might make some sense.  But currently,
> > I don't see why there would be any difference between "MTRR disabled"
> > and "MTRR enabled, but not used".
> 
> If I'm not mistaken, X /does/ touch the MTRR's, which would explain why
> it is X that crashes.

Only in XFree86-4.x (I never distributed my MTRR patches for
XFree86-3.x ;-).  Which is why the XFree86 version was one of the
things I wanted to confirm.



Re: 2.2.16 SMP: mtrr errors

2000-12-12 Thread David Wragg

" Paul C. Nendick " <[EMAIL PROTECTED]> writes:
> Shall I submit this to Matrox as a bug then?

The "bug" is in the XFree86 core, so telling Matrox might not do a lot
of good.

The driver code just says "I want to map a framebuffer of this size at
this physical address" (or actually "with these PCI details"), and the
core code arranges the mapping, doing the MTRR stuff while it's at it.

David Wragg




Re: Is there a Linux trademark issue with sun?

2000-12-16 Thread David Wragg

Rob Landley <[EMAIL PROTECTED]> writes:
> Remember, Linux uses
> Posix (although we don't SAY posix much because that's
> another trademark and nobody jumps through the hoops
> to re-test each new combination of kernel version X
> with utility set Y).

POSIX is not a trademark.  The name refers to an IEEE/ISO/IEC
standard.

You don't *have* to run any tests to claim that Linux (+ libc +
utilities) conforms to POSIX.  But if you don't run a suitable test
suite, how can you be confident that it does conform to POSIX?

(The origin of the term POSIX may be a surprise to some.  See
<http://www.linuxcare.com/viewpoints/os-interviews/12-14-99.epl>).


David Wragg



Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread David Wragg

Tigran Aivazian <[EMAIL PROTECTED]> writes:
> b) it detects all memory correctly but creates a write-back mtrr only for
> the first 2G, is this normal?

mtrr.c is broken for machines with >=4GB of memory (or less than 4GB,
if the chipset reserves an address range below 4GB for PCI).

The patch against 2.4.0-test9 to fix this is below.

Richard: Is there a reason you haven't passed this on to Linus, or do
you want me to do it?


Dave



diff -rua linux-2.4.0test9/arch/i386/kernel/mtrr.c 
linux-2.4.0test9.mod/arch/i386/kernel/mtrr.c
--- linux-2.4.0test9/arch/i386/kernel/mtrr.cWed Oct 11 19:54:56 2000
+++ linux-2.4.0test9.mod/arch/i386/kernel/mtrr.cWed Oct 11 20:48:26 2000
@@ -503,9 +503,9 @@
 static void intel_get_mtrr (unsigned int reg, unsigned long *base,
unsigned long *size, mtrr_type *type)
 {
-unsigned long dummy, mask_lo, base_lo;
+unsigned long mask_lo, mask_hi, base_lo, base_hi;
 
-rdmsr (MTRRphysMask_MSR(reg), mask_lo, dummy);
+rdmsr (MTRRphysMask_MSR(reg), mask_lo, mask_hi);
 if ( (mask_lo & 0x800) == 0 )
 {
/*  Invalid (i.e. free) range  */
@@ -515,20 +515,17 @@
return;
 }
 
-rdmsr(MTRRphysBase_MSR(reg), base_lo, dummy);
+rdmsr(MTRRphysBase_MSR(reg), base_lo, base_hi);
 
-/* We ignore the extra address bits (32-35). If someone wants to
-   run x86 Linux on a machine with >4GB memory, this will be the
-   least of their problems. */
+/* Work out the shifted address mask. */
+mask_lo = 0xff00 | mask_hi << (32 - PAGE_SHIFT)
+   | mask_lo >> PAGE_SHIFT;
 
-/* Clean up mask_lo so it gives the real address mask. */
-mask_lo = (mask_lo & 0xf000UL);
 /* This works correctly if size is a power of two, i.e. a
contiguous range. */
-*size = ~(mask_lo - 1);
-
-*base = (base_lo & 0xf000UL);
-*type = (base_lo & 0xff);
+*size = -mask_lo;
+*base = base_hi << (32 - PAGE_SHIFT) | base_lo >> PAGE_SHIFT;
+*type = base_lo & 0xff;
 }   /*  End Function intel_get_mtrr  */
 
 static void cyrix_get_arr (unsigned int reg, unsigned long *base,
@@ -553,13 +550,13 @@
 /* Enable interrupts if it was enabled previously */
 __restore_flags (flags);
 shift = ((unsigned char *) base)[1] & 0x0f;
-*base &= 0xf000UL;
+*base >>= PAGE_SHIFT;
 
 /* Power of two, at least 4K on ARR0-ARR6, 256K on ARR7
  * Note: shift==0xf means 4G, this is unsupported.
  */
 if (shift)
-  *size = (reg < 7 ? 0x800UL : 0x2UL) << shift;
+  *size = (reg < 7 ? 0x1UL : 0x40UL) << (shift - 1);
 else
   *size = 0;
 
@@ -596,7 +593,7 @@
 /*  Upper dword is region 1, lower is region 0  */
 if (reg == 1) low = high;
 /*  The base masks off on the right alignment  */
-*base = low & 0xFFFE;
+*base = (low & 0xFFFE) >> PAGE_SHIFT;
 *type = 0;
 if (low & 1) *type = MTRR_TYPE_UNCACHABLE;
 if (low & 2) *type = MTRR_TYPE_WRCOMB;
@@ -621,7 +618,7 @@
  * *128K   ...
  */
 low = (~low) & 0x1FFFC;
-*size = (low + 4) << 15;
+*size = (low + 4) << (15 - PAGE_SHIFT);
 return;
 }   /*  End Function amd_get_mtrr  */
 
@@ -634,8 +631,8 @@
 static void centaur_get_mcr (unsigned int reg, unsigned long *base,
 unsigned long *size, mtrr_type *type)
 {
-*base = centaur_mcr[reg].high & 0xf000;
-*size = (~(centaur_mcr[reg].low & 0xf000))+1;
+*base = centaur_mcr[reg].high >> PAGE_SHIFT;
+*size = -(centaur_mcr[reg].low & 0xf000) >> PAGE_SHIFT;
 *type = MTRR_TYPE_WRCOMB;  /*  If it is there, it is write-combining  */
 }   /*  End Function centaur_get_mcr  */
 
@@ -665,8 +662,10 @@
 }
 else
 {
-   wrmsr (MTRRphysBase_MSR (reg), base | type, 0);
-   wrmsr (MTRRphysMask_MSR (reg), ~(size - 1) | 0x800, 0);
+   wrmsr (MTRRphysBase_MSR (reg), base << PAGE_SHIFT | type,
+  (base & 0xf0) >> (32 - PAGE_SHIFT));
+   wrmsr (MTRRphysMask_MSR (reg), -size << PAGE_SHIFT | 0x800,
+  (-size & 0xf0) >> (32 - PAGE_SHIFT));
 }
 if (do_safe) set_mtrr_done (&ctxt);
 }   /*  End Function intel_set_mtrr_up  */
@@ -680,7 +679,9 @@
 arr = CX86_ARR_BASE + (reg << 1) + reg; /* avoid multiplication by 3 */
 
 /* count down from 32M (ARR0-ARR6) or from 2G (ARR7) */
-size >>= (reg < 7 ? 12 : 18);
+if (reg >= 7)
+   size >>= 6;
+
 size &= 0x7fff; /* make sure arr_size <= 14 */
 for(arr_size = 0; size; arr_size++, size >>= 1);
 
@@ -705,6 +706,7 @@
 }
 
 if (do_safe) set_mtrr_prepare (&ctxt);
+base <<= PAGE_SHIFT;
 setCx86(arr,((unsigned char *) &base)[3]);
 setCx86(arr+1,  ((unsigned char *) &base)[2]);
 setCx86(arr+2, (((unsigned char *) &base)[1]) | arr_size);
@@ -724,34 +726,36 @@
 [RETURNS] Nothing.
 */
 {
-u32 low, high;
+u32 regs[2];
 struct set_mtrr_context ctxt;
 
 if (do_safe) set_mtrr_prepare (&ctxt);
 /*
   

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread David Wragg

Richard Gooch <[EMAIL PROTECTED]> writes:
> David Wragg writes:
> > mtrr.c is broken for machines with >=4GB of memory (or less than 4GB,
> > if the chipset reserves an address range below 4GB for PCI).
> > 
> > The patch against 2.4.0-test9 to fix this is below.
> > 
> > Richard: Is there a reason you haven't passed this on to Linus, or do
> > you want me to do it?
> 
> Partly because I haven't had time to look at it, partly because I'm
> not sure if it's needed (why, exactly?)

Because mtrr.c throws away the top 4 bits of 36-bit physical
addresses, it gives misleading /proc/mtrr output on machines with
>=4GB of memory, which I think requires a fix on its own.  But worse,
if it tries to make MTRR changes on such a machine, you can get bogus
MTRR settings. This can ruin a machine's performance (if real memory
ends up write combined or uncached) or give hardware instabilities (if
a device's MMIO area gets the wrong memory type).
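
A small sketch of the truncation problem, assuming 4KB pages (PAGE_SHIFT of 12, as in the patch) and a 32-bit unsigned long as on i386.  The helper names are hypothetical; the point is only that storing base and size in pages keeps a 36-bit address within 32 bits.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4KB pages, as on i386 */

/* What a 32-bit variable keeps of a 36-bit physical address:
 * bits 32-35 are silently dropped. */
uint32_t truncate_to_32(uint64_t phys)
{
    return (uint32_t)phys;
}

/* Page-granular form: a 36-bit address shifted right by PAGE_SHIFT
 * needs only 24 bits, so nothing is lost. */
uint32_t to_pages(uint64_t phys)
{
    return (uint32_t)(phys >> PAGE_SHIFT);
}

uint64_t from_pages(uint32_t pages)
{
    return (uint64_t)pages << PAGE_SHIFT;
}
```

With a base of 8GB (0x200000000), the truncated value is 0, indistinguishable from a range starting at address zero, whereas the page-granular value 0x200000 converts back to the original address.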

So far, this probably hasn't bitten too many people, since relatively
few Linux x86 users have >=4GB memory, and /proc/mtrr hasn't usually
been altered without explicit intervention.  But with XFree86-4
finally "out there" and more kernel drivers using MTRRs, this can only
get worse.

(Whether Tigran's performance problems are actually down to the mtrr.c
issue, I don't know.  It's not worth hypothesizing until we have
accurate /proc/mtrr output.)

When I checked the 2.2 version of my patch, it didn't involve a
significant increase in code size.

> and partly because I've
> recently moved house and (STILL!) don't have IP access at home (not
> even dialup) so I can't really look at stuff yet 

Ok.  I'll wait for feedback from Tigran, and if I don't get anything
negative I'll submit to Linus.  The 2.2 version of my patch fixes
problems for other people (VA Linux have included it in their kernel
for a while with no problems that have been reported back to me), and
it's silly that it isn't in 2.4testX.  I should have addressed this a
while ago, but I have my own distractions from kernel hacking.

Later on, you can send a mtrr.c maintenance patch, if you like.

I've just caught up on this whole thread, and I don't have any
objections in principle to Zoltan's patch being used instead of mine,
though I'd like to take a look at it first.

Regards,
David



Re: IRQ affinity vs. MTRRs, was Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread David Wragg

Boszormenyi Zoltan <[EMAIL PROTECTED]> writes:
> I came up with an idea. The MTRRs are per-cpu things.
> Ingo Molnar's IRQ affinity code helps binding certain
> IRQ sources to certain CPUs.

They are implemented as per-cpu things but the Intel manuals say that
all cpus should have the same MTRR settings.  They also give
pseudo-code for how to update them on an SMP system, which mtrr.c
follows.

If the BIOS has set them up differently at boot time, mtrr.c will
complain and copy the MTRR settings of CPU0 to the others.

Regards,
David Wragg




Re: IRQ affinity vs. MTRRs, was Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread David Wragg

Boszormenyi Zoltan <[EMAIL PROTECTED]> writes:
> The idea is that when it is sure that _only one_ (or some) CPU will access
> a PCI card's mmio area then only that CPU's (those CPUs') MTRRs needs to
> contain an entry for that area.
>
> Although there are (must be) common MTRR entries for the main memory
> and the commonly accessed mmio register areas.
> 
> The idea came because fiddling with MTRRs quickly revealed that
> only 8 variable ones exist.

I see.  I think there is a more straightforward solution: PAT does the
same thing as MTRRs, but has no such "number of ranges" limitation ---
it lets you set the memory type on a page-by-page basis.  If the
number of MTRRs becomes a problem (anyone know how many the P4 has?),
then the real solution is to implement PAT support.

IIRC, only the PPro, the first PII model (Klamath?), and the first
Celeron model have MTRR but not PAT (Athlon has PAT, but /proc/cpuinfo
misreports it as "fcmov", at least in 2.2.14; Xeons always had PAT).


David



Re: quicksort for linked list

2001-03-10 Thread David Wragg

[EMAIL PROTECTED] (Rogier Wolff) writes:
> Quicksort however is an algorithm that is recursive. This means that
> it can use unbounded amounts of stack -> This is not for the kernel.

The implementation of Quicksort for arrays demands a recursive
implementation, but for doubly-linked lists there is a trick that
leads to an iterative implementation.  You can implement Quicksort
recursively for singly linked lists, so in a doubly-linked list you
have a spare link in each node while you are doing the sort.  You can
hide the stack in those links, so the implementation doesn't need to
be explicitly recursive.  At the end of the sort, the "next" links are
correct, so you have to go through and fix up the "prev" links.

> Quicksort however is an algorithm that is good for large numbers of
> elements to be sorted: the overhead of a small set of items to sort is
> very large. Is the "normal" case indeed "large sets"?

Good implementations of Quicksort actually give up on Quicksort when
the list is short, and use an algorithm that is faster for that case
(measurements are required to find out where the boundary between a
short list and a long list lies).  If the full list to be sorted is
short, Quicksort will never be involved.  If that happens to be the
common case, then fine.

> Quicksort has a very bad "worst case": quadratic sort-time. Are you
> sure this won't happen?

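Introsort avoids this by modifying quicksort to fall back to a sort
with a guaranteed O(n log n) bound when the recursion gets too deep
(heapsort in the original array version; for linked lists, mergesort
is the natural choice).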

For modern machines, I'm not sure that quicksort on a linked list is
typically much cheaper than mergesort on a linked list.  The majority
of the potential cost is likely to be in the pointer chasing involved
in bringing the lists into cache, and that will be the same for both.
Once the list is in cache, how much pointer fiddling you do isn't so
important.  For lists that don't fit into cache, the advantages of
mergesort should become even greater if the literature on tape and
disk sorts applies (though multiway merges rather than simple binary
merges would be needed to minimize the impact of memory latency).

Given this, mergesort might be generally preferable to quicksort for
linked lists.  But I haven't investigated this idea thoroughly.  (The
trick described above for avoiding an explicit stack also works for
mergesort.)
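
As a rough illustration of a stack-free list sort, here is a bottom-up mergesort for a doubly-linked list.  Note that this is a different technique from the spare-link trick described above: it keeps a small fixed array of pending runs instead of hiding a stack in the "prev" links.  It does share the last step, sorting through "next" only and rebuilding "prev" at the end.  The struct layout and function names are assumptions made for the sketch.

```c
#include <stddef.h>

struct node {
    int key;
    struct node *next, *prev;
};

/* Merge two sorted, NULL-terminated runs linked through "next".
 * On ties the node from a comes first, which keeps the sort stable. */
struct node *merge_runs(struct node *a, struct node *b)
{
    struct node head, *tail = &head;

    while (a && b) {
        if (b->key < a->key) {
            tail->next = b;
            b = b->next;
        } else {
            tail->next = a;
            a = a->next;
        }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* Sort a NULL-terminated list.  pending[i] holds a sorted run of 2^i
 * nodes (more in the last slot), like the digits of a binary counter,
 * so the extra space is bounded and nothing recurses. */
struct node *dlist_merge_sort(struct node *list)
{
    struct node *pending[32] = { NULL };
    struct node *result = NULL, *prev = NULL, *n;
    int i;

    while (list) {
        struct node *run = list;

        list = list->next;
        run->next = NULL;
        /* Carry: merge equal-sized runs upwards. */
        for (i = 0; i < 31 && pending[i]; i++) {
            run = merge_runs(pending[i], run);
            pending[i] = NULL;
        }
        if (pending[i])         /* only possible in the last slot */
            run = merge_runs(pending[i], run);
        pending[i] = run;
    }

    /* Higher slots hold earlier nodes, so they go first in the merge. */
    for (i = 0; i < 32; i++)
        if (pending[i])
            result = merge_runs(pending[i], result);

    /* The "next" links are now correct; fix up the "prev" links. */
    for (n = result; n; n = n->next) {
        n->prev = prev;
        prev = n;
    }
    return result;
}
```

The pending array costs 32 pointers regardless of list length, so stack use stays bounded.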

> Isn't it easier to do "insertion sort": Keep the lists sorted, and
> insert the item at the right place when you get the new item.

Easier?  Yes.  Slower?  Yes.  Does its being slow matter?  Depends on
the context.


David Wragg
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: another Cyrix/mtrr problem?

2001-03-14 Thread David Wragg

[EMAIL PROTECTED] (Bob_Tracy) writes:
> Unfortunately, when I execute
> 
> echo "base=0xd8000000 size=0x100000 type=write-combining" >| /proc/mtrr
> 
> I get a 2MB region instead of the 1MB region I expected...

Oops, it got broken by the MTRR >32-bit support in 2.4.0-testX.  The
patch below should fix it.
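
The doubling can be seen in a standalone sketch of the size calculation, with sizes counted in 4KB pages as in the >32-bit code.  The two expressions are the ones from the patch; the function names are hypothetical, and the minimum region sizes (4K on ARR0-ARR6, 256K on ARR7) are taken from the comment in mtrr.c.

```c
/* Size in 4KB pages of a Cyrix ARR region, given its 4-bit size code.
 * Code 0 means the region is disabled; code 1 is the smallest region:
 * 1 page (4K) on ARR0-ARR6, 0x40 pages (256K) on ARR7. */
unsigned long arr_size_pages(unsigned int reg, unsigned int shift)
{
    if (!shift)
        return 0;
    return (reg < 7 ? 0x1UL : 0x40UL) << (shift - 1);
}

/* The broken expression: one doubling too many. */
unsigned long arr_size_pages_broken(unsigned int reg, unsigned int shift)
{
    if (!shift)
        return 0;
    return (reg < 7 ? 0x1UL : 0x40UL) << shift;
}
```

Under these expressions, a 1MB region on ARR0-ARR6 has size code 9: the corrected form gives 256 pages (1MB), the broken one 512 pages (2MB).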

Joerg, I think this might well fix your Cyrix mtrr problem also.

Let me know how it goes,
Dave Wragg


diff -uar linux-2.4.2/arch/i386/kernel/mtrr.c linux-2.4.2.cyrix/arch/i386/kernel/mtrr.c
--- linux-2.4.2/arch/i386/kernel/mtrr.c Thu Feb 22 15:24:53 2001
+++ linux-2.4.2.cyrix/arch/i386/kernel/mtrr.c   Wed Mar 14 22:28:02 2001
@@ -538,7 +538,7 @@
  * Note: shift==0xf means 4G, this is unsupported.
  */
 if (shift)
-  *size = (reg < 7 ? 0x1UL : 0x40UL) << shift;
+  *size = (reg < 7 ? 0x1UL : 0x40UL) << (shift - 1);
 else
   *size = 0;
 



cannot mount later cdrom sessions with 2.4.x

2001-03-16 Thread David Wragg

Is multisession CDROM support broken in 2.4.x?

I have an "Enhanced CD" which has a bunch of audio tracks followed by
a data track (is this the same as CD-XA? I can't remember).  Under
2.2, I can mount the iso9660 filesystem on the data track without
trouble, using the session option:

# mount -o session=19 /mnt/cdrom

But under 2.4.2, the mount fails with this in the kernel log:

Session 20 start 230045 type 4
attempt to access beyond end of device
16:00: rw=0, want=460123, limit=61884
isofs_read_super: bread failed, dev=16:00, iso_blknum=230061, block=460122

It looks like the blk_size entry doesn't get updated to reflect the
fact that isofs has issued an ioctl to switch to a later session on
the disc.

The drive I'm using is:
hdc: MATSHITADVD-ROM SR-8174, ATAPI CD/DVD-ROM drive

Dave Wragg
(hoping that the QuickTime movies on the data track use an xanim
supported codec, but not optimistic)



Re: Not MTRR !? was: ISSUE: very slow (factor 100) 4-way 16GByte server, with 2.4.2

2001-03-28 Thread David Wragg

Rik van Riel <[EMAIL PROTECTED]> writes:
> On Wed, 28 Mar 2001, Robert Suetterlin wrote:
> > reg00: base=0xfb000000 (4016MB), size=  16MB: uncachable, count=1
> > reg01: base=0xfc000000 (4032MB), size=  64MB: uncachable, count=1
> > reg02: base=0x00000000 (   0MB), size=8192MB: write-back, count=1
> > reg03: base=0x200000000 (8192MB), size=4096MB: write-back, count=1
> > reg04: base=0x300000000 (12288MB), size=2048MB: write-back, count=1
> > reg05: base=0x380000000 (14336MB), size=1024MB: write-back, count=1
> > reg06: base=0x3c0000000 (15360MB), size= 512MB: write-back, count=1
> > reg07: base=0x3e0000000 (15872MB), size= 256MB: write-back, count=1
>--  +
>   15.75 GB
> 
> It looks like the last 256MB isn't cached (well, it doesn't
> have an MTRR at all) and Linux starts loading programs from
> the top of memory ...

It looks like the BIOS ran out of MTRRs.  I suspect this is one of the
reasons that Intel changed the PPro spec to allow overlapping MTRRs in
some cases, with uncached taking precedence.  The following sequence
of /proc/mtrr commands should give the same uncachable range with all
memory write-back cached:

# cat >/proc/mtrr
disable=2
disable=3
disable=4
disable=5
disable=6
disable=7
base=0 size=0x400000000 type=write-back
base=0x400000000 size=0x4000000 type=write-back
base=0x404000000 size=0x1000000 type=write-back
^D

(I think all those zeros are correct. 0x400000000 == 16GB, 0x4000000
== 64MB, 0x1000000 == 16MB)
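
A quick way to double-check ranges like these: a variable MTRR range needs a power-of-two size and a base that is a multiple of that size, which is why the 80MB above 16GB is split into a 64MB and a 16MB range.  The sketch below encodes that rule; the function name is hypothetical, and the sizes are the ones in the parenthetical above (16GB, 64MB, 16MB).

```c
#include <stdint.h>

#define MB (1ULL << 20)
#define GB (1ULL << 30)

/* Returns non-zero if (base, size) can be described by one variable
 * MTRR: size is a power of two and base is aligned to size. */
int mtrr_range_ok(uint64_t base, uint64_t size)
{
    return size != 0 &&
           (size & (size - 1)) == 0 &&
           (base & (size - 1)) == 0;
}
```

All three ranges pass this check, while a single 80MB range starting at 16GB would not.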

It's probably also worth seeing if a BIOS update is available.


Dave Wragg



Re: Solved with MTRR was: ISSUE: very slow (factor 100) 4-way 16GByte server, with 2.4.2

2001-03-29 Thread David Wragg

Robert Suetterlin <[EMAIL PROTECTED]> writes:
> 2. I was not allowed to do `base=0 size=0x4
> type=write-back`, because of the overlap with the memory range at
> base=0x0fb00. 

/proc/mtrr does allow overlapping regions in some cases, but the
conditions turned out to be stricter than I remembered.  You have to
create the enclosing range first, which makes the facility useless in
this case (perhaps in all potentially useful cases).

> So what I do is only disable 3-7, and then
> base=0x4 size=0x4.

Yes, that solution should be safe.




limit on number of kmapped pages

2001-01-23 Thread David Wragg

While testing some kernel code of mine on a machine with
CONFIG_HIGHMEM enabled, I've run into the limit on the number of pages
that can be kmapped at once.  I was surprised to find it was so low --
only 2MB/4MB of address space for kmap (according to the value of
LAST_PKMAP; vmalloc gets a much more generous 128MB!).

My code allocates a large number of pages (4MB-worth would be typical)
to act as a buffer; interrupt handlers/BHs copy data into this buffer,
then a kernel thread moves filled pages into the page cache and
replaces them with newly allocated pages.  To avoid overhead on
IRQs/BHs, all the pages in the buffer are kmapped.  But with
CONFIG_HIGHMEM if I try to kmap 512 pages or more at once, the kernel
locks up (fork() starts blocking inside kmap(), etc.).
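
The arithmetic behind those numbers, as a sketch.  It assumes 4KB pages and that the 2MB/4MB figures correspond to 512 or 1024 kmap slots; the slot counts are inferred from the sizes above rather than quoted from the source, and the names are hypothetical.

```c
#define PAGE_BYTES 4096UL   /* assumed: 4KB pages on i386 */

/* Address space covered by a given number of kmap slots, in bytes. */
unsigned long kmap_window_bytes(unsigned long slots)
{
    return slots * PAGE_BYTES;
}

/* Pages needed to hold a buffer of the given size, rounded up. */
unsigned long pages_for(unsigned long bytes)
{
    return (bytes + PAGE_BYTES - 1) / PAGE_BYTES;
}
```

On these assumptions a 4MB buffer needs 1024 pages, which is the whole of a 4MB window and twice a 2MB one, before anything else in the kernel has kmapped a page.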

There are ways I could work around this (either by using kmap_atomic,
or by adding another kernel thread that maintains a window of kmapped
pages within the buffer).  But I'd prefer not to have to add a lot of
code specific to the CONFIG_HIGHMEM case.

So why is LAST_PKMAP so low, and what would the consequences of
raising it be?

(I don't think kernel address space is that scarce in the
CONFIG_HIGHMEM case, so I suspect that the main reason is to limit the
amount of searching needed for kmap to find a free slot.  Is this
right?)


David Wragg



Re: limit on number of kmapped pages

2001-01-23 Thread David Wragg

[EMAIL PROTECTED] (Eric W. Biederman) writes:
> Why do you need such a large buffer? 

ext2 doesn't guarantee sustained write bandwidth (in particular,
writing a page to an ext2 file can have a high latency due to reading
the block bitmap synchronously).  To deal with this I need at least a
2MB buffer.

I've modified ext2 slightly to avoid that problem, but I still expect
to need a 512KB buffer (though the usual requirements are much lower).
While that wouldn't hit the kmap limit, it would bring the system
closer to it.

Perhaps further tuning could reduce the buffer needs of my
application, but it is better to have the buffer too big than too
small.

> And why do the pages need to be kmapped? 

They only need to be kmapped while data is being copied into them.

> If you are doing dma there is no such requirement...  And
> unless you are running on something faster than a PCI bus I can't
> imagine why you need a buffer that big. 

Gigabit ethernet.

> My hunch is that it makes
> sense to do the kmap, and the i/o in the bottom_half.  What is wrong
> with that?

Do you mean kmap_atomic?  The comments around kmap don't mention
avoiding it in BHs, but I don't see what prevents kmap -> kmap_high ->
map_new_virtual -> schedule.

> kmap should be quick and fast because it is for transitory mappings.
> It shouldn't be something whose overhead you are trying to avoid.  If
> kmap is that expensive then kmap needs to be fixed, instead of your
> code working around a perceived problem.
> 
> At least that is what it looks like from here.

When adding the kmap/kunmap calls to my code I arranged them so they
would be used as infrequently as possible.  After working on making
the critical paths in my code fast, I didn't want to add operations
that have an uncertain cost into those paths unless there is a good
reason.  Which is why I'm asking how significant the kmap limit is.



David Wragg



Re: ioremap_nocache problem?

2001-01-23 Thread David Wragg

Roman Zippel <[EMAIL PROTECTED]> writes:
> On Tue, 23 Jan 2001, Mark Mokryn wrote:
> > ioremap_nocache does the following:
> > return __ioremap(offset, size, _PAGE_PCD);

You have a point.

It would be nice if ioremap took an argument indicating the desired
memory type -- normal, nocache, write-through, write-combining, etc.
Then it could look in an architecture-specific table to get the
appropriate page flags for that type.

(x86 processors with PAT and IA64 can set write-combining through page
flags.  x86 processors with MTRRs but not PAT would need a more
elaborate implementation for write-combining.)
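
A rough sketch of what such a table could look like, written as
user-space C so it can be compiled on its own.  The enum, the struct
and lookup_mem_type() are hypothetical names, not an existing kernel
API.  The bit values are the x86 PWT and PCD page-table bits.  The
write-through entry assumes the underlying MTRR type is write-back,
and write-combining is marked as not expressible through these two
bits alone:

```c
#include <assert.h>

/* x86 page-table entry cache-control bits (bit 3 and bit 4). */
#define PTE_PWT 0x008UL
#define PTE_PCD 0x010UL

/* Hypothetical memory types a caller could request from ioremap. */
enum mem_type {
	MEM_NORMAL,
	MEM_NOCACHE,
	MEM_WRITE_THROUGH,
	MEM_WRITE_COMBINING,
	MEM_TYPE_COUNT
};

struct mem_type_flags {
	unsigned long flags;	/* page flags to OR into the mapping */
	int needs_more;		/* 1 if PCD/PWT alone cannot express the type */
};

/* Architecture-specific table: one entry per memory type. */
static const struct mem_type_flags x86_mem_types[MEM_TYPE_COUNT] = {
	[MEM_NORMAL]          = { 0, 0 },
	[MEM_NOCACHE]         = { PTE_PCD | PTE_PWT, 0 },
	[MEM_WRITE_THROUGH]   = { PTE_PWT, 0 },
	[MEM_WRITE_COMBINING] = { 0, 1 },	/* needs PAT or an MTRR */
};

static struct mem_type_flags lookup_mem_type(enum mem_type type)
{
	assert((unsigned int)type < MEM_TYPE_COUNT);
	return x86_mem_types[type];
}
```

An ioremap variant taking a memory type would call lookup_mem_type()
and fall back to some other mechanism when needs_more is set.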

> > 
> > However, in drivers/char/mem.c (2.4.0), we see the following:
> > 
> > /* On PPro and successors, PCD alone doesn't always mean 
> > uncached because of interactions with the MTRRs. PCD | PWT
> > means definitely uncached. */ 
> > if (boot_cpu_data.x86 > 3)
> > prot |= _PAGE_PCD | _PAGE_PWT;
> > 
> > Does this mean ioremap_nocache() may not do the job?
> 
> ioremap creates a new mapping that shouldn't interfere with MTRR, whereas
> you can map a MTRR mapped area into userspace. But I'm not sure if it's
> correct that no flag is set for boot_cpu_data.x86 <= 3...

The boot_cpu_data.x86 > 3 test is there because the 386 doesn't have
PWT.


David Wragg



Re: ioremap_nocache problem?

2001-01-23 Thread David Wragg

Timur Tabi <[EMAIL PROTECTED]> writes:
> ** Reply to message from Roman Zippel <[EMAIL PROTECTED]> on
> Tue, 23 Jan 2001 19:12:36 +0100 (MET)
> > ioremap creates a new mapping that shouldn't interfere with MTRR,
> >whereas you can map a MTRR mapped area into userspace. But I'm not
> >sure if it's correct that no flag is set for boot_cpu_data.x86 <=
> >3...
> 
> I was under the impression that the "don't cache" bit that
> ioremap_nocache sets overrides any MTRR.

Nope.  There's a table explaining how page flags and MTRRs interact in
the Intel x86 manual, volume 3 (it's in section 9.5.1 "Precedence of
Cache Controls" in the fairly recent edition I have here).

For example, with PCD set, PWT clear, and the MTRRs saying WC, the
effective memory type is WC.  In addition, there's a note saying this
may change in future models.  So you have to set PCD | PWT if you want
to get uncached in all cases.
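
To make the two cases above concrete, here is a small sketch in
user-space C.  It encodes only the combinations mentioned in this
message and deliberately returns "not covered" for everything else, so
it is no substitute for the table in the manual.  The enum names and
effective_type() are invented for the example:

```c
#include <assert.h>

/* x86 page-table entry cache-control bits (bit 3 and bit 4). */
#define PTE_PWT 0x008UL
#define PTE_PCD 0x010UL

enum mtrr_type { MTRR_UC, MTRR_WC, MTRR_WB };
enum eff_type { EFF_UC, EFF_WC, EFF_NOT_COVERED };

/*
 * Effective memory type for the two combinations discussed above.
 * Anything else is reported as EFF_NOT_COVERED.
 */
static enum eff_type effective_type(unsigned long pte_flags,
				    enum mtrr_type mtrr)
{
	int pcd = (pte_flags & PTE_PCD) != 0;
	int pwt = (pte_flags & PTE_PWT) != 0;

	assert(mtrr == MTRR_UC || mtrr == MTRR_WC || mtrr == MTRR_WB);

	if (pcd && pwt)
		return EFF_UC;	/* PCD | PWT: uncached whatever the MTRR says */
	if (pcd && !pwt && mtrr == MTRR_WC)
		return EFF_WC;	/* the MTRR's write-combining type wins */
	return EFF_NOT_COVERED;
}
```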


David Wragg



Re: limit on number of kmapped pages

2001-01-24 Thread David Wragg

"Benjamin C.R. LaHaise" <[EMAIL PROTECTED]> writes:
> On 24 Jan 2001, David Wragg wrote:
> 
> > [EMAIL PROTECTED] (Eric W. Biederman) writes:
> > > Why do you need such a large buffer? 
> > 
> > ext2 doesn't guarantee sustained write bandwidth (in particular,
> > writing a page to an ext2 file can have a high latency due to reading
> > the block bitmap synchronously).  To deal with this I need at least a
> > 2MB buffer.
> 
> This is the wrong way of going about things -- you should probably insert
> the pages into the page cache and write them into the filesystem via
> writepage. 

I currently use prepare_write/commit_write, but I think writepage
would have the same issue: When ext2 allocates a block, and has to
allocate from a new block group, it may do a synchronous read of the
new block group bitmap.  So before the writepage (or whatever) that
causes this completes, it has to wait for the read to get picked by
the elevator, the seek for the read, etc.  By the time it gets back to
writing normally, I've buffered a couple of MB of data.
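
As a back-of-the-envelope check on that figure, the buffer needed is
just the arrival rate multiplied by the length of the stall.  A minimal
sketch in C, assuming data arrives at gigabit line rate (taken here as
125e6 bytes per second, ignoring protocol overhead).  The function
names are made up for the example:

```c
#include <assert.h>

/*
 * Seconds of stalled writeback that a buffer of `buffer_bytes` can
 * absorb while data keeps arriving at `rate_bytes_per_sec`.
 */
static double stall_covered_seconds(double buffer_bytes,
				    double rate_bytes_per_sec)
{
	assert(rate_bytes_per_sec > 0.0);
	return buffer_bytes / rate_bytes_per_sec;
}

/* Buffer needed to ride out a stall of `stall_seconds` at the given rate. */
static double buffer_needed_bytes(double stall_seconds,
				  double rate_bytes_per_sec)
{
	assert(stall_seconds >= 0.0);
	return stall_seconds * rate_bytes_per_sec;
}
```

Under that assumption, stall_covered_seconds(2.0 * 1024 * 1024, 125e6)
comes to a little under 17ms, which is the order of magnitude of a
synchronous disk read that has to wait its turn and then seek.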

But I do have a workaround for the ext2 issue.

> That way the pages don't need to be mapped while being written
> out.

Point taken, though the kmap needed before prepare_write is much less
significant than the kmap I need to do before copying data into the
page.

> For incoming data from a network socket, make use of the
> data_ready callbacks and copy directly from the skbs in one pass, with a
> kmap of only one page at a time.
>
> Maybe I'm guessing incorrectly at what is being attempted, but kmap should
> be used sparingly and as briefly as possible.

I'm going to see if the one-page-kmapped approach makes a measurable
difference.

I'd still like to know what the basis for the current kmap limit
setting is.


David Wragg



Re: ioremap_nocache problem?

2001-01-24 Thread David Wragg

Timur Tabi <[EMAIL PROTECTED]> writes:
> ** Reply to message from David Wragg <[EMAIL PROTECTED]> on 24 Jan 2001
> 00:50:20 +0000
> > (x86 processors with PAT and IA64 can set write-combining through
> >page flags.  x86 processors with MTRRs but not PAT would need a more
> >elaborate implementation for write-combining.)
> 
> What is PAT?  I desperately need to figure out how to turn on write
> combining on a per-page level.  I thought I had to use MTRRs, but now
> you're saying I can use this "PAT" thing instead.  Please explain!

PAT is basically the MTRR memory types on a per-page basis.  It adds a
new flag bit to the x86 page table entry, then that bit together with
the PCD and PWT bits is used to do a look-up in an 8-entry table that
gives the effective memory type (the table is set through an MSR).
All the details are in the Intel x86 manual, volume 3
<http://developer.intel.com/design/pentium4/manuals/> (at the end
of chapter 9).
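
The look-up can be sketched in a few lines of user-space C.  The PTE
bit positions (PWT bit 3, PCD bit 4, PAT bit 7 in a 4KB page-table
entry), the memory type encodings and the power-on default value of
the PAT MSR are as I recall them from the manual, so check them there
before relying on them.  The function names are invented for the
example, and nothing here touches a real MSR:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PWT 0x008UL
#define PTE_PCD 0x010UL
#define PTE_PAT 0x080UL		/* PAT bit in a 4KB page-table entry */

/* Memory type encodings used in the PAT entries. */
enum pat_type {
	PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
	PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Power-on default: entries 0-3 are WB, WT, UC-, UC, repeated in 4-7. */
#define PAT_MSR_DEFAULT 0x0007040600070406ULL

/* Index into the 8-entry table: PAT is bit 2, PCD bit 1, PWT bit 0. */
static unsigned int pat_index(unsigned long pte_flags)
{
	return ((pte_flags & PTE_PAT) ? 4u : 0u) |
	       ((pte_flags & PTE_PCD) ? 2u : 0u) |
	       ((pte_flags & PTE_PWT) ? 1u : 0u);
}

/* Memory type stored in entry `index` of a PAT MSR value. */
static unsigned int pat_entry(uint64_t msr, unsigned int index)
{
	assert(index < 8);
	return (unsigned int)((msr >> (8 * index)) & 0x7);
}

/* Return a copy of `msr` with entry `index` set to `type`. */
static uint64_t pat_set_entry(uint64_t msr, unsigned int index,
			      unsigned int type)
{
	assert(index < 8 && type < 8);
	msr &= ~(0xffULL << (8 * index));
	return msr | ((uint64_t)type << (8 * index));
}
```

For example, pat_set_entry(PAT_MSR_DEFAULT, 4, PAT_WC) gives a table
in which a page with only the PAT bit set is write-combining, while
the four entries selected with the PAT bit clear keep their default
meaning.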

Quite a lot of the x86 CPUs out there support PAT: The PII except the
first couple of models, the Celeron except the first model, the PIII,
all PII and PIII Xeons, the P4, all AMD K7 models.  I'm guessing, but
I suspect that the majority of x86 CPUs supporting write combining in
any form that have been made also support PAT.

I wish Intel had put PAT in the PPro, rather than messing everyone
around with MTRRs (MTRRs are good for BIOS writers, but a pain for
everyone else).


David Wragg



Re: limit on number of kmapped pages

2001-01-25 Thread David Wragg

"Stephen C. Tweedie" <[EMAIL PROTECTED]> writes:
> On Wed, Jan 24, 2001 at 12:35:12AM +0000, David Wragg wrote:
> > 
> > > And why do the pages need to be kmapped? 
> > 
> > They only need to be kmapped while data is being copied into them.
> 
> But you only need to kmap one page at a time during the copy.  There
> is absolutely no need to copy the whole chunk at once.

The chunks I'm copying are always smaller than a page.  Usually they
are a few hundred bytes.

Though because I'm copying into the pages in a bottom half, I'll have
to use kmap_atomic.  After a page is filled, it is put into the page
cache.  So they have to be allocated with page_cache_alloc(), hence
__GFP_HIGHMEM and the reason I'm bothering with kmap at all.
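
The shape of the one-page-at-a-time copy can be shown with a
user-space simulation.  This is only a sketch: sketch_map() and
sketch_unmap() are stand-ins for kmap_atomic()/kunmap_atomic() that
merely count how many pages are mapped at once, and struct sketch_page
is a plain array rather than a real struct page:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	size_t used;		/* bytes already filled */
};

static int mapped_now;		/* pages currently "mapped" */
static int mapped_peak;		/* highest value mapped_now has reached */

/* Stand-in for kmap_atomic(): returns the page's data, counts the mapping. */
static unsigned char *sketch_map(struct sketch_page *p)
{
	if (++mapped_now > mapped_peak)
		mapped_peak = mapped_now;
	return p->data;
}

/* Stand-in for kunmap_atomic(). */
static void sketch_unmap(struct sketch_page *p)
{
	(void)p;
	assert(mapped_now > 0);
	mapped_now--;
}

/*
 * Append one chunk to the buffer, mapping only the page being written.
 * Returns the number of bytes copied, which is less than `len` only if
 * the buffer runs out of pages.
 */
static size_t sketch_append(struct sketch_page *pages, size_t npages,
			    size_t *cur, const void *chunk, size_t len)
{
	const unsigned char *src = chunk;
	size_t copied = 0;

	while (copied < len && *cur < npages) {
		struct sketch_page *p = &pages[*cur];
		size_t room = SKETCH_PAGE_SIZE - p->used;
		size_t n = len - copied < room ? len - copied : room;
		unsigned char *dst = sketch_map(p);

		memcpy(dst + p->used, src + copied, n);
		sketch_unmap(p);

		p->used += n;
		copied += n;
		if (p->used == SKETCH_PAGE_SIZE)
			(*cur)++;	/* page full: move on to the next one */
	}
	return copied;
}
```

However many chunks are appended, mapped_peak never exceeds 1, so the
number of pages in the buffer no longer has any bearing on the kmap
limit.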


David Wragg



Re: [PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-20 Thread David Wragg
Arjan van de Ven <[EMAIL PROTECTED]> writes:
> if all you care is the number of context switches, you can use the
> following system tap script as well:
>
> http://www.fenrus.org/cstop.stp

Thanks, something similar to that might well have solved my original
problem.  

(When I try the script, stap complains about the lack of the kernel
debuginfo package, which of course doesn't exist for my self-built
kernel.  After hunting around on the web for 10 minutes, I'm still no
closer to resolving this.  But I look forward to playing with
systemtap once I get past that problem.)

Nonetheless, while systemtap might provide an objection to adding
per-task context switch counters to the kernel, it doesn't answer the
question: since we do have these counters, why not expose them in the
normal way?


David


Re: [PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-20 Thread David Wragg
Arjan van de Ven <[EMAIL PROTECTED]> writes:
> On Wed, 2006-12-20 at 14:38 +0000, David Wragg wrote:
>> (When I try the script, stap complains about the lack of the kernel
>> debuginfo package, which of course doesn't exist for my self-built
>> kernel.  After hunting around on the web for 10 minutes, I'm still no
>> closer to resolving this.  But I look forward to playing with
>> systemtap once I get past that problem.)
>
> what worked for me is copying the "vmlinux" file to /boot as
> /boot/vmlinux-`uname -r`

Thanks, that's got it working.


Re: [PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-23 Thread David Wragg
"Albert Cahalan" <[EMAIL PROTECTED]> writes:
> The cumulative ones are still not justified though, and I fear they
> may be 64-bit even on i386.

All the context switch counts are unsigned long.

> It turns out that an i386 procps spends
> much of its time doing 64-bit division to parse the damn ASCII crap.
> I suppose I could just skip those fields, but generating them isn't
> too cheap and probably I'd get stuck parsing them for some other
> reason -- having them separate is probably a good idea.

I can't think of a compelling justification for the cumulative context
switch counts.  But I suggest that if the cost of exposing these
values is low enough, they should be exposed anyway, just for the sake
of uniformity (these would be the only two getrusage values not
present in /proc/pid/stat).

If the decimal representation of values in /proc/pid/stat has such
unpleasant overheads, then I wonder if that is something worth fixing,
whether the context switch counts are added or not?  It occurs to me
that it would be easy to add support for a hex version of
/proc/pid/stat with very little additional code, by using an alternate
sprintf format string in fs/proc/array.c:do_task_stat().  I assume
that procps could be adapted quite easily to take advantage of this?
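
A user-space sketch of the idea, to show how little is involved on
either side.  format_counters() stands in for the sprintf call and
simply selects between a decimal and a hex format; parse_hex_field()
shows that the reader can accumulate hex digits with shifts alone.
Both names are invented for the example, and the three counters are
arbitrary placeholders rather than the real field list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format three counters in decimal or hex; returns snprintf's result. */
static int format_counters(char *buf, size_t len, int hex,
			   unsigned long a, unsigned long b, unsigned long c)
{
	assert(buf != NULL && len > 0);
	if (hex)
		return snprintf(buf, len, "%lx %lx %lx\n", a, b, c);
	return snprintf(buf, len, "%lu %lu %lu\n", a, b, c);
}

/*
 * Parse one lower-case hex field starting at *s, skipping leading
 * spaces, and advance *s past it.  Only shifts and ORs are needed.
 */
static unsigned long parse_hex_field(const char **s)
{
	const char *p = *s;
	unsigned long v = 0;

	while (*p == ' ')
		p++;
	for (;; p++) {
		unsigned int d;

		if (*p >= '0' && *p <= '9')
			d = (unsigned int)(*p - '0');
		else if (*p >= 'a' && *p <= 'f')
			d = (unsigned int)(*p - 'a') + 10;
		else
			break;
		v = (v << 4) | d;
	}
	*s = p;
	return v;
}
```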


David


[PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-18 Thread David Wragg
The kernel already maintains context switch counts for each task, and
exposes them through getrusage(2).  These counters can also be used
more generally to track which processes on the system are active
(i.e. getting scheduled to run), but getrusage is too constrained to
be used in that way.

This patch (against 2.6.19/2.6.19.1) adds the four context switch
values (voluntary context switches, involuntary context switches, and
the same values accumulated from terminated child processes) to the
end of /proc/*/stat, similarly to min_flt, maj_flt and the time used
values.

Signed-off-by: David Wragg <[EMAIL PROTECTED]>

diff -uprN --exclude='*.o' --exclude='*~' --exclude='.*' linux-2.6.19.1/fs/proc/array.c linux-2.6.19.1.build/fs/proc/array.c
--- linux-2.6.19.1/fs/proc/array.c	2006-12-18 14:35:36.000000000 +0000
+++ linux-2.6.19.1.build/fs/proc/array.c	2006-12-18 14:43:21.000000000 +0000
@@ -327,6 +327,8 @@ static int do_task_stat(struct task_stru
 	unsigned long cmin_flt = 0, cmaj_flt = 0;
 	unsigned long  min_flt = 0,  maj_flt = 0;
 	cputime_t cutime, cstime, utime, stime;
+	unsigned long cnvcsw = 0, cnivcsw = 0;
+	unsigned long  nvcsw = 0,  nivcsw = 0;
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
@@ -369,6 +371,8 @@ static int do_task_stat(struct task_stru
 		cmaj_flt = sig->cmaj_flt;
 		cutime = sig->cutime;
 		cstime = sig->cstime;
+		cnvcsw = sig->cnvcsw;
+		cnivcsw = sig->cnivcsw;
 		rsslim = sig->rlim[RLIMIT_RSS].rlim_cur;
 
 		/* add up live thread stats at the group level */
@@ -379,6 +383,8 @@ static int do_task_stat(struct task_stru
 				maj_flt += t->maj_flt;
 				utime = cputime_add(utime, t->utime);
 				stime = cputime_add(stime, t->stime);
+				nvcsw += t->nvcsw;
+				nivcsw += t->nivcsw;
 				t = next_thread(t);
 			} while (t != task);
 
@@ -386,6 +392,8 @@ static int do_task_stat(struct task_stru
 			maj_flt += sig->maj_flt;
 			utime = cputime_add(utime, sig->utime);
 			stime = cputime_add(stime, sig->stime);
+			nvcsw += sig->nvcsw;
+			nivcsw += sig->nivcsw;
 		}
 
 		sid = sig->session;
@@ -404,6 +412,8 @@ static int do_task_stat(struct task_stru
 		maj_flt = task->maj_flt;
 		utime = task->utime;
 		stime = task->stime;
+		nvcsw = task->nvcsw;
+		nivcsw = task->nivcsw;
 	}
 
 	/* scale priority and nice values from timeslices to -20..20 */
@@ -420,7 +430,7 @@ static int do_task_stat(struct task_stru
 
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu %lu %lu %lu %lu\n",
 		task->pid,
 		tcomm,
 		state,
@@ -465,7 +475,12 @@ static int do_task_stat(struct task_stru
 		task_cpu(task),
 		task->rt_priority,
 		task->policy,
-		(unsigned long long)delayacct_blkio_ticks(task));
+		(unsigned long long)delayacct_blkio_ticks(task),
+		nvcsw,
+		cnvcsw,
+		nivcsw,
+		cnivcsw);
+
 	if(mm)
 		mmput(mm);
 	return res;




Re: [PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-19 Thread David Wragg
Benjamin LaHaise <[EMAIL PROTECTED]> writes:
> On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
>> This patch (against 2.6.19/2.6.19.1) adds the four context switch
>> values (voluntary context switches, involuntary context switches, and
>> the same values accumulated from terminated child processes) to the
>> end of /proc/*/stat, similarly to min_flt, maj_flt and the time used
>> values.
>
> Please put these into new files, as the stat files in /proc are 
> horribly overloaded and have always been somewhat problematic 
> when it comes to changing how things are reported due to internal 
> changes to the kernel.  Cheers,

The delay accounting value was added to the end of /proc/pid/stat back
in July without discussion, so I assumed this approach was still
considered satisfactory.

Putting just these four values into a new file would seem a little
odd, since they have a lot in common with the other getrusage values
that are already in /proc/pid/stat.  One possibility is to add
/proc/pid/rusage, mirroring the full struct rusage in text form, since
struct rusage is already part of the kernel ABI (though Linux doesn't
fill in half of the values).
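
To illustrate what a text rendering of struct rusage might look like,
here is a small user-space sketch built on the existing getrusage(2)
call.  The choice of fields and their order are arbitrary, and
format_rusage() and format_self_rusage() are made-up names.  Note that
getrusage only reports on the calling process or its children, which
is the constraint a /proc file would lift:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * Render a few struct rusage fields as one line of text:
 * utime stime minflt majflt nvcsw nivcsw
 */
static int format_rusage(char *buf, size_t len, const struct rusage *ru)
{
	assert(buf != NULL && ru != NULL);
	return snprintf(buf, len, "%ld.%06ld %ld.%06ld %ld %ld %ld %ld\n",
			(long)ru->ru_utime.tv_sec, (long)ru->ru_utime.tv_usec,
			(long)ru->ru_stime.tv_sec, (long)ru->ru_stime.tv_usec,
			(long)ru->ru_minflt, (long)ru->ru_majflt,
			(long)ru->ru_nvcsw, (long)ru->ru_nivcsw);
}

/* Format the calling process's own usage; returns -1 if getrusage fails. */
static int format_self_rusage(char *buf, size_t len)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return format_rusage(buf, len, &ru);
}
```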

Or perhaps it makes sense to reorganize all the values from
/proc/pid/stat and its siblings into a sysfs-like one-value-per-file
structure, though that might introduce atomicity and efficiency issues
(calculating some of the values involves iterating over the threads in
the process; with everything in one file, these loops are folded
together).

Any thoughts?


David


Re: [PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-20 Thread David Wragg
"Albert Cahalan" <[EMAIL PROTECTED]> writes:
> On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
>> This patch (against 2.6.19/2.6.19.1) adds the four context
>> switch values (voluntary context switches, involuntary
>> context switches, and the same values accumulated from
>> terminated child processes) to the end of /proc/*/stat,
>> similarly to min_flt, maj_flt and the time used values.
>
> Hmmm, OK, do people have a use for these values?

My reason for writing the patch was to track which processes are
active (i.e. got scheduled to run) by polling these context switch
values.  The time used values are not a reliable way to detect process
activity on fast machines.  So for example, when sorting by %CPU, top
often shows many processes using 0% CPU, despite the fact that these
processes are running occasionally.  If top sorted by (%CPU, context
switch count delta), it might give a more useful display of which
processes are active on the system.
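
A sketch of the polling side in user-space C, assuming a kernel with
the patch applied so that the last four fields of a stat line are
nvcsw, cnvcsw, nivcsw and cnivcsw in that order.  On an unpatched
kernel the same code would silently pick up unrelated fields.
parse_csw() and csw_delta() are made-up names.  Parsing from the end
of the line sidesteps the problem of spaces inside the command name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct csw_counts {
	unsigned long nvcsw, cnvcsw, nivcsw, cnivcsw;
};

/*
 * Parse the four trailing fields of a stat line.
 * Returns 0 on success, -1 if the line has fewer than four fields or
 * the fields are not numbers.
 */
static int parse_csw(const char *stat_line, struct csw_counts *out)
{
	const char *p = stat_line + strlen(stat_line);
	int fields = 0;

	assert(out != NULL);

	/* Skip trailing newline and spaces. */
	while (p > stat_line && (p[-1] == '\n' || p[-1] == ' '))
		p--;

	/* Walk backwards over four space-separated tokens. */
	while (p > stat_line && fields < 4) {
		while (p > stat_line && p[-1] != ' ')
			p--;
		fields++;
		if (fields < 4) {
			while (p > stat_line && p[-1] == ' ')
				p--;
		}
	}
	if (fields != 4)
		return -1;
	if (sscanf(p, "%lu %lu %lu %lu", &out->nvcsw, &out->cnvcsw,
		   &out->nivcsw, &out->cnivcsw) != 4)
		return -1;
	return 0;
}

/* Context switches (voluntary plus involuntary) between two samples. */
static unsigned long csw_delta(const struct csw_counts *prev,
			       const struct csw_counts *cur)
{
	return (cur->nvcsw - prev->nvcsw) + (cur->nivcsw - prev->nivcsw);
}
```

A monitoring tool would read the stat line twice, some interval apart,
and treat a non-zero csw_delta() as a sign that the process ran in
between, even if its CPU time did not visibly change.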

More generally, it seems perverse to track these context switch values
but only expose them through the constrained getrusage interface.  If
they are worth having, why aren't they worth exposing in the same way
as all other process info?

> [...]
>> Putting just these four values into a new file would seem a little
>> odd, since they have a lot in common with the other getrusage values
>> that are already in /proc/pid/stat.  One possibility is to add
>> /proc/pid/rusage, mirroring the full struct rusage in text form, since
>> struct rusage is already part of the kernel ABI (though Linux doesn't
>> fill in half of the values).
>
> Since we already have a struct defined and all...
>
> sys_get_rusage(int pid)

That would be a much more useful system call than getrusage.  But why
have two ways of retrieving process info, /proc and a sys_get_rusage,
exposing differing subsets of process information?


David