Re: [patch] oom: kill all threads that share mm with killed task

2007-04-24 Thread David Rientjes
On Mon, 23 Apr 2007, Christoph Lameter wrote:

 Obvious fix. It was broken by
  
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584
 Dec 7. So it's in 2.6.20 and later. Candidate for stable?
 

I agree it's obvious enough that it should be included in stable.  
Otherwise the entire iteration becomes a big no-op and it won't alleviate 
the OOM condition in one call to out_of_memory() because there may be 
outstanding tasks with the shared ->mm.
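
For reference, the loop being fixed amounts to the following (a sketch of
the 2.6.20-era oom_kill.c iteration under discussion, not the exact
committed hunk; p is the chosen victim):

	/*
	 * Kill every other thread/process that shares the victim's mm;
	 * without the ->mm comparison intact the loop matches nothing,
	 * so a single out_of_memory() call frees no memory.
	 */
	do_each_thread(g, q) {
		if (q->mm == p->mm && q->tgid != p->tgid)
			force_sig(SIGKILL, q);
	} while_each_thread(g, q);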

David


Re: [PATCH] Transparently handle .symbol lookup for kprobes

2007-04-24 Thread Paul Mackerras
Srinivasa Ds writes:

 + } else {\
 + char dot_name[KSYM_NAME_LEN+1]; \
 + dot_name[0] = '.';  \
 + dot_name[1] = '\0'; \
 + strncat(dot_name, name, KSYM_NAME_LEN); \

Assuming the kernel strncat works like the userspace one does, there
is a possibility that dot_name[] won't be properly null-terminated
here.  If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set
dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch
dot_name[KSYM_NAME_LEN].
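
Whichever way a given libc behaves at the boundary, the portable rule is
that strncat()'s count does not reserve room for the terminator. A small
userspace sketch (symbol name and length are illustrative only):

	#include <stdio.h>
	#include <string.h>

	#define KSYM_NAME_LEN 128		/* assumed value, illustration only */

	int main(void)
	{
		char dot_name[KSYM_NAME_LEN + 1] = ".";
		const char *name = "example_symbol";	/* made-up name */

		/* ISO C strncat() appends at most n bytes of src and then a
		 * terminating '\0', so dest must have strlen(dest) + n + 1
		 * bytes.  With the '.' already in place, a safe count is
		 * sizeof(dot_name) - strlen(dot_name) - 1, not KSYM_NAME_LEN. */
		strncat(dot_name, name, sizeof(dot_name) - strlen(dot_name) - 1);
		printf("%s\n", dot_name);
		return 0;
	}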

Paul.


Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Paul Mackerras
Christoph Hellwig writes:

 The first question is obviously, is this really something we want?
 spawning kernel threads on demand without reaping them properly seems
 quite dangerous.

What specifically has to be done to reap a kernel thread?  Are you
concerned about the number of threads, or about having zombies hanging
around?

Paul.


SOME STUFF ABOUT REISER4

2007-04-24 Thread lkml777
On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper
[EMAIL PROTECTED] said:

 I know that this whole effort has been put in disarray by the
 prosecution of Hans Reiser, but I'm curious as to its status. Is
 Reiser4 going to be going into the Linus kernel anytime soon? Is there
 somewhere I should be looking to find this out without wasting bandwidth
 here?

There was a thread the other day that talked about Reiser4.

It took a while, but I have found it (actually two):

http://lkml.org/lkml/2007/4/5/360
http://lkml.org/lkml/2007/4/9/4

You may want to check them out.
-- 
  
  [EMAIL PROTECTED]

-- 
http://www.fastmail.fm - Access your email from home and the web



Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes

2007-04-24 Thread Jeremy Fitzhardinge
Roland McGrath wrote:
 I have to admit I still don't really understand all this.  Is it
 documented somewhere?
 

 I have explained it in public more than once, but I don't know off hand
 anywhere that was helpfully recorded.
   

Thanks very much.  I'd been poking about, but the closest I came to an
actual description was various patches fixing bugs, so it was a little
incomplete.

 For example, a Xen-enabled kernel can use a single vDSO image (or a single
 pair of int80/sysenter images), containing the nosegneg hwcap note.  When
 there is no need for it (native or hvm or 64-bit hv or whatever), it just
 clears the mask word.  If you actually do this, you'll want to modify the
 NOTE_KERNELCAP_BEGIN macro to define a global label you can use with VDSO_SYM.
   

Thanks for the pointer.  I'd been getting a bit of heat for enabling the
nosegneg flag unconditionally.  If I can make it Xen-specific then that
will be one less source of complaints.
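
For readers following along, the scheme Roland describes would look
roughly like this (a sketch; every name below is hypothetical, the real
note is emitted by the vsyscall note assembly):

	/* The vDSO carries a GNU hwcap note whose payload is a mask word
	 * plus the string "nosegneg"; glibc consults the mask word when
	 * picking library variants.  If the note macro is extended to
	 * emit a global label for that word, the kernel can clear it at
	 * boot whenever the capability isn't needed: */
	extern u32 vdso_nosegneg_mask;		/* hypothetical label */

	if (!running_on_xen)			/* hypothetical condition */
		vdso_nosegneg_mask = 0;		/* note stays, cap masked off */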

J


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Peter Williams

Arjan van de Ven wrote:
Within reason, it's not the number of clients that X has that causes its 
CPU bandwidth use to skyrocket and cause problems.  It's more to do 
with what type of clients they are.  Most GUIs (even ones that are 
constantly updating visual data (e.g. gkrellm -- I can open quite a 
large number of these without increasing X's CPU usage very much)) cause 
very little load on the X server.  The exceptions to this are the 



there are actually 2 X server cases, not just 1, and they are VERY VERY
different in behavior.

Case 1: Accelerated driver

If X talks to a decent enough card that it supports well with acceleration,
it will be very rare for X itself to spend any kind of significant
amount of CPU time, all the really heavy stuff is done in hardware, and
asynchronously at that. A bit of batching will greatly improve system
performance in this case.

Case 2: Unaccelerated VESA

Some drivers in X, especially the VESA and NV drivers (which are quite
common, vesa is used on all hardware without a special driver nowadays),
have no or not enough acceleration to matter for modern desktops. This
means the CPU is doing all the heavy lifting, in the X program. In this
case even a simple 'move the window a bit' becomes quite a bit of a CPU 
hog already.


Mine's a:

SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according 
to X's display settings tool.  Which category does that fall into?


It's not a special adapter and is just the one that came with the 
motherboard. It doesn't use much CPU unless I grab a window and wiggle 
it all over the screen or do something like ls -lR / in an xterm.




The cases are fundamentally different in behavior, because in the first
case, X hardly consumes the time it would get in any scheme, while in
the second case X really is CPU bound and will happily consume any CPU
time it can get.


Which still doesn't justify an elaborate points sharing scheme. 
Whichever way you look at it, that's just another way of giving X more 
CPU bandwidth, and there are simpler ways to give X more CPU if it needs 
it.  However, I think there's something seriously wrong if it needs the 
-19 nice that I've heard mentioned.  You might as well just run it as a 
real time process.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


NonExecutable Bit in 32Bit

2007-04-24 Thread Cestonaro, Thilo (external)
Hey,

Is it right that the NX bit is not used under the i386 arch but is used under the x86_64 arch?
If yes, is there a specific reason for it not being used?

Ciao Thilo


Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.

2007-04-24 Thread Vivek Goyal
On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote:
 
 Currently because vmlinux does not reflect that the kernel is relocatable
 we still have to support CONFIG_PHYSICAL_START.  So this patch adds a small
 c program to do what we cannot do with a linker script, set the elf header
 type to ET_DYN.
 
 This should remove the last obstacle to removing CONFIG_PHYSICAL_START
 on x86_64.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

[Dropping fastboot mailing list from CC as kexec mailing list is new list
 for this discussion]

[..]
 +void file_open(const char *name)
 +{
 + if ((fd = open(name, O_RDWR, 0)) < 0)
 + 	die("Unable to open `%s': %m", name);
 +}
 +
 +static void mketrel(void)
 +{
 + unsigned char e_type[2];
 + if (read(fd, e_ident, sizeof(e_ident)) != sizeof(e_ident))
 + 	die("Cannot read ELF header: %s\n", strerror(errno));
 +
 + if (memcmp(e_ident, ELFMAG, 4) != 0)
 + 	die("No ELF magic\n");
 +
 + if ((e_ident[EI_CLASS] != ELFCLASS64) &&
 +     (e_ident[EI_CLASS] != ELFCLASS32))
 + 	die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]);
 + 
 + if ((e_ident[EI_DATA] != ELFDATA2LSB) &&
 +     (e_ident[EI_DATA] != ELFDATA2MSB))
 + 	die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]);
 +
 + if (e_ident[EI_VERSION] != EV_CURRENT)
 + 	die("Unknown ELF version: %d\n", e_ident[EI_VERSION]);
 +
 + if (e_ident[EI_DATA] == ELFDATA2LSB) {
 + 	e_type[0] = ET_REL & 0xff;
 + 	e_type[1] = ET_REL >> 8;
 + } else {
 + 	e_type[1] = ET_REL & 0xff;
 + 	e_type[0] = ET_REL >> 8;
 + }

Hi Eric,

Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux,
as it does not find it to be of an executable type.

I am not well versed with various conventions but if I go through Executable
and Linking Format document, this is what it says about various file types.

• A relocatable file holds code and data suitable for linking with other
  object files to create an executable or a shared object file.

• An executable file holds a program suitable for execution.

• A shared object file holds code and data suitable for linking in two
  contexts. First, the link editor may process it with other relocatable and
  shared object files to create another object file. Second, the dynamic
  linker combines it with an executable file and other shared objects
  to create a process image.

So the above does not seem to fit the ET_REL type; we can't relink this
vmlinux. And it does not seem to fit the ET_DYN definition either. We are
not relinking this vmlinux with another executable or other relocatable
files.

I remember once you mentioned the term dynamic executable, meaning one
that can be loaded at a non-compiled address and run without requiring any
relocation processing. This vmlinux falls into that category, but I can't
relate it to the standard ELF file definitions.

Thanks
Vivek


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Peter Williams [EMAIL PROTECTED] wrote:

  The cases are fundamentally different in behavior, because in the 
  first case, X hardly consumes the time it would get in any scheme, 
  while in the second case X really is CPU bound and will happily 
  consume any CPU time it can get.
 
 Which still doesn't justify an elaborate points sharing scheme. 
 Whichever way you look at it, that's just another way of giving X 
 more CPU bandwidth and there are simpler ways to give X more CPU if it 
 needs it.  However, I think there's something seriously wrong if it 
 needs the -19 nice that I've heard mentioned.

Gene has done some testing under CFS with X reniced to +10 and the 
desktop still worked smoothly for him. So CFS does not 'need' a reniced 
X. There are simply advantages to negative nice levels: for example 
screen refreshes are smoother on any scheduler i tried. BUT, there is a 
caveat: on non-CFS schedulers i tried X is much more prone to get into 
'overscheduling' scenarios that visibly hurt X's performance, while on 
CFS there's a max of 1000-1500 context switches a second at nice -10. 
(which, considering the cost of a context switch is well under 1% 
overhead.)

So, my point is, the nice level of X for desktop users should not be set 
lower than a low limit suggested by that particular scheduler's author. 
That limit is scheduler-specific. Con i think recommends a nice level of 
-1 for X when using SD [Con, can you confirm?], while my tests show that 
if you want you can go as low as -10 under CFS, without any bad 
side-effects. (-19 was a bit too much)

 [...]  You might as well just run it as a real time process.

hm, that would be a bad idea under any scheduler (including CFS), 
because real time processes can starve other processes indefinitely.

Ingo


Re: NonExecutable Bit in 32Bit

2007-04-24 Thread William Heimbigner

On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote:


Hey,

is it right, that the NX Bit is not used under i386-Arch but under x86_64-Arch?
When yes, is there a special argument for it not to be used?

Ciao Thilo

I don't think so - some i386 cpus definitely have support for the NX bit.

Would having this be supported in i386 help debugging (and security) 
significantly?
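
One detail worth adding: on 32-bit x86 the NX bit only exists in the PAE
page-table format, because it occupies bit 63 of the PTE and a classic
32-bit PTE simply has no such bit. A sketch (the constant name mirrors
the kernel's; the rest is illustrative):

	/* NX lives in the top bit of a 64-bit PAE/long-mode PTE. */
	#define _PAGE_BIT_NX	63
	#define _PAGE_NX	(1ULL << _PAGE_BIT_NX)

	/* So i386 can only use it with CONFIG_X86_PAE, on CPUs that
	 * report NX in CPUID 0x80000001 EDX bit 20 and allow EFER.NXE
	 * to be set. */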


William Heimbigner
[EMAIL PROTECTED]


Re: [patch v2] Fixes and cleanups for earlyprintk aka boot console.

2007-04-24 Thread Andrew Morton
On Thu, 15 Mar 2007 16:46:39 +0100 Gerd Hoffmann [EMAIL PROTECTED] wrote:

 The console subsystem already has an idea of a boot console, using the
 CON_BOOT flag.  The implementation has some flaws though.  The major
 problem is that presence of a boot console makes register_console()
 ignore any other console devices (unless explicitly specified on the
 kernel command line).
 
 This patch fixes the console selection code to *not* consider a boot
 console a full-featured one, so the first non-boot console registering
 will become the default console instead.  This way the unregister call
 for the boot console in the register_console() function actually
 triggers and the handover from the boot console to the real console
 device works smoothly.  Added a printk for the handover, so you know
 which console device the output goes to when the boot console stops
 printing messages.
 
 The disable_early_printk() call is obsolete with that patch, explicitly
 disabling the early console isn't needed any more as it works
 automagically with that patch.
 
 I've walked through the tree, dropped all disable_early_printk()
 instances found below arch/ and tagged the consoles with CON_BOOT if
 needed.  The code is tested on x86, sh (thanks to Paul) and mips
 (thanks to Ralf).
 
 Changes to last version: Rediffed against -rc3, adapted to mips
 cleanups by Ralf, fixed udbg-immortal cmd line arg on powerpc.
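
For context, a boot console in this scheme is just a struct console
tagged CON_BOOT; a minimal sketch (names illustrative, the write handler
elided):

	static void early_vga_write(struct console *con,
				    const char *s, unsigned n);

	static struct console early_vga_console = {
		.name	= "earlyvga",
		.write	= early_vga_write,
		.flags	= CON_PRINTBUFFER | CON_BOOT,
		.index	= -1,
	};

	/* With the patch, register_console() unregisters this console
	 * as soon as the first real (non-CON_BOOT) console registers,
	 * printing the handover line seen below. */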

I get this, across netconsole:

[17179569.184000] console handover: boot [earlyvga_f_0] -> real [tty0]

wanna take a look at why there's cruft in bootconsole->name please?

in grub.conf I have

kernel /boot/bzImage-2.6.21-rc7-mm1 ro root=LABEL=/ rhgb vga=0x263 
[EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:0D:56:C6:C6:CC profile=1 
earlyprintk=vga resume=8:5 time

and I'm using

http://userweb.kernel.org/~akpm/config-sony.txt

Thanks.


Re: [PATCH] Transparently handle .symbol lookup for kprobes

2007-04-24 Thread Srinivasa Ds
Paul Mackerras wrote:
 Srinivasa Ds writes:
 
 +} else {\
 +char dot_name[KSYM_NAME_LEN+1]; \
 +dot_name[0] = '.';  \
 +dot_name[1] = '\0'; \
 +strncat(dot_name, name, KSYM_NAME_LEN); \
 
 Assuming the kernel strncat works like the userspace one does, there
 is a possibility that dot_name[] won't be properly null-terminated
 here.  If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set
 dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch
 dot_name[KSYM_NAME_LEN].

Irrespective of the length of the string, the kernel implementation of
strncat() (lib/string.c) ensures that the last character of the string is
set to null. So dot_name[] is always null-terminated.


char *strncat(char *dest, const char *src, size_t count)
{
char *tmp = dest;

if (count) {
while (*dest)
dest++;
while ((*dest++ = *src++) != 0) {
if (--count == 0) {
*dest = '\0';
break;
}
}
}
return tmp;
}
EXPORT_SYMBOL(strncat);
===

Is this OK then ??


Thanks
 Srinivasa DS

 
 Paul.



Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 The softlockup watchdog is currently a nuisance in a virtual machine,
 since the whole system could have the CPU stolen from it for a long
 period of time.  While it would be unlikely for a guest domain to be
 denied timer interrupts for over 10s, it could happen and any softlockup
 message would be completely spurious.
 
 Earlier I proposed that sched_clock() return time in unstolen
 nanoseconds, which is how Xen and VMI currently implement it.  If the
 softlockup watchdog uses sched_clock() to measure time, it would
 automatically ignore stolen time, and therefore only report when the
 guest itself locked up.  When running native, sched_clock() returns
 real-time nanoseconds, so the behaviour would be unchanged.
 
 Note that sched_clock() used this way is inherently per-cpu, so this
 patch makes sure that the per-processor watchdog thread initialized
 its own timestamp.
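
A sketch of the idea (a simplification of the watchdog, with the
timestamp switched to sched_clock(); names abbreviated):

	static DEFINE_PER_CPU(unsigned long long, touch_timestamp);

	/* Updated from the per-CPU watchdog thread and the touch points.
	 * Since Xen/VMI implement sched_clock() in unstolen nanoseconds,
	 * stolen time can never accumulate toward the 10s threshold. */
	static void touch_timestamp_update(void)
	{
		per_cpu(touch_timestamp, raw_smp_processor_id()) = sched_clock();
	}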

This patch
(ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch)
causes six failures in the locking self-tests, which I must say is rather
clever of it.


Here's the first one:

[17179569.184000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., 
Ingo Molnar
[17179569.184000] ... MAX_LOCKDEP_SUBCLASSES:8
[17179569.184000] ... MAX_LOCK_DEPTH:  30
[17179569.184000] ... MAX_LOCKDEP_KEYS:2048
[17179569.184000] ... CLASSHASH_SIZE:   1024
[17179569.184000] ... MAX_LOCKDEP_ENTRIES: 8192
[17179569.184000] ... MAX_LOCKDEP_CHAINS:  16384
[17179569.184000] ... CHAINHASH_SIZE:  8192
[17179569.184000]  memory used by lock dependency info: 992 kB
[17179569.184000]  per task-struct memory footprint: 1200 bytes
[17179569.184000] 
[17179569.184000] | Locking API testsuite:
[17179569.184000] 

[17179569.184000]  | spin |wlock |rlock |mutex 
| wsem | rsem |
[17179569.184000]   
--
[17179569.184000]  A-A deadlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184000]  A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184000]  A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184001]  A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184002]  A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184003]  A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184004]  A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184005] double unlock:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184006]   initialize held:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184006]  bad unlock order:  ok  |  ok  |  ok  |  ok  
|  ok  |  ok  |
[17179569.184006]   
--
[17179569.184006]   recursive read-lock: |  ok  |   
  |  ok  |
[17179569.184006]recursive read-lock #2: |  ok  |   
  |  ok  |
[17179569.184007] mixed read-write-lock: |  ok  |   
  |  ok  |
[17179569.184007] mixed write-read-lock: |  ok  |   
  |  ok  |
[17179569.184007]   
--
[17179569.184007]  hard-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[17179569.184007]  soft-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[17179569.184007]  hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[17179569.184007]  soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[17179569.184007]    sirq-safe-A => hirqs-on/12:  ok  |  ok  |irq event 
stamp: 458
[17179569.184007] hardirqs last  enabled at (458): [c01e4116] 
irqsafe2A_rlock_12+0x96/0xa3
[17179569.184007] hardirqs last disabled at (457): [c01095b9] 
sched_clock+0x5e/0xe9
[17179569.184007] softirqs last  enabled at (454): [c01e4101] 
irqsafe2A_rlock_12+0x81/0xa3
[17179569.184007] softirqs last disabled at (450): [c01e408b] 
irqsafe2A_rlock_12+0xb/0xa3
[17179569.184007] FAILED| [c0104cf0] dump_trace+0x63/0x1ec
[17179569.184007]  [c0104e93] show_trace_log_lvl+0x1a/0x30
[17179569.184007]  [c01059ec] show_trace+0x12/0x14
[17179569.184007]  [c0105a45] dump_stack+0x16/0x18
[17179569.184007]  [c01e1eb5] dotest+0x6b/0x3d0
[17179569.184007]  [c01eb249] locking_selftest+0x915/0x1a58
[17179569.184007]  [c048c979] start_kernel+0x1d0/0x2a2
[17179569.184007]  ===
[17179569.184007] 
[17179569.184007]    sirq-safe-A => hirqs-on/21:irq event stamp: 462

Re: [REPORT] First glitch1 results, 2.6.21-rc7-git6-CFSv5 + SD 0.46

2007-04-24 Thread Ingo Molnar

* Ed Tomlinson [EMAIL PROTECTED] wrote:

  SD 0.46           1-2 FPS
  cfs v5 nice -19   219-233 FPS
  cfs v5 nice 0     1000-1996 FPS
  cfs v5 nice -10   60-65 FPS

the problem is, the glxgears portion of this test is an _inverse_ 
testcase.

The reason? glxgears on true 3D hardware will _not_ use X, it will 
directly use the 3D driver of the kernel. So by renicing X to -19 you 
give the xterms more chance to show stuff - the performance of the 
glxgears will 'degrade' - but that is what you asked for: glxgears is 
'just another CPU hog' that competes with X, it's not a true X client.

if you are after glxgears performance in this test then you'll get the 
best performance out of this by renicing X to +19 or even SCHED_BATCH.

Ingo


Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
 On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
 wrote:

   
 The softlockup watchdog is currently a nuisance in a virtual machine,
 since the whole system could have the CPU stolen from it for a long
 period of time.  While it would be unlikely for a guest domain to be
 denied timer interrupts for over 10s, it could happen and any softlockup
 message would be completely spurious.

 Earlier I proposed that sched_clock() return time in unstolen
 nanoseconds, which is how Xen and VMI currently implement it.  If the
 softlockup watchdog uses sched_clock() to measure time, it would
 automatically ignore stolen time, and therefore only report when the
 guest itself locked up.  When running native, sched_clock() returns
 real-time nanoseconds, so the behaviour would be unchanged.

 Note that sched_clock() used this way is inherently per-cpu, so this
 patch makes sure that the per-processor watchdog thread initialized
 its own timestamp.
 

 This patch
 (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch)
 causes six failures in the locking self-tests, which I must say is rather
 clever of it.
   

Interesting.  Which variation of sched_clock do you have in your tree at
the moment?

J


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
* Peter Williams [EMAIL PROTECTED] wrote:
  The cases are fundamentally different in behavior, because in the
  first case, X hardly consumes the time it would get in any scheme,
  while in the second case X really is CPU bound and will happily
  consume any CPU time it can get.

 Which still doesn't justify an elaborate points sharing scheme.
 Whichever way you look at it, that's just another way of giving X
 more CPU bandwidth and there are simpler ways to give X more CPU if it
 needs it.  However, I think there's something seriously wrong if it
 needs the -19 nice that I've heard mentioned.

Gene has done some testing under CFS with X reniced to +10 and the
desktop still worked smoothly for him.

As a data point here, and probably nothing to do with X, but I did manage to 
lock it up, solid, reset button time tonight, by wanting 'smart' to get done 
with an update session after amanda had started.  I took both smart processes 
I could see in htop all the way to -19, but when it was about done about 3 
minutes later, everything came to an instant, frozen, reset button required 
lockup.  I should have stopped at -17 I guess. :(

So CFS does not 'need' a reniced 
X. There are simply advantages to negative nice levels: for example
screen refreshes are smoother on any scheduler i tried. BUT, there is a
caveat: on non-CFS schedulers i tried X is much more prone to get into
'overscheduling' scenarios that visibly hurt X's performance, while on
CFS there's a max of 1000-1500 context switches a second at nice -10.
(which, considering the cost of a context switch is well under 1%
overhead.)

So, my point is, the nice level of X for desktop users should not be set
lower than a low limit suggested by that particular scheduler's author.
That limit is scheduler-specific. Con i think recommends a nice level of
-1 for X when using SD [Con, can you confirm?], while my tests show that
if you want you can go as low as -10 under CFS, without any bad
side-effects. (-19 was a bit too much)

 [...]  You might as well just run it as a real time process.

hm, that would be a bad idea under any scheduler (including CFS),
because real time processes can starve other processes indefinitely.

   Ingo



-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
I have discovered that all human evil comes from this, man's being unable
to sit still in a room.
-- Blaise Pascal


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Rogan Dawes

Ingo Molnar wrote:


static void
yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
{
struct rb_node *curr, *next, *first;
struct task_struct *p_next;

/*
 * yield-to support: if we are on the same runqueue then
 * give half of our wait_runtime (if it's positive) to the other task:
 */
if (p_to && p->wait_runtime > 0) {
	p->wait_runtime >>= 1;
	p_to->wait_runtime += p->wait_runtime;
}

the above is the basic expression of: charge a positive bank balance. 



[..]

[note, due to the nanoseconds unit there's no rounding loss to worry 
about.]


Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?


Ingo


Rogan


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Gene Heskett [EMAIL PROTECTED] wrote:

  Gene has done some testing under CFS with X reniced to +10 and the 
  desktop still worked smoothly for him.
 
 As a data point here, and probably nothing to do with X, but I did 
 manage to lock it up, solid, reset button time tonight, by wanting 
 'smart' to get done with an update session after amanda had started.  
 I took both smart processes I could see in htop all the way to -19, 
 but when it was about done about 3 minutes later, everything came to 
 an instant, frozen, reset button required lockup.  I should have 
 stopped at -17 I guess. :(

yeah, i guess this has little to do with X. I think in your scenario it 
might have been smarter to either stop, or to renice the workloads that 
took away CPU power from others to _positive_ nice levels. Negative nice 
levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the 
SysRq-N [SysRq-Nice] implementation in my tree to not only change 
real-time tasks to SCHED_OTHER, but to also renice negative nice levels 
back to 0 - this will show up in -v6. That way you'd only have had to 
hit SysRq-N to get the system out of the wedge.)
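
(In sketch form, the new Alt-SysRq-N walk would be something like the
following; normalize_rt_task() is a hypothetical stand-in for the
existing SCHED_OTHER conversion:)

	read_lock_irq(&tasklist_lock);
	do_each_thread(g, p) {
		if (rt_task(p))
			normalize_rt_task(p);	/* back to SCHED_OTHER */
		if (task_nice(p) < 0)
			set_user_nice(p, 0);	/* drop negative nice too */
	} while_each_thread(g, p);
	read_unlock_irq(&tasklist_lock);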

Ingo


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 12:58 +1000, Neil Brown wrote:
 On Friday April 20, [EMAIL PROTECTED] wrote:
  Scale writeback cache per backing device, proportional to its writeout 
  speed.
 
 So it works like this:
 
  We account for writeout in full pages.
  When a page has the Writeback flag cleared, we account that as a
  successfully retired write for the relevant bdi.
  By using floating averages we keep track of how many writes each bdi
  has retired 'recently' where the unit of time in which we understand
  'recently' is a single page written.

That is actually the period I keep referring to. So 'recently' is the
last 'period' number of writeout completions.

  We keep a floating average for each bdi, and a floating average for
  the total writeouts (that 'average' is, of course, 1.)

1 in the sense of unity, yes :-)

  Using these numbers we can calculate what fraction of 'recently'
  retired writes were retired by each bdi (get_writeout_scale).
 
  Multiplying this fraction by the system-wide number of pages that are
  allowed to be dirty before write-throttling, we get the number of
  pages that the bdi can have dirty before write-throttling the bdi.
 
  I note that the same fraction is *not* applied to background_thresh.
  Should it be?  I guess not - there would be interesting starting
  transients, as a bdi which had done no writeout would not be allowed
  any dirty pages, so background writeout would start immediately,
  which isn't what you want... or is it?

This is something I have not yet been able to come to a conclusive answer
on...

  For each bdi we also track the number of (dirty, writeback, unstable)
  pages and do not allow this to exceed the limit set for this bdi.
 
  The calculations involving 'reserve' in get_dirty_limits are a little
  confusing.  It looks like you are calculating how much total head-room
  there is for the bdi (pages that the system can still dirty - pages
  this bdi has dirty) and making sure the number returned in pbdi_dirty
  doesn't allow more than that to be used.  

Yes, it limits the earned share of the total dirty limit to the possible
share, ensuring that the total dirty limit is never exceeded.

This is especially relevant when the proportions change faster than the
pages get written out, i.e. when the period < the total dirty limit.
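
In sketch form, the split works out to the following (hypothetical
names, ignoring the fixed-point scaling that the real code uses):

	static unsigned long bdi_dirty_limit(unsigned long dirty_thresh,
					     unsigned long bdi_recent_writeouts,
					     unsigned long total_recent_writeouts,
					     unsigned long other_bdis_dirty)
	{
		/* share of the global limit proportional to this bdi's
		 * recent share of writeout completions... */
		unsigned long bdi_thresh = dirty_thresh *
			bdi_recent_writeouts / total_recent_writeouts;

		/* ...clamped so the global limit is never exceeded while
		 * the proportions are still catching up */
		if (bdi_thresh > dirty_thresh - other_bdis_dirty)
			bdi_thresh = dirty_thresh - other_bdis_dirty;
		return bdi_thresh;
	}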

 This is probably a
  reasonable thing to do but it doesn't feel like the right place.  I
  think get_dirty_limits should return the raw threshold, and
  balance_dirty_pages should do both tests - the bdi-local test and the
  system-wide test.

Ok, that makes sense I guess.

  Currently you have a rather odd situation where
 + if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 + break;
  might include numbers obtained with bdi_stat_sum being compared with
  numbers obtained with bdi_stat.

Yes, I was aware of that. The bdi_thresh is based on bdi_stat() numbers,
whereas the others could be bdi_stat_sum(). I think this is ok, since
the threshold is a 'guess' anyway, we just _need_ to ensure we do not
get trapped by writeouts not arriving (due to getting stuck in the per
cpu deltas).  -- I have all this commented in the new version.

  With these patches, the VM still (I think) assumes that each BDI has
  a reasonable queue limit, so that writeback_inodes will block on a
  full queue.  If a BDI has a very large queue, balance_dirty_pages
  will simply turn lots of DIRTY pages into WRITEBACK pages and then
  think "We've done our duty" without actually blocking at all.

It will block once we exceed the total number of dirty pages allowed for
that BDI. But yes, this does not take away the need for queue limits.

This work was primarily aimed at allowing multiple queues to not
interfere as much, so they all can make progress and not get starved.

  With the extra accounting that we now have, I would like to see
  balance_dirty_pages wait until RECLAIMABLE+WRITEBACK is
  actually less than 'threshold'.  This would probably mean that we
  would need to support per-bdi background_writeout to smooth things
  out.  Maybe that it fodder for another patch-set.

Indeed, I still have to wrap my mind around the background thing. Your
input is appreciated.

  You set:
 + vm_cycle_shift = 1 + ilog2(vm_total_pages);
 
  Can you explain that?

You found the one random knob I hid :-)

   My experience is that scaling dirty limits
  with main memory isn't what we really want.  When you get machines
  with very large memory, the amount that you want to be dirty is more
  a function of the speed of your IO devices, rather than the amount
  of memory, otherwise you can sometimes see large filesystem lags
  ('sync' taking minutes?)
 
  I wonder if it makes sense to try to limit the dirty data for a bdi
  to the amount that it can write out in some period of time - maybe 3
  seconds.  Probably configurable.  You seem to have almost all the
  infrastructure in place to do that, and I think it 

Re: [patch 1/4] Ignore stolen time in the softlockup watchdog

2007-04-24 Thread Andrew Morton
On Mon, 23 Apr 2007 23:58:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 Andrew Morton wrote:
  On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
  wrote:
 

  The softlockup watchdog is currently a nuisance in a virtual machine,
  since the whole system could have the CPU stolen from it for a long
  period of time.  While it would be unlikely for a guest domain to be
  denied timer interrupts for over 10s, it could happen and any softlockup
  message would be completely spurious.
 
  Earlier I proposed that sched_clock() return time in unstolen
  nanoseconds, which is how Xen and VMI currently implement it.  If the
  softlockup watchdog uses sched_clock() to measure time, it would
  automatically ignore stolen time, and therefore only report when the
  guest itself locked up.  When running native, sched_clock() returns
  real-time nanoseconds, so the behaviour would be unchanged.
 
  Note that sched_clock() used this way is inherently per-cpu, so this
  patch makes sure that the per-processor watchdog thread initialized
  its own timestamp.
  
 
  This patch
  (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch)
  causes six failures in the locking self-tests, which I must say is rather
  clever of it.

 
 Interesting.

I'll say.

  Which variation of sched_clock do you have in your tree at
 the moment?

Andi's, plus the below fix.

Sigh.  I thought I was only two more bugs away from a release, then...


[18014389.347124] BUG: unable to handle kernel paging request at virtual 
address 6b6b7193
[18014389.347142]  printing eip:
[18014389.347149] c029a80c
[18014389.347156] *pde = 
[18014389.347166] Oops:  [#1]
[18014389.347174] Modules linked in: i915 drm ipw2200 sonypi ipv6 autofs4 hidp 
l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 
xt_state nf_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables 
cpufreq_ondemand video sbs button battery asus_acpi ac nvram ohci1394 ieee1394 
ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_seq_dummy snd_seq_oss 
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 
sr_mod cdrom snd_timer ieee80211 i2c_i801 piix ieee80211_crypt i2c_core generic 
snd soundcore snd_page_alloc ext3 jbd ide_disk ide_core
[18014389.347520] CPU:0
[18014389.347521] EIP:0060:[c029a80c]Tainted: G  D VLI
[18014389.347522] EFLAGS: 00010296   (2.6.21-rc7-mm1 #35)
[18014389.347547] EIP is at input_release_device+0x8/0x4e
[18014389.347555] eax: c99709a8   ebx: 6b6b6b6b   ecx: 0286   edx: 
[18014389.347563] esi: 6b6b6b6b   edi: c99709cc   ebp: c21e3d40   esp: c21e3d38
[18014389.347571] ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
[18014389.347580] Process khubd (pid: 159, ti=c21e2000 task=c20a62f0 
task.ti=c21e2000)
[18014389.347588] Stack: 6b6b6b6b c99709a8 c21e3d60 c029b489 c2014ec8 c9182000 
c96b167c c9970954 
[18014389.347655]c9970954 c99709cc c21e3d80 c029d401 c9977a6c c96b1000 
c21e3d90 c9970954 
[18014389.347708]c99709a8 c9164000 c21e3d90 c029d4b5 c96b1000 c9970564 
c21e3db0 c029c50b 
[18014389.347771] Call Trace:
[18014389.347792]  [c029b489] input_close_device+0x13/0x51
[18014389.347810]  [c029d401] mousedev_destroy+0x29/0x7e
[18014389.347827]  [c029d4b5] mousedev_disconnect+0x5f/0x63
[18014389.347842]  [c029c50b] input_unregister_device+0x6a/0x100
[18014389.347858]  [c02abf9c] hidinput_disconnect+0x24/0x41
[18014389.347874]  [c02aef29] hid_disconnect+0x79/0xc9
[18014389.347889]  [c028e1db] usb_unbind_interface+0x47/0x8f
[18014389.347916]  [c0256852] __device_release_driver+0x74/0x90
[18014389.347933]  [c0256c5f] device_release_driver+0x37/0x4e
[18014389.347957]  [c02561c6] bus_remove_device+0x73/0x82
[18014389.347977]  [c02547c1] device_del+0x214/0x28c
[18014389.348132]  [c028bb72] usb_disable_device+0x62/0xc2
[18014389.348148]  [c0288893] usb_disconnect+0x99/0x126
[18014389.348163]  [c0288d2c] hub_thread+0x3a5/0xb07
[18014389.348178]  [c012cbe5] kthread+0x6e/0x79
[18014389.348194]  [c0104917] kernel_thread_helper+0x7/0x10
[18014389.348210]  ===
[18014389.348218] INFO: lockdep is turned off.
[18014389.348224] Code: 5b 5d c3 55 b9 f0 ff ff ff 8b 50 0c 89 e5 83 ba 28 06 
00 00 00 75 08 89 82 28 06 00 00 31 c9 5d 89 c8 c3 55 89 e5 56 53 8b 70 0c 39 
86 28 06 00 00 75 3a 8b 9e e4 08 00 00 c7 86 28 06 00 00 00 

I dunno.  I'll keep plugging for another couple hours then I'll shove
out what I have as a -mm snapshot whatsit.

Things are just ridiculous.  I'm thinking of having a hard-disk crash and
accidentally losing everything.



From: Andrew Morton [EMAIL PROTECTED]

WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to 
.init.text:sc_cpu_event from .data between 'sc_cpu_notifier' (at offset 0x2110) 
and 'mcelog'

Use hotcpu_notifier().  This takes care of making sure that the unused code

How do you send a reply to an email you have deleted.

2007-04-24 Thread lkml777
How do you send a reply to an email you have deleted?
-- 
  
  [EMAIL PROTECTED]

-- 
http://www.fastmail.fm - I mean, what is it about a decent email service?



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
* Gene Heskett [EMAIL PROTECTED] wrote:
  Gene has done some testing under CFS with X reniced to +10 and the
  desktop still worked smoothly for him.

 As a data point here, and probably nothing to do with X, but I did
 manage to lock it up, solid, reset button time tonight, by wanting
 'smart' to get done with an update session after amanda had started.
 I took both smart processes I could see in htop all the way to -19,
 but when it was about done about 3 minutes later, everything came to
 an instant, frozen, reset button required lockup.  I should have
 stopped at -17 I guess. :(

yeah, i guess this has little to do with X. I think in your scenario it
might have been smarter to either stop, or to renice the workloads that
took away CPU power from others to _positive_ nice levels. Negative nice
levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the
SysRq-N [SysRq-Nice] implementation in my tree to not only change
real-time tasks to SCHED_OTHER, but to also renice negative nice levels
back to 0 - this will show up in -v6. That way you'd only have had to
hit SysRq-N to get the system out of the wedge.)

   Ingo

That sounds handy, particularly with idiots like me at the wheel...


-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
When a Banker jumps out of a window, jump after him--that's where the money 
is.
-- Robespierre


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Gene Heskett [EMAIL PROTECTED] wrote:

  (Btw., to protect against such mishaps in the future i have changed 
  the SysRq-N [SysRq-Nice] implementation in my tree to not only 
  change real-time tasks to SCHED_OTHER, but to also renice negative 
  nice levels back to 0 - this will show up in -v6. That way you'd 
  only have had to hit SysRq-N to get the system out of the wedge.)
 
 That sounds handy, particularly with idiots like me at the wheel...

by that standard i guess we tinkerers are all idiots ;)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Ingo Molnar wrote:


* Gene Heskett [EMAIL PROTECTED] wrote:


Gene has done some testing under CFS with X reniced to +10 and the
desktop still worked smoothly for him.


As a data point here, and probably nothing to do with X, but I did
manage to lock it up, solid, reset button time tonight, by wanting
'smart' to get done with an update session after amanda had started.
I took both smart processes I could see in htop all the way to -19,
but when it was about done about 3 minutes later, everything came to
an instant, frozen, reset button required lockup.  I should have
stopped at -17 I guess. :(


yeah, i guess this has little to do with X. I think in your scenario it
might have been smarter to either stop, or to renice the workloads that
took away CPU power from others to _positive_ nice levels. Negative nice
levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the
SysRq-N [SysRq-Nice] implementation in my tree to not only change
real-time tasks to SCHED_OTHER, but to also renice negative nice levels
back to 0 - this will show up in -v6. That way you'd only have had to
hit SysRq-N to get the system out of the wedge.)


if you are trying to unwedge a system it may be a good idea to renice all
tasks to 0; it could be that a task at +19 is holding a lock that something
else is waiting for.


David Lang


Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.

2007-04-24 Thread Eric W. Biederman
Vivek Goyal [EMAIL PROTECTED] writes:

 On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote:
 
 Currently because vmlinux does not reflect that the kernel is relocatable
 we still have to support CONFIG_PHYSICAL_START.  So this patch adds a small
 c program to do what we cannot do with a linker script, set the elf header
 type to ET_DYN.
 
 This should remove the last obstacle to removing CONFIG_PHYSICAL_START
 on x86_64.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

 [Dropping fastboot mailing list from CC as kexec mailing list is new list
  for this discussion]

 [..]
 +void file_open(const char *name)
 +{
 +if ((fd = open(name, O_RDWR, 0)) < 0)
 +	die("Unable to open `%s': %m", name);
 +}
 +
 +static void mketrel(void)
 +{
 +unsigned char e_type[2];
 +if (read(fd, e_ident, sizeof(e_ident)) != sizeof(e_ident))
 +	die("Cannot read ELF header: %s\n", strerror(errno));
 +
 +if (memcmp(e_ident, ELFMAG, 4) != 0)
 +	die("No ELF magic\n");
 +
 +if ((e_ident[EI_CLASS] != ELFCLASS64) &&
 +    (e_ident[EI_CLASS] != ELFCLASS32))
 +	die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]);
 +
 +if ((e_ident[EI_DATA] != ELFDATA2LSB) &&
 +    (e_ident[EI_DATA] != ELFDATA2MSB))
 +	die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]);
 +
 +if (e_ident[EI_VERSION] != EV_CURRENT)
 +	die("Unknown ELF version: %d\n", e_ident[EI_VERSION]);
 +
 +if (e_ident[EI_DATA] == ELFDATA2LSB) {
 +	e_type[0] = ET_REL & 0xff;
 +	e_type[1] = ET_REL >> 8;
 +} else {
 +	e_type[1] = ET_REL & 0xff;
 +	e_type[0] = ET_REL >> 8;
 +}

 Hi Eric,

 Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux,
 as it does not find it to be of an executable type.

Doh.  It should be ET_DYN.  I had relocatable much too much on the brain,
and so I stuffed in the wrong type.

 I am not well versed with various conventions but if I go through Executable
 and Linking Format document, this is what it says about various file types.

 • A relocatable file holds code and data suitable for linking with other
   object files to create an executable or a shared object file.

 • An executable file holds a program suitable for execution.

 • A shared object file holds code and data suitable for linking in two
   contexts. First, the link editor may process it with other relocatable and
   shared object files to create another object file. Second, the dynamic
   linker combines it with an executable file and other shared objects
   to create a process image.

 So the above does not seem to fit the ET_REL type; we can't relink this
 vmlinux. And it does not seem to fit the ET_DYN definition either. We are
 not relinking this vmlinux with another executable or other relocatable
 files.

 I remember once you mentioned the term dynamic executable, meaning one
 that can be loaded at a non-compiled address and run without requiring any
 relocation processing. This vmlinux falls into that category, but I can't
 relate it to the standard ELF file definitions.

Sorry about that.  

ET_DYN without a PT_DYNAMIC segment, without a PT_INTERP segment,
and with a valid entry point is exactly that.  Loaders never perform
relocation processing on an ET_DYN executable, but they are allowed to
shift all of the addresses by a single delta, so long as all of the
alignment restrictions are honored.

Relocation processing, when it happens, comes from the dynamic linker,
which is set in PT_INTERP, and the dynamic linker looks at PT_DYNAMIC
to figure out what relocations are available for processing.

The basic issue is that ld doesn't really comprehend what we are doing,
since we are building a position-independent executable in a way
that the normal tools don't allow, so we have to poke the header.

If we had compiled with -fPIC we could have specified -pie or
--pic-executable to ld and it would have done the right thing.
But as it is, our executable only changes physical addresses and
not virtual addresses, something completely foreign to ld.
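
Concretely, the fix to the header poke is a two-byte change (the same
sketch as the mketrel() above, with the corrected type; e_type sits at
the same offset, right after the 16 e_ident bytes, in both ELF32 and
ELF64):

	if (e_ident[EI_DATA] == ELFDATA2LSB) {
		e_type[0] = ET_DYN & 0xff;	/* little-endian: LSB first */
		e_type[1] = ET_DYN >> 8;
	} else {
		e_type[1] = ET_DYN & 0xff;	/* big-endian: MSB first */
		e_type[0] = ET_DYN >> 8;
	}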

Eric


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* David Lang [EMAIL PROTECTED] wrote:

  (Btw., to protect against such mishaps in the future i have changed 
  the SysRq-N [SysRq-Nice] implementation in my tree to not only 
  change real-time tasks to SCHED_OTHER, but to also renice negative 
  nice levels back to 0 - this will show up in -v6. That way you'd 
  only have had to hit SysRq-N to get the system out of the wedge.)
 
 if you are trying to unwedge a system it may be a good idea to renice 
 all tasks to 0, it could be that a task at +19 is holding a lock that 
 something else is waiting for.

Yeah, that's possible too, but +19 tasks are getting a small but 
guaranteed share of the CPU so eventually it ought to release it. It's 
still a possibility, but i think i'll wait for a specific incident to 
happen first, and then react to that incident :-)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 yeah, i guess this has little to do with X. I think in your scenario 
 it might have been smarter to either stop, or to renice the workloads 
 that took away CPU power from others to _positive_ nice levels. 
 Negative nice levels can indeed be dangerous.

btw., was X itself at nice 0 or nice -10 when the lockup happened?

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Rogan Dawes [EMAIL PROTECTED] wrote:

 if (p_to && p->wait_runtime > 0) {
 	p->wait_runtime >>= 1;
 	p_to->wait_runtime += p->wait_runtime;
 }
 
 the above is the basic expression of: charge a positive bank balance. 
 
 
 [..]
 
  [note, due to the nanoseconds unit there's no rounding loss to worry 
  about.]
 
 Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?

yes. But note that we'll only truly have to worry about that when we 
have context-switching performance in that range - currently it's at 
least 2-3 orders of magnitude above that. Microseconds seemed to me to 
be too coarse already, that's why i picked nanoseconds and 64-bit 
arithmetics for CFS.

Ingo


Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-24 Thread Yasunori Goto
Mel-san.

I tested your patch (Thanks!). It worked. But..

 In my understanding, the reason ia64 doesn't use the early_param() macro
 for mem= et al. is that it has to use the mem= option during EFI handling,
 which is called before parse_early_param().
 
 Current ia64's boot path is
  setup_arch()
 -> efi handling -> parse_early_param() -> numa handling -> pgdat/zone init
 
 The kernelcore= option is just used at pgdat/zone initialization (no
 arch-dependent part...).
 
 So I think just adding
 ==
 early_param("kernelcore", cmdline_parse_kernelcore)
 ==
 to ia64 is ok.

Then it can be common code.
How about this patch? I confirmed that this works well too.
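
(For clarity, the common-code end of this - which the truncated
mm/page_alloc.c hunk at the end presumably adds - is just the one
registration, a sketch:)

	/* mm/page_alloc.c: register the parser once, for every arch
	 * that calls parse_early_param() */
	early_param("kernelcore", cmdline_parse_kernelcore);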



When the kernelcore boot option is specified, the kernel can't boot up
on ia64; it gets stuck in an eternal loop.
In addition, this code can be common code, and this fix makes it so.
I tested this patch on my ia64 box.


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

-

 arch/i386/kernel/setup.c   |1 -
 arch/ia64/kernel/efi.c |2 --
 arch/powerpc/kernel/prom.c |1 -
 arch/ppc/mm/init.c |2 --
 arch/x86_64/kernel/e820.c  |1 -
 include/linux/mm.h |1 -
 mm/page_alloc.c|3 +++
 7 files changed, 3 insertions(+), 8 deletions(-)

Index: kernelcore/arch/ia64/kernel/efi.c
===
--- kernelcore.orig/arch/ia64/kernel/efi.c  2007-04-24 15:09:37.0 
+0900
+++ kernelcore/arch/ia64/kernel/efi.c   2007-04-24 15:25:22.0 +0900
@@ -423,8 +423,6 @@ efi_init (void)
mem_limit = memparse(cp + 4, cp);
 		} else if (memcmp(cp, "max_addr=", 9) == 0) {
 			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, cp));
-		} else if (memcmp(cp, "kernelcore=", 11) == 0) {
-			cmdline_parse_kernelcore(cp+11);
 		} else if (memcmp(cp, "min_addr=", 9) == 0) {
 			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, cp));
} else {
Index: kernelcore/arch/i386/kernel/setup.c
===
--- kernelcore.orig/arch/i386/kernel/setup.c2007-04-24 15:29:20.0 
+0900
+++ kernelcore/arch/i386/kernel/setup.c 2007-04-24 15:29:39.0 +0900
@@ -195,7 +195,6 @@ static int __init parse_mem(char *arg)
return 0;
 }
 early_param("mem", parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 #ifdef CONFIG_PROC_VMCORE
 /* elfcorehdr= specifies the location of elf core header
Index: kernelcore/arch/powerpc/kernel/prom.c
===
--- kernelcore.orig/arch/powerpc/kernel/prom.c  2007-04-24 15:04:47.0 
+0900
+++ kernelcore/arch/powerpc/kernel/prom.c   2007-04-24 15:30:25.0 
+0900
@@ -431,7 +431,6 @@ static int __init early_parse_mem(char *
return 0;
 }
 early_param("mem", early_parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 /*
  * The device tree may be allocated below our memory limit, or inside the
Index: kernelcore/arch/ppc/mm/init.c
===
--- kernelcore.orig/arch/ppc/mm/init.c  2007-04-24 15:04:47.0 +0900
+++ kernelcore/arch/ppc/mm/init.c   2007-04-24 15:30:56.0 +0900
@@ -214,8 +214,6 @@ void MMU_setup(void)
}
 }
 
-early_param("kernelcore", cmdline_parse_kernelcore);
-
 /*
  * MMU_init sets up the basic memory mappings for the kernel,
  * including both RAM and possibly some I/O regions,
Index: kernelcore/arch/x86_64/kernel/e820.c
===
--- kernelcore.orig/arch/x86_64/kernel/e820.c   2007-04-24 15:04:47.0 
+0900
+++ kernelcore/arch/x86_64/kernel/e820.c2007-04-24 15:34:02.0 
+0900
@@ -604,7 +604,6 @@ static int __init parse_memopt(char *p)
return 0;
 } 
 early_param("mem", parse_memopt);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 static int userdef __initdata;
 
Index: kernelcore/include/linux/mm.h
===
--- kernelcore.orig/include/linux/mm.h  2007-04-24 15:09:37.0 +0900
+++ kernelcore/include/linux/mm.h   2007-04-24 15:35:52.0 +0900
@@ -1051,7 +1051,6 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
Index: kernelcore/mm/page_alloc.c
===
--- kernelcore.orig/mm/page_alloc.c 2007-04-24 15:09:37.0 +0900
+++ kernelcore/mm/page_alloc.c  2007-04-24 16:00:21.0 +0900
@@ 

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 [...] That way you'd only have had to hit SysRq-N to get the system 
 out of the wedge.)

small correction: Alt-SysRq-N.

Ingo


[PATCH] i802.11: fixed memory leak on multicasts

2007-04-24 Thread Markus Pietrek

Hi,

socket buffers were not always freed when receiving multicasts

Bye,
--
Markus Pietrek
Lead Software Engineer
Phone: +49-7667-908-501, Fax: +49-7667-908-200
mailto:[EMAIL PROTECTED]

FS Forth-Systeme GmbH
A Digi International Company
Kueferstr. 8, 79206 Breisach, Germany
Tax: 07008/12000 / VAT: DE142208834 / Reg. Amtsgericht Freiburg HRB 290212
Directors: Klaus Flesch, Subramanian Krishnan, Dieter Vesper
http://www.digi.com
Index: net/ieee80211/ieee80211_rx.c
===
RCS file: 
/data/vcs/cvs/fsforth_products/LxNETES/linux/net/ieee80211/ieee80211_rx.c,v
retrieving revision 1.5
retrieving revision 1.6
diff -c -r1.5 -r1.6
*** net/ieee80211/ieee80211_rx.c13 Apr 2007 12:39:38 -  1.5
--- net/ieee80211/ieee80211_rx.c23 Apr 2007 15:51:28 -  1.6
***
*** 860,868 
break;
}
  
!   if (is_packet_for_us)
if (!ieee80211_rx(ieee, skb, stats))
dev_kfree_skb_irq(skb);
return;
  
  drop_free:
--- 860,871 
break;
}
  
!   if (is_packet_for_us) {
if (!ieee80211_rx(ieee, skb, stats))
dev_kfree_skb_irq(skb);
+ } else
+ dev_kfree_skb_irq(skb);
+ 
return;
  
  drop_free:
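
In plain form, the rule the fix restores: every path that does not hand
the skb up the stack must free it, otherwise the buffer leaks. A sketch
of the corrected tail (not a drop-in hunk):

	if (is_packet_for_us) {
		if (!ieee80211_rx(ieee, skb, stats))
			dev_kfree_skb_irq(skb);	/* stack did not take it */
	} else
		dev_kfree_skb_irq(skb);		/* not for us: free it here */

	return;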


cfs works fine for me

2007-04-24 Thread Hemmann, Volker Armin
Hello,

I have tried the cfs patches with 2.6.20.7 in the last days.

I am using KDE 3.5.6, gentoo unstable and have a dual core AMD64 system with 
1GB ram and a nvidia card (using the closed source drivers, yes I suck, but I 
love playing 3d games once in a while).

I don't have interactivity problems with plain kernel.org kernels (except when 
swapping a lot, swapping really sucks)
My system works well and is stable.

With the cfs patches, my system continues to work well. I have not seen any 
regressions, desktop is snappy, emerge'ing stuff (niced to +19), does not 
hurt and unreal tournament 2004 is as fast (or slow, depends on the 
situation) as always. It even looks like the FPS under heavy stress (like 
onslaught torlan when lots of bots and me are fighting at a powernode) don't 
go down as low as with the mainline scheduler. Not a big difference, but it 
is there (20-25 with a plain kernel.org kernel in extreme situations compared 
to 30 with the cfs patches). Maybe I did not hit the worst case - playing is 
a little bit restricted at the moment, my wrist and elbow hate me - but it 
looks promising. Apart from the worst-case scenarios, FPS are more or less 
the same.

My usage consisted of surfing the web with konqueror, watching videos with 
xine and mplayer, using kmail (with tens of thousands of mails in different 
folders), looking at pictures with kuickshow, installing XFCE, assorted 
updates, typing lots and lots of stuff in kate and web forums, listening to 
mp3/ogg with amarok, playing pysol/kpat/lgeneral/wesnoth/ut2004/freecol, a 
lot of that parallel (not ut2004... I don't want to hurt my precious fps...).

Again, my system worked fine with the 'normal' scheduler; from the stuff I 
read in the lkml archives I must be some special kind of guy, so there was no 
improvement on the 'feels snappy or not' front, but there are also no 
regressions. So from my point of view, everything is fine with cfs and I 
would not mind having it as the default scheduler. 

If you want specs of my hardware, my kernel config or any other information, 
just send me an email. I am not subscribed to lkml, nor can I read any of its 
archives in the next couple of days, which is one reason why I don't answer 
in one of the existing threads (I don't even know if there are any at the 
moment), so in case of an answer cc'ing me would be nice.

Glück Auf
Volker


[REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Michael Gerdau
Hi list,

with cfs-v5 finally booting on my machine I have run my daily
numbercrunching jobs on both cfs-v5 and sd-0.46, 2.6.21-v7 on
top of a stock openSUSE 10.2 (X86_64). Config for both kernel
is the same except for the X boost option in cfs-v5 which on
my system didn't work (X still was @ -19; I understand this will
be fixed in -v6). HZ is 250 in both.

System is a Dell XPS M1710, Intel Core2 2.33GHz, 4GB,
NVIDIA GeForce Go 7950 GTX with proprietary driver 1.0-9755

I'm running three single threaded perl scripts that do double
precision floating point math with little i/o after initially
loading the data.

Both cfs and sd showed very similar behavior when monitored in top.
I'll show more or less representative excerpt from a 10 minutes
log, delay 3sec.

sd-0.46
top - 00:14:24 up  1:17,  9 users,  load average: 4.79, 4.95, 4.80
Tasks:   3 total,   3 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.8%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  0.0%si,  0.0%st
Mem:   3348628k total,  1648560k used,  1700068k free,64392k buffers
Swap:  2097144k total,0k used,  2097144k free,   828204k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6671 mgd   33   0 95508  22m 3652 R  100  0.7  44:28.11 perl
 6669 mgd   31   0 95176  22m 3652 R   50  0.7  43:50.02 perl
 6674 mgd   31   0 95368  22m 3652 R   50  0.7  47:55.29 perl


cfs-v5
top - 08:07:50 up 21 min,  9 users,  load average: 4.13, 4.16, 3.23
Tasks:   3 total,   3 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.5%us,  0.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   3348624k total,  1193500k used,  2155124k free,32516k buffers
Swap:  2097144k total,0k used,  2097144k free,   545568k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6357 mgd   20   0 92024  19m 3652 R  100  0.6   8:54.21 perl
 6356 mgd   20   0 91652  18m 3652 R   50  0.6  10:35.52 perl
 6359 mgd   20   0 91700  18m 3652 R   50  0.6   8:47.32 perl

What did surprise me is that CPU utilization was spread 100/50/50
(round robin) most of the time; I had expected 66/66/66 or so.

What I also don't understand is the difference in load average: sd
constantly had higher values. The above figures are representative
for the whole log. I don't know which is better though.


Here are excerpts from a concurrently run vmstat 3 200:

sd-0.46
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 5  0      0 1702928  63664 827876    0    0     0    67  458 1350 100  0  0  0
 3  0      0 1702928  63684 827876    0    0     0    89  468 1362 100  0  0  0
 5  0      0 1702680  63696 827876    0    0     0   132  461 1598  99  1  0  0
 8  0      0 1702680  63712 827892    0    0     0    80  465 1180  99  1  0  0
 3  0      0 1702712  63732 827884    0    0     0    67  453 1005 100  0  0  0
 4  0      0 1702792  63744 827920    0    0     0    41  461 1138 100  0  0  0
 3  0      0 1702792  63760 827916    0    0     0    57  456 1073 100  0  0  0
 3  0      0 1702808  63776 827928    0    0     0   111  473 1095 100  0  0  0
 3  0      0 1702808  63788 827928    0    0     0    81  461 1092  99  1  0  0
 3  0      0 1702188  63808 827928    0    0     0   160  463 1437  99  1  0  0
 3  0      0 1702064  63884 827900    0    0     0   229  479 1125  99  0  0  0
 4  0      0 1702064  63912 827972    0    0     1    77  460 1108 100  0  0  0
 7  0      0 1702032  63920 828000    0    0     0    40  463 1068 100  0  0  0
 4  0      0 1702048  63928 828008    0    0     0    68  454 1114 100  0  0  0
11  0      0 1702048  63928 828008    0    0     0     0  458 1001 100  0  0  0
 3  0      0 1701500  63960 828020    0    0     0

Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Cornelia Huck
On Tue, 24 Apr 2007 15:00:42 +1000,
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:

 Like anything else, modules should have separated the entrypoints for
 
  - Initiating a removal request
  - Releasing the module
 
 The former is use did rmmod, can unregister things from subsystems,
 etc... (and can file if the driver decides to refuse removal requests
 when it's busy doing things or whatever policy that module wants to
 implement).
 
 The later is called when all references to the modules have been
 dropped, it's a bit like the kref release (and could be implemented as
 one).

That sounds quite similar to the problems we have with kobject
refcounting vs. module unloading. The patchset I posted at
http://marc.info/?l=linux-kernelm=117679014404994w=2 exposes the
refcount of the kobject embedded in the module. Maybe the kthread code
could use that reference as well?


Re: NonExecutable Bit in 32Bit

2007-04-24 Thread Tuncer Ayaz

On 4/24/07, William Heimbigner [EMAIL PROTECTED] wrote:

On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote:

 Hey,

 is it right that the NX bit is not used under the i386 arch but
 is under the x86_64 arch?
 If yes, is there a specific reason for it not to be used?

 Ciao Thilo
I don't think so - some i386 cpus definitely have support for
the NX bit.



In detail:
1) if your CPU has NX support (some 32bit Xeons do)
2) it is not disabled in the BIOS
3) you see 'nx' in the 'flags' line in /proc/cpuinfo
4) and you have a kernel with the following config options
CONFIG_HIGHMEM64G=y
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y

NX should just work.
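
(For conditions 1-3, a quick illustrative userspace check - this
snippet is not from the thread, just a convenience:)

/* Crude check for the 'nx' flag in /proc/cpuinfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/proc/cpuinfo", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "flags", 5) && strstr(line, " nx")) {
			puts("CPU advertises NX");
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	puts("no NX flag");
	return 1;
}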

[snip]


Re: [ofa-general] [PATCH] eHCA: Add Modify Port verb

2007-04-24 Thread Christoph Raisch

Hi Hal,
you are correct,
with the current firmware version it will fail later.

Christoph R.

[EMAIL PROTECTED] wrote on 23.04.2007 18:55:59:

 Hi Joachim,

 On Mon, 2007-04-23 at 12:23, Joachim Fenkes wrote:
  Add Modify Port verb support to eHCA driver.
  ib_cm needs this to initialize properly.

 I didn't think IB_PORT_SM was allowed (as QP0 is not exposed) or does
 this just fail later when it is attempted to be actually set ?

 -- Hal



Re: [PATCH 0/9] Kconfig: cleanup s390 v2.

2007-04-24 Thread Martin Schwidefsky
On Mon, 2007-04-23 at 10:45 -0700, Andrew Morton wrote:
  Andrew: I plan to add patches 1-5 to the for-andrew branch of the
  git390 repository if that is fine with you. The only thing that will
  be missing in the tree is the patch that disables wireless for s390.
  The code does compile but without hardware it is moot to have the
  config options. I'll wait until the git-wireless.patch is upstream.
  Patches 7-9 depend on patches found in -mm.
  
 
 umm, OK.  If it's Ok I think I'll duck it for now: -mm is full.
 
 Over-full, really: I've been working basically continuously since Friday
 getting the current dungpile to compile and boot, and it's still miles away
 from that.

I understand. I'll wait until -mm is a little bit smaller again. It is
just that someday I want to finish the Kconfig cleanup; it has been
sitting on my hard drive for ages now.

-- 
blue skies,  IBM Deutschland Entwicklung GmbH
   MartinVorsitzender des Aufsichtsrats: Johann Weihen
 Geschäftsführung: Herbert Kircher
Martin Schwidefsky   Sitz der Gesellschaft: Böblingen
Linux on zSeries Registergericht: Amtsgericht Stuttgart,
   Development   HRB 243294

Reality continues to ruin my life. - Calvin.




Re: [REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Ingo Molnar

* Michael Gerdau [EMAIL PROTECTED] wrote:

 I'm running three single threaded perl scripts that do double 
 precision floating point math with little i/o after initially loading 
 the data.

thanks for the testing!

 What I also don't understand is the difference in load average, sd 
 constantly had higher values, the above figures are representative for 
 the whole log. I don't know which is better though.

hm, it's hard from here to tell that. What load average does the vanilla 
kernel report? I'd take that as a reference.

 Here are excerpts from a concurrently run vmstat 3 200:
 
 sd-0.46
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
  5  0      0 1702928  63664 827876    0    0     0    67  458 1350 100  0  0  0
  3  0      0 1702928  63684 827876    0    0     0    89  468 1362 100  0  0  0
  5  0      0 1702680  63696 827876    0    0     0   132  461 1598  99  1  0  0
  8  0      0 1702680  63712 827892    0    0     0    80  465 1180  99  1  0  0
 
 cfs-v5
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
  6  0      0 2157728  31816 545236    0    0     0   103  543  748 100  0  0  0
  4  0      0 2157780  31828 545256    0    0     0    63  435  752 100  0  0  0
  4  0      0 2157928  31852 545256    0    0     0   105  424  770 100  0  0  0
  4  0      0 2157928  31868 545268    0    0     0   261  457  763 100  0  0  0

interesting - CFS has half the context-switch rate of SD. That is 
probably because on your workload CFS defaults to longer 'timeslices' 
than SD. You can influence the 'timeslice length' under SD via 
/proc/sys/kernel/rr_interval (milliseconds units) and under CFS via 
/proc/sys/kernel/sched_granularity_ns. On CFS the value is not 
necessarily the timeslice length you will observe - for example in your 
workload above the granularity is set to 5 msec, but your rescheduling 
rate is 13 msecs. SD defaults to a rr_interval value of 8 msecs, which in 
your workload produces a timeslice length of 6-7 msecs.

so to be totally 'fair' and get the same rescheduling 'granularity' you 
should probably lower CFS's sched_granularity_ns to 2 msecs.
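
(For reference, a minimal userspace sketch of that tuning step - the 2
msec value is just the suggestion above, and writing the file needs
root:)

/* Set CFS granularity to 2 msecs (2,000,000 ns). Equivalent to:
 * echo 2000000 > /proc/sys/kernel/sched_granularity_ns */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_granularity_ns", "w");

	if (!f) {
		perror("sched_granularity_ns");
		return 1;
	}
	fprintf(f, "2000000\n");
	return fclose(f) ? 1 : 0;
}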

 Last but not least I'd like to add that at least on my system having X 
 niced to -19 does result in kind of erratic (for lack of a better 
 word) desktop behavior. I'll will reevaluate this with -v6 but for now 
 IMO nicing X to -19 is a regression at least on my machine despite the 
 claim that cfs doesn't suffer from it.

indeed with -19 the rescheduling limit is so high under CFS that it does 
not throttle X's scheduling rate enough and so it will make CFS behave 
as badly as other schedulers.

I retested this with -10 and it should work better with that. In -v6 i 
changed the default to -10 too.

 PS: Only learning how to test these things I'm happy to get pointed 
 out the shortcomings of what I tested above. Of course suggestions for 
 improvements are welcome.

your report was perfectly fine and useful. no visible regressions is 
valuable feedback too. [ In fact, such type of feedback is the one i 
find the easiest to resolve ;-) ]

Since you are running number-crunchers you might be able to give 
performance feedback too: do you have any reliable 'performance metric' 
available for your number cruncher jobs (ops per minute, runtime, etc.) 
so that it would be possible to compare number-crunching performance of 
mainline to SD and to CFS as well? If that value is easy to get and 
reliable/stable enough to be meaningful. (And it would be nice to also 
establish some ballpark figure about how much noise there is in any 
performance metric, so that we can see whether any differences between 
schedulers are systematic or not.)

Ingo


cpufreq default governor

2007-04-24 Thread William Heimbigner
Question: is there some reason that kconfig does not allow for default 
governors of conservative/ondemand/powersave?
I'm not aware of any reason why one of those governors could not be used 
as default.


William Heimbigner
[EMAIL PROTECTED]


Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523

2007-04-24 Thread Jiri Kosina
On Tue, 24 Apr 2007, Herbert Xu wrote:

  Hmm, *sigh*. I guess the patch below fixes the problem, but it is a 
  masterpiece in the field of ugliness. And I am not sure whether it is 
  completely correct either. Are there any immediate ideas for better 
  solution with respect to how struct sock locking works?
 Please cc such patches to netdev.  Thanks.

Hi Herbert,

well it's pretty much bluetooth-specific, and bluez-devel was CCed, but 
OK.

  diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
  index 71f5cfb..c5c93cd 100644
  --- a/net/bluetooth/hci_sock.c
  +++ b/net/bluetooth/hci_sock.c
  @@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block 
  *this, unsigned long event,
 /* Detach sockets from device */
 	read_lock(&hci_sk_list.lock);
 	sk_for_each(sk, node, &hci_sk_list.head) {
  -   lock_sock(sk);
  +   if (in_atomic())
  +   bh_lock_sock(sk);
  +   else
  +   lock_sock(sk);
 
 This doesn't do what you think it does.  bh_lock_sock can still succeed
 even with lock_sock held by someone else.

I know, this was precisely the reason why I converted the bh_lock_sock() 
to lock_sock() here some time ago (as it was racy with 
l2cap_connect_cfm()).
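
(For reference, the pattern the TCP input path uses for this - a sketch
of the general idea, not a proposed hci_sock patch:)

	/* bh_lock_sock() only takes the spinlock half of the socket
	 * lock; a process context may still own the socket, so check
	 * and defer instead of racing with the owner. */
	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		/* safe to touch socket state here */
	} else {
		/* owner active: defer, e.g. sk_add_backlog(sk, skb);
		 * release_sock() will process it later */
	}
	bh_unlock_sock(sk);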

 Does this need to occur immediately when an event occurs? If not I'd
 suggest moving this into a workqueue.

Will have to check whether this will be processed properly in time when 
going to suspend.

Thanks,

-- 
Jiri Kosina


Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Tejun Heo
Hello,

Kristen Carlson Accardi wrote:
  static unsigned int ata_print_id = 1;
 @@ -1744,6 +1745,23 @@ int ata_dev_configure(struct ata_device 
   }
   dev->cdb_len = (unsigned int) rc;
  
 + /*
 +  * check to see if this ATAPI device supports
 +  * Asynchronous Notification
 +  */
 + if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
 + {
 + /* issue SET feature command to turn this on */
 + rc = ata_dev_set_AN(dev);

Please don't store err_mask into int rc.  Please store it in a separate
err_mask variable and report it when printing the error message.

 + if (rc) {
 + ata_dev_printk(dev, KERN_ERR,
 + 	"unable to set AN\n");
 + rc = -EINVAL;

Wouldn't -EIO be more appropriate?

 + goto err_out_nosup;
 + }
 + dev->flags |= ATA_DFLAG_AN;
 + }
 +
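
Something along these lines - illustrative only, the exact message
format is your call:

	unsigned int err_mask;

	err_mask = ata_dev_set_AN(dev);
	if (err_mask) {
		ata_dev_printk(dev, KERN_ERR,
			"unable to set AN (err_mask=0x%x)\n", err_mask);
		rc = -EIO;
		goto err_out_nosup;
	}
	dev->flags |= ATA_DFLAG_AN;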

Not NACKing.  Just notes for future improvements.  We need to be more
careful here.  The ATA/ATAPI world is filled with braindamaged devices and I
bet there are devices which advertise they can do AN but choke when AN
is enabled.

This should be handled similarly to ACPI failure.  Currently ACPI does
the following.

1. try once, if fail, record that ACPI failed.  return error to trigger
retry.
2. try again, if fail again, ignore error if possible (!FROZEN) and turn
off ACPI.

This fallback mechanism for optional features can probably be
generalized and used for both ACPI and AN.
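
Sketched out, the AN version of that fallback could look like this (the
an_failed/an_disabled flags are made-up names for illustration, not
existing libata fields):

	err_mask = ata_dev_set_AN(dev);
	if (err_mask) {
		if (!dev->an_failed) {
			dev->an_failed = 1;
			return -EIO;	/* 1st failure: trigger retry */
		}
		/* 2nd failure: ignore if possible, run without AN */
		dev->an_disabled = 1;
	} else
		dev->flags |= ATA_DFLAG_AN;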

-- 
tejun


Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Alan Cox
 + /*
 +  * check to see if this ATAPI device supports
 +  * Asynchronous Notification
 +  */
 + if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
 + {

Bracketing police ^^^

 + /* issue SET feature command to turn this on */
 + rc = ata_dev_set_AN(dev);
 + if (rc) {
 + ata_dev_printk(dev, KERN_ERR,
 + 	"unable to set AN\n");
 + rc = -EINVAL;
 + goto err_out_nosup;

How fatal is this - do we need to ignore the device at this point, or
should we just pretend (possibly correctly) that the device itself does
not support notification? 

 @@ -299,6 +305,8 @@ struct ata_taskfile {
  #define ata_id_queue_depth(id)   (((id)[75] & 0x1f) + 1)
  #define ata_id_removeable(id)    ((id)[0] & (1 << 7))
  #define ata_id_has_dword_io(id)  ((id)[50] & (1 << 0))
 +#define ata_id_has_AN(id)	\
 +	((id[76] && (~id[76])) && ((id)[78] & (1 << 5)))

Might be nice to check ATA version as well to be paranoid but this all
looks ok as its a reserved field since way back when.



Re: [patch 2/7] genhd: expose AN to user space

2007-04-24 Thread Tejun Heo
Kristen Carlson Accardi wrote:
 +static struct disk_attribute disk_attr_capability = {
 + .attr = {.name = capability_flags, .mode = S_IRUGO },
 + .show   = disk_capability_read
 +};

How about just capability?  I think that would be more consistent with
other attributes.

-- 
tejun


Re: [patch 7/7] libata: send event when AN received

2007-04-24 Thread Alan Cox
 + /* check the 'N' bit in word 0 of the FIS */
 + if (f[0] & (1 << 15)) {
 + 	int port_addr = ((f[0] & 0x0f00) >> 8);
 + 	struct ata_device *adev = ap->device[port_addr];

You can't be sure that the port_addr returned will be in range if a
device is malfunctioning...
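
I.e. something like the following guard before indexing (sketch only;
the PMP field is four bits, so it can exceed the device array):

	int port_addr = (f[0] & 0x0f00) >> 8;

	if (port_addr >= ATA_MAX_DEVICES)
		return;		/* bogus port from a broken device */
	adev = ap->device[port_addr];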



Re: [mmc] alternative TI FM MMC/SD driver for 2.6.21-rc7

2007-04-24 Thread Sergey Yanovich

Hi,

If you add support for, let's say, [tifm_8xx2] in the future, which
would have port offsets different than [tifm_7xx1], you would also need
completely new modules for the slots (sd, ms, etc).



Doesn't this constitute unbounded speculation?

Only time will tell :)

And then, what would you propose to do with adapters that have SD
support disabled? There are quite a few of those in the wild, as of
right now (SD support is provided by bundled SDHCI on such systems, if
at all). A similar argument goes for other media types as well - many
controllers have xD support disabled too (I think you have one of
those - Sony really values its customers). After all, it is not healthy
to have dead code in the kernel.


A typical kernel config is an allmodconfig, which has tons of dead
code: just see the 'General setup' part of your distro '.config'.
There are items like 'SMP' selected by default for 686+ CPUs. And
this is far more overhead than a single check of the card type on
insert.

To allow customization, boolean module options that disable certain
card types may suffice.
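
(For instance - parameter names are made up for illustration:)

/* Per-media enable knobs as boolean module parameters. */
static int enable_sd = 1;
module_param(enable_sd, bool, 0444);
MODULE_PARM_DESC(enable_sd, "Enable SD slots (default: on)");

static int enable_xd = 1;
module_param(enable_xd, bool, 0444);
MODULE_PARM_DESC(enable_xd, "Enable xD slots (default: on)");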

And again, you are doing a great work with the driver.

--
Sergey Yanovich


[PATCH -mm take2] 64bit-futex - provide new commands instead of new syscall

2007-04-24 Thread Pierre Peiffer

Ulrich Drepper wrote:


It looks mostly good.  I wouldn't use the high bit to differentiate
the 64-bit operations, though.  Since we do not allow to apply it to
all operations the only effect will be that the compiler has a harder
time generating the code for the switch statement.  If you use
continuous values a simple jump table can be used and no conditionals.
Smaller and faster.



Something like that may be...

Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]


--
Pierre
---
 include/asm-ia64/futex.h|8 -
 include/asm-powerpc/futex.h |6 -
 include/asm-s390/futex.h|8 -
 include/asm-sparc64/futex.h |8 -
 include/asm-um/futex.h  |9 -
 include/asm-x86_64/futex.h  |   86 --
 include/asm-x86_64/unistd.h |2 
 include/linux/futex.h   |6 +
 include/linux/syscalls.h|3 
 kernel/futex.c  |  203 ++--
 kernel/futex_compat.c   |2 
 kernel/sys_ni.c |1 
 12 files changed, 95 insertions(+), 247 deletions(-)

Index: b/include/asm-ia64/futex.h
===
--- a/include/asm-ia64/futex.h
+++ b/include/asm-ia64/futex.h
@@ -124,13 +124,7 @@ futex_atomic_cmpxchg_inatomic(int __user
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* _ASM_FUTEX_H */
Index: b/include/asm-powerpc/futex.h
===
--- a/include/asm-powerpc/futex.h
+++ b/include/asm-powerpc/futex.h
@@ -119,11 +119,5 @@ futex_atomic_cmpxchg_inatomic64(u64 __us
 	return 0;
 }
 
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
-}
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_FUTEX_H */
Index: b/include/asm-s390/futex.h
===
--- a/include/asm-s390/futex.h
+++ b/include/asm-s390/futex.h
@@ -51,13 +51,7 @@ static inline int futex_atomic_cmpxchg_i
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* __KERNEL__ */
Index: b/include/asm-sparc64/futex.h
===
--- a/include/asm-sparc64/futex.h
+++ b/include/asm-sparc64/futex.h
@@ -108,13 +108,7 @@ futex_atomic_cmpxchg_inatomic(int __user
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* !(_SPARC64_FUTEX_H) */
Index: b/include/asm-um/futex.h
===
--- a/include/asm-um/futex.h
+++ b/include/asm-um/futex.h
@@ -6,14 +6,7 @@
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
+	return -ENOSYS;
 }
 
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
-}
-
-
 #endif
Index: b/include/asm-x86_64/futex.h
===
--- a/include/asm-x86_64/futex.h
+++ b/include/asm-x86_64/futex.h
@@ -41,38 +41,6 @@
 	  =r (tem)		\
 	: r (oparg), i (-EFAULT), m (*uaddr), 1 (0))
 
-#define __futex_atomic_op1_64(insn, ret, oldval, uaddr, oparg) \
-  __asm__ __volatile (		\
-"1:	" insn "\n"		\
-"2:	.section .fixup,\"ax\"\n\
-3:	movq	%3, %1\n\
-	jmp	2b\n\
-	.previous\n\
-	.section __ex_table,\"a\"\n\
-	.align	8\n\
-	.quad	1b,3b\n\
-	.previous"		\
-	: "=r" (oldval), "=r" (ret), "=m" (*uaddr)		\
-	: "i" (-EFAULT), "m" (*uaddr), "0" (oparg), "1" (0))
-
-#define __futex_atomic_op2_64(insn, ret, oldval, uaddr, oparg) \
-  __asm__ __volatile (		\
-"1:	movq	%2, %0\n\
-	movq	%0, %3\n"	\
-	insn "\n"		\
-"2:	" LOCK_PREFIX "cmpxchgq %3, %2\n\
-	jnz	1b\n\
-3:	.section .fixup,\"ax\"\n\
-4:	movq	%5, %1\n\
-	jmp	3b\n\
-	.previous\n\
-	.section __ex_table,\"a\"\n\
-	.align	8\n\
-	.quad	1b,4b,2b,4b\n\
-	.previous"		\
-	: "=a" (oldval), "=r" (ret), "=m" (*uaddr),		\
-	  "=&r" (tem)		\
-	: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
 
 static inline int
 futex_atomic_op_inuser (int encoded_op, int __user *uaddr)
@@ -128,60 +96,6 @@ futex_atomic_op_inuser (int encoded_op, 
 }
 
 static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	u64 oparg = (encoded_op << 8) >> 20;
-	u64 cmparg = (encoded_op << 20) >> 20;
-	u64 oldval = 0, ret, tem;
-
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u64)))
-		return -EFAULT;
-
-	

Re: 2.6.21-rc6-mm1

2007-04-24 Thread J.A. Magallón
On Sun, 8 Apr 2007 14:35:59 -0700, Andrew Morton [EMAIL PROTECTED] wrote:

 
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/
 
 
 - Lots of x86 updates
 

Has something related to PTYs changed in this kernel?
I have to enable legacy PTY handling in a couple of boxes to get ssh working.
Otherwise I had openpty() errors, and neither sshd nor virtual terminals
(aterm) were able to get a terminal.

User space (udev) is the same in the three boxes; one works and two fail.
I had /dev/ptmx everywhere and /dev/pts mounted.

Any idea?
TIA
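
(A minimal openpty() reproducer for anyone who wants to test this
without sshd - illustrative only; build with gcc test.c -lutil:)

#include <pty.h>
#include <stdio.h>

int main(void)
{
	int master, slave;
	char name[64];

	if (openpty(&master, &slave, name, NULL, NULL) < 0) {
		perror("openpty");
		return 1;
	}
	printf("got pty %s\n", name);
	return 0;
}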

--
J.A. Magallon jamagallon()ono!com \   Software is like sex:
 \ It's better when it's free
Mandriva Linux release 2008.0 (Cooker) for i586
Linux 2.6.20-jam10 (gcc 4.1.2 20070302 (prerelease) (4.1.2-1mdv2007.1)) #1 SMP 
PREEMPT


Re: [PATCH] mm: PageLRU can be non-atomic bit operation

2007-04-24 Thread Hisashi Hifumi


At 11:47 07/04/24, Nick Piggin wrote:

As Hugh points out, we must have atomic ops here, so changing the generic
code to use the __ version is wrong. However if there is a faster way that
i386 can perform the atomic variant, then doing so will speed up the generic
code without breaking other architectures.


Do you mean writing an i386-specific page-flags.h, so the generic code
is improved without breaking other architectures?
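
(For context, the distinction at stake - a generic illustration, not a
patch:)

	/* set_bit() is the atomic RMW generic code must keep using;
	 * __set_bit() is the cheaper non-atomic variant, only safe
	 * when nobody else can touch page->flags concurrently. */
	set_bit(PG_lru, &page->flags);		/* atomic */
	__set_bit(PG_lru, &page->flags);	/* non-atomic */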



Re: [RFC][PATCH -mm take4 2/6] support multiple logging

2007-04-24 Thread Keiichi KII

On Fri, 20 Apr 2007 18:51:13 +0900
Keiichi KII [EMAIL PROTECTED] wrote:


I started to do some cleanups and fixups here, but abandoned it when it was
all getting a bit large.

Here are some fixes against this patch:
I'm going to fix my patches by following your reviews and send new patches 
on the LKML and the netdev ML in a few days.




Well..  before you can finish this work we need to decide upon what the
interface to userspace will be.

- The miscdev isn't appropriate



Why isn't miscdev appropriate? 
Is it just that miscdev isn't conventionally used for networking?


--
Keiichi KII
NEC Corporation OSS Promotion Center
E-mail: [EMAIL PROTECTED]







Re: [RFC][PATCH -mm take4 2/6] support multiple logging

2007-04-24 Thread Keiichi KII

We don't really have anything that corresponds to netpoll's
connections at higher levels.

I'm tempted to say we should make this work more like the dummy
network device. ie:

modprobe netconsole -o netcon1 [params]
modprobe netconsole -o netcon2 [params]


The configuration of netconsoles looks like the configuration of routes.
Granted you probably have more routes than netconsoles, but the interface
issues are similar.  Netlink with a small application would be nice.
And having /proc/net/netconsole (read-only) would be good for the netlink
impaired.


Are you saying that we had better use procfs instead of sysfs to show the 
configuration of netconsole?


If so, I have a question.
I thought that procfs should be restricted to process-related things as
far as possible. Is it really OK to use procfs here? 


--
Keiichi KII
NEC Corporation OSS Promotion Center
E-mail: [EMAIL PROTECTED]




[PATCH 0/15] CFQ IO scheduler patch series

2007-04-24 Thread Jens Axboe
Hi,

I have a series of patches for the CFQ IO scheduler that I'd like to get
some more testing on. The patch series is also scheduled to enter the
next -mm, but I'd like people to consciously give it a spin on its
own as well. The patches are also available from the 'cfq' branch of the
block layer tree:

git://git.kernel.dk/data/git/linux-2.6-block.git

and I've uploaded a rolled up version here as well:

http://brick.kernel.dk/snaps/cfq-update-20070424

The patch series is essentially a series of cleanups and smaller
optimizations, but there's also a larger change in there (patches 4 to
7) that completely rework how CFQ selects which queue to process. It's
an experimental approach similar to the CFS CPU scheduler, in which
management lists are converted to a single rbtree instead.

So give it a spin if you have the time, and let me know how it performs
and/or feels for your workload and hardware.

 cfq-iosched.c |  676 ++
 1 file changed, 357 insertions(+), 319 deletions(-)

-- 
Jens Axboe





[PATCH 1/15] cfq-iosched: improve preemption for cooperating tasks

2007-04-24 Thread Jens Axboe
When testing the syslet async io approach, I discovered that CFQ
sometimes didn't perform as well as expected. cfq_should_preempt()
needs to better check for cooperating tasks, so fix that by allowing
preemption of an equal priority queue if the recently queued request
is as good a candidate for IO as the one we are currently waiting for.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   26 --
 1 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9e37971..a683d00 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -861,15 +861,11 @@ static int cfq_arm_slice_timer(struct cfq_data *cfqd)
 
 static void cfq_dispatch_insert(request_queue_t *q, struct request *rq)
 {
-   struct cfq_data *cfqd = q->elevator->elevator_data;
    struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
    cfq_remove_request(rq);
    cfqq->on_dispatch[rq_is_sync(rq)]++;
    elv_dispatch_sort(q, rq);
-
-   rq = list_entry(q->queue_head.prev, struct request, queuelist);
-   cfqd->last_sector = rq->sector + rq->nr_sectors;
 }
 
 /*
@@ -1579,6 +1575,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
    struct request *rq)
 {
    struct cfq_queue *cfqq = cfqd->active_queue;
+   sector_t dist;
 
if (cfq_class_idle(new_cfqq))
return 0;
@@ -1588,14 +1585,14 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 
    if (cfq_class_idle(cfqq))
        return 1;
-   if (!cfq_cfqq_wait_request(new_cfqq))
-       return 0;
+
    /*
     * if the new request is sync, but the currently running queue is
     * not, let the sync request have priority.
     */
    if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
return 1;
+
/*
 * So both queues are sync. Let the new request get disk time if
 * it's a metadata request and the current queue is doing regular IO.
@@ -1603,6 +1600,21 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
    if (rq_is_meta(rq) && !cfqq->meta_pending)
        return 1;
 
+   if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+   return 0;
+
+   /*
+    * if this request is as-good as one we would expect from the
+    * current cfqq, let it preempt
+    */
+   if (rq->sector > cfqd->last_sector)
+       dist = rq->sector - cfqd->last_sector;
+   else
+       dist = cfqd->last_sector - rq->sector;
+
+   if (dist <= cfqd->active_cic->seek_mean)
+       return 1;
+
return 0;
 }
 
@@ -1719,6 +1731,8 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq)
    cfqq->on_dispatch[sync]--;
    cfqq->service_last = now;
 
+   cfqd->last_sector = rq->hard_sector + rq->hard_nr_sectors;
+
if (!cfq_class_idle(cfqq))
cfqd-last_end_request = now;
 
-- 
1.5.1.1.190.g74474



[PATCH 2/15] cfq-iosched: development update

2007-04-24 Thread Jens Axboe
- Implement logic for detecting cooperating processes, so we
  choose the best available queue whenever possible.

- Improve residual slice time accounting.

- Remove dead code: we no longer see async requests coming in on
  sync queues. That part was removed a long time ago. That means
  that we can also remove the difference between cfq_cfqq_sync()
  and cfq_cfqq_class_sync(); they are now identical. And we can
  kill the on_dispatch array, just make it a counter.

- Allow a process to go into the current list, if it hasn't been
  serviced in this scheduler tick yet.

Possible future improvements include caching the cfqq lookup
in cfq_close_cooperator(), so we don't have to look it up twice.
cfq_get_best_queue() should just use that last decision instead
of doing it again.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |  381 +++
 1 files changed, 261 insertions(+), 120 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a683d00..3883ba8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -56,13 +56,7 @@ static struct completion *ioc_gone;
 #define ASYNC  (0)
 #define SYNC   (1)
 
-#define cfq_cfqq_dispatched(cfqq)  \
-   ((cfqq)->on_dispatch[ASYNC] + (cfqq)->on_dispatch[SYNC])
-
-#define cfq_cfqq_class_sync(cfqq)  ((cfqq)->key != CFQ_KEY_ASYNC)
-
-#define cfq_cfqq_sync(cfqq)\
-   (cfq_cfqq_class_sync(cfqq) || (cfqq)->on_dispatch[SYNC])
+#define cfq_cfqq_sync(cfqq)    ((cfqq)->key != CFQ_KEY_ASYNC)
 
 #define sample_valid(samples)  ((samples)  80)
 
@@ -79,6 +73,7 @@ struct cfq_data {
struct list_head busy_rr;
struct list_head cur_rr;
struct list_head idle_rr;
+   unsigned long cur_rr_tick;
unsigned int busy_queues;
 
/*
@@ -98,11 +93,12 @@ struct cfq_data {
struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;
int cur_prio, cur_end_prio;
+   unsigned long prio_time;
unsigned int dispatch_slice;
 
struct timer_list idle_class_timer;
 
-   sector_t last_sector;
+   sector_t last_position;
unsigned long last_end_request;
 
/*
@@ -117,6 +113,9 @@ struct cfq_data {
unsigned int cfq_slice_idle;
 
struct list_head cic_list;
+
+   sector_t new_seek_mean;
+   u64 new_seek_total;
 };
 
 /*
@@ -133,6 +132,8 @@ struct cfq_queue {
unsigned int key;
/* member of the rr/busy/cur/idle cfqd list */
struct list_head cfq_list;
+   /* in what tick we were last serviced */
+   unsigned long rr_tick;
/* sorted list of pending requests */
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
@@ -148,10 +149,11 @@ struct cfq_queue {
 
unsigned long slice_end;
unsigned long service_last;
+   unsigned long slice_start;
long slice_resid;
 
-   /* number of requests that are on the dispatch list */
-   int on_dispatch[2];
+   /* number of requests that are on the dispatch list or inside driver */
+   int dispatched;
 
/* io prio of this group */
unsigned short ioprio, org_ioprio;
@@ -159,6 +161,8 @@ struct cfq_queue {
 
/* various state flags, see below */
unsigned int flags;
+
+   sector_t last_request_pos;
 };
 
 enum cfqq_state_flags {
@@ -259,6 +263,8 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
     * easily introduce oscillations.
     */
    cfqq->slice_resid = 0;
+
+   cfqq->slice_start = jiffies;
 }
 
 /*
@@ -307,7 +313,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
    s1 = rq1->sector;
    s2 = rq2->sector;
 
-   last = cfqd->last_sector;
+   last = cfqd->last_position;
 
/*
 * by definition, 1KiB is 2 sectors
@@ -398,39 +404,42 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev);
 }
 
-static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted)
+/*
+ * This function finds out where to insert a BE queue in the service hierarchy
+ */
+static void cfq_resort_be_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+   int preempted)
 {
-   struct cfq_data *cfqd = cfqq->cfqd;
struct list_head *list, *n;
struct cfq_queue *__cfqq;
+   int add_tail = 0;
 
/*
-* Resorting requires the cfqq to be on the RR list already.
+* if cfqq has requests in flight, don't allow it to be
+* found in cfq_set_active_queue before it has finished them.
+* this is done to increase fairness between a process that
+* has lots of io pending vs one that only generates one
+* sporadically or synchronously
 */
-   if (!cfq_cfqq_on_rr(cfqq))
-   return;
-
-   

Re: [REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Michael Gerdau
  What I also don't understand is the difference in load average, sd 
  constantly had higher values, the above figures are representative for 
  the whole log. I don't know which is better though.
 
 hm, it's hard from here to tell that. What load average does the vanilla 
 kernel report? I'd take that as a reference.

I will redo this test with sd-0.46, cfs-v5 and mainline later today.

 interesting - CFS has half the context-switch rate of SD. That is 
 probably because on your workload CFS defaults to longer 'timeslices' 
 than SD. You can influence the 'timeslice length' under SD via 
 /proc/sys/kernel/rr_interval (milliseconds units) and under CFS via 
 /proc/sys/kernel/sched_granularity_ns. On CFS the value is not 
 necessarily the timeslice length you will observe - for example in your 
 workload above the granularity is set to 5 msec, but your rescheduling 
 rate is 13 msecs. SD default to a rr_interval value of 8 msecs, which in 
 your workload produces a timeslice length of 6-7 msecs.
 
 so to be totally 'fair' and get the same rescheduling 'granularity' you 
 should probably lower CFS's sched_granularity_ns to 2 msecs.

I'll change default nice in cfs to -10.

I'm also happy to adjust /proc/sys/kernel/sched_granularity_ns to 2msec.
However checking /proc/sys/kernel/rr_interval reveals it is 16 (msec)
on my system.

Anyway, I'll have to do some urgent other work and won't be able to
do lots of testing until tonight (but then I will).

Best,
Michael
-- 
 Technosis GmbH, Geschäftsführer: Michael Gerdau, Tobias Dittmar
 Sitz Hamburg; HRB 89145 Amtsgericht Hamburg
 Vote against SPAM - see http://www.politik-digital.de/spam/
 Michael Gerdau   email: [EMAIL PROTECTED]
 GPG-keys available on request or at public keyserver




Re: [RFC] another scheduler beater

2007-04-24 Thread Ingo Molnar

* Bill Davidsen [EMAIL PROTECTED] wrote:

 The small attached script does a nice job of showing animation 
 glitches in the glxgears animation. I have run one set of tests, and 
 will have several more tomorrow. I'm off to a poker game, and would 
 like to let people draw their own conclusions.
 
 Based on just this script as load I would say renice on X isn't a good 
 thing. Based on one small test, I would say that renice of X in 
 conjunction with heavy disk i/o and a single fast scrolling xterm 
 (think kernel compile) seems to slow the raid6 thread measurably. 
 Results late tomorrow, it will be an early and long day :-(

hm, i'm wondering what you would expect the scheduler to do here?

for this particular test you'll get the best result by renicing X to 
+19! Why? Because, as far as i can see this is a partially 'inverted' 
test of X's scheduling.

While the script is definitely useful (you taught me that nice xterm 
-geom trick to automate the placing of busy xterms :), some caveats do 
apply when interpreting the results:

If you have a kernel 3D driver (which you seem to have, judging by the 
glxgears numbers you are getting) then running 'glxgears' wont involve X 
at all. glxgears just gets its own window and then the kernel driver 
draws straight into it, without any side-trips to X. You can see this 
for yourself by starting glitch1.sh from an ssh terminal, and then 
_totally stop_ the X server via kill -STOP 12345 - all the xterms will 
stop, the X desktop freezes, but the glxgears instance will still 
happily draw its stuff and wheels are happily turning on the screen.

So in this sense glxgears is a 'CPU hog' workload, largely independent 
of X.

now, by renicing X to -10 and running the xterms you'll definitely hurt 
CPU hogs - even if it happens to be a glxgears process that draws 3D 
graphics in a window provided by X. But this is precisely what is 
supposed to happen in this case. You should get the best glxgears 
performance by renicing X to _+19_, and that seems to be happening 
according to your numbers - and that's what happens in my own testing 
too.

Ingo


[PATCH 3/15] cfq-iosched: minor updates

2007-04-24 Thread Jens Axboe
- Move the queue_new flag clear to when the queue is selected
- Only select the non-first queue in cfq_get_best_queue(), if there's
  a substantial difference between the best and first.
- Get rid of -busy_rr
- Only select a close cooperator, if the current queue is known to take
  a while to think.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   81 +++---
 1 files changed, 18 insertions(+), 63 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3883ba8..04fea76 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -70,7 +70,6 @@ struct cfq_data {
 * rr list of queues with requests and the count of them
 */
struct list_head rr_list[CFQ_PRIO_LISTS];
-   struct list_head busy_rr;
struct list_head cur_rr;
struct list_head idle_rr;
unsigned long cur_rr_tick;
@@ -410,59 +409,18 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 static void cfq_resort_be_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq,
int preempted)
 {
-   struct list_head *list, *n;
-   struct cfq_queue *__cfqq;
-   int add_tail = 0;
-
-   /*
-* if cfqq has requests in flight, don't allow it to be
-* found in cfq_set_active_queue before it has finished them.
-* this is done to increase fairness between a process that
-* has lots of io pending vs one that only generates one
-* sporadically or synchronously
-*/
-   if (cfqq->dispatched)
-       list = &cfqd->busy_rr;
-   else if (cfqq->ioprio == (cfqd->cur_prio + 1) &&
-            cfq_cfqq_sync(cfqq) &&
-            (time_before(cfqd->prio_time, cfqq->service_last) ||
-             cfq_cfqq_queue_new(cfqq) || preempted)) {
-       list = &cfqd->cur_rr;
-       add_tail = 1;
-   } else
-       list = &cfqd->rr_list[cfqq->ioprio];
-
-   if (!cfq_cfqq_sync(cfqq) || add_tail) {
-   /*
-* async queue always goes to the end. this wont be overly
-* unfair to writes, as the sort of the sync queue wont be
-* allowed to pass the async queue again.
-*/
-   list_add_tail(cfqq-cfq_list, list);
-   } else if (preempted || cfq_cfqq_queue_new(cfqq)) {
-   /*
-* If this queue was preempted or is new (never been serviced),
-* let it be added first for fairness but beind other new
-* queues.
-*/
-       n = list;
-       while (n->next != list) {
-           __cfqq = list_entry_cfqq(n->next);
-           if (!cfq_cfqq_queue_new(__cfqq))
-               break;
+   if (!cfq_cfqq_sync(cfqq))
+       list_add_tail(&cfqq->cfq_list, &cfqd->rr_list[cfqq->ioprio]);
+   else {
+       struct list_head *n = &cfqd->rr_list[cfqq->ioprio];
 
-           n = n->next;
-       }
-       list_add(&cfqq->cfq_list, n);
-   } else {
/*
 * sort by last service, but don't cross a new or async
-* queue. we don't cross a new queue because it hasn't been
-* service before, and we don't cross an async queue because
-* it gets added to the end on expire.
+* queue. we don't cross a new queue because it hasn't
+* been service before, and we don't cross an async
+* queue because it gets added to the end on expire.
 */
-       n = list;
-       while ((n = n->prev) != list) {
+       while ((n = n->prev) != &cfqd->rr_list[cfqq->ioprio]) {
            struct cfq_queue *__c = list_entry_cfqq(n);
 
            if (!cfq_cfqq_sync(__c) || !__c->service_last)
@@ -719,6 +677,7 @@ __cfq_set_active_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
    cfq_clear_cfqq_must_alloc_slice(cfqq);
    cfq_clear_cfqq_fifo_expire(cfqq);
    cfq_mark_cfqq_slice_new(cfqq);
+   cfq_clear_cfqq_queue_new(cfqq);
    cfqq->rr_tick = cfqd->cur_rr_tick;
}
 
@@ -737,7 +696,6 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
cfq_clear_cfqq_must_dispatch(cfqq);
cfq_clear_cfqq_wait_request(cfqq);
-   cfq_clear_cfqq_queue_new(cfqq);
 
/*
 * store what was left of this slice, if the queue idled out
@@ -839,13 +797,15 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
 static struct cfq_queue *cfq_get_best_queue(struct cfq_data *cfqd)
 {
struct cfq_queue *cfqq = NULL, *__cfqq;
-   sector_t best = -1, dist;
+   sector_t best = -1, first = -1, dist;
 
    list_for_each_entry(__cfqq, &cfqd->cur_rr, cfq_list) {
        if (!__cfqq->next_rq ||

[PATCH 12/15] cfq-iosched: get rid of -dispatch_slice

2007-04-24 Thread Jens Axboe
We can track it fairly accurately locally, let the slice handling
take care of the rest.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |6 +-
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b680002..8f76aed 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -106,7 +106,6 @@ struct cfq_data {
 
struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;
-   unsigned int dispatch_slice;
 
struct timer_list idle_class_timer;
 
@@ -769,8 +768,6 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
        put_io_context(cfqd->active_cic->ioc);
        cfqd->active_cic = NULL;
    }
-
-   cfqd->dispatch_slice = 0;
 }
 
 static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
@@ -1020,7 +1017,6 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
     */
    cfq_dispatch_insert(cfqd->queue, rq);
 
-   cfqd->dispatch_slice++;
    dispatched++;
 
    if (!cfqd->active_cic) {
@@ -1038,7 +1034,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
     * queue always expire after 1 dispatch round.
     */
    if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
-       cfqd->dispatch_slice >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
+       dispatched >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
        cfq_class_idle(cfqq))) {
        cfqq->slice_end = jiffies + 1;
        cfq_slice_expired(cfqd, 0);
-- 
1.5.1.1.190.g74474



[PATCH 5/15] cfq-iosched: speed up rbtree handling

2007-04-24 Thread Jens Axboe
For cases where the rbtree is mainly used for sorting and min retrieval,
a nice speedup of the rbtree code is to maintain a cache of the leftmost
node in the tree.

Also spotted in the CFS CPU scheduler code.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   62 +++---
 1 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ad29a99..7f964ee 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -70,6 +70,18 @@ static struct completion *ioc_gone;
 #define sample_valid(samples)  ((samples)  80)
 
 /*
+ * Most of our rbtree usage is for sorting with min extraction, so
+ * if we cache the leftmost node we don't have to walk down the tree
+ * to find it. Idea borrowed from Ingo Molnars CFS scheduler. We should
+ * move this into the elevator for the rq sorting as well.
+ */
+struct cfq_rb_root {
+   struct rb_root rb;
+   struct rb_node *left;
+};
+#define CFQ_RB_ROOT(struct cfq_rb_root) { RB_ROOT, NULL, }
+
+/*
  * Per block device queue structure
  */
 struct cfq_data {
@@ -78,7 +90,7 @@ struct cfq_data {
/*
 * rr list of queues with requests and the count of them
 */
-   struct rb_root service_tree;
+   struct cfq_rb_root service_tree;
struct list_head cur_rr;
struct list_head idle_rr;
unsigned int busy_queues;
@@ -378,6 +390,23 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
}
 }
 
+static struct rb_node *cfq_rb_first(struct cfq_rb_root *root)
+{
+   if (root->left)
+       return root->left;
+
+   return rb_first(&root->rb);
+}
+
+static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
+{
+   if (root->left == n)
+       root->left = NULL;
+
+   rb_erase(n, &root->rb);
+   RB_CLEAR_NODE(n);
+}
+
 /*
  * would be nice to take fifo expire time into account as well
  */
@@ -417,10 +446,10 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
 static void cfq_service_tree_add(struct cfq_data *cfqd,
        struct cfq_queue *cfqq)
 {
-   struct rb_node **p = &cfqd->service_tree.rb_node;
+   struct rb_node **p = &cfqd->service_tree.rb.rb_node;
    struct rb_node *parent = NULL;
-   struct cfq_queue *__cfqq;
    unsigned long rb_key;
+   int left = 1;
 
    rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
    rb_key += cfqq->slice_resid;
@@ -433,22 +462,29 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
    if (rb_key == cfqq->rb_key)
return;
 
-       rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+       cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
    }
 
    while (*p) {
+       struct cfq_queue *__cfqq;
+
        parent = *p;
        __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 
        if (rb_key < __cfqq->rb_key)
            p = &(*p)->rb_left;
-       else
+       else {
            p = &(*p)->rb_right;
+           left = 0;
+       }
    }
 
+   if (left)
+       cfqd->service_tree.left = &cfqq->rb_node;
+
    cfqq->rb_key = rb_key;
    rb_link_node(&cfqq->rb_node, parent, p);
-   rb_insert_color(&cfqq->rb_node, &cfqd->service_tree);
+   rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
 }
 
 static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted)
@@ -509,10 +545,8 @@ cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
    cfq_clear_cfqq_on_rr(cfqq);
    list_del_init(&cfqq->cfq_list);
 
-   if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-       rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-       RB_CLEAR_NODE(&cfqq->rb_node);
-   }
+   if (!RB_EMPTY_NODE(&cfqq->rb_node))
+       cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
 
    BUG_ON(!cfqd->busy_queues);
    cfqd->busy_queues--;
@@ -758,8 +792,8 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
     * if current list is non-empty, grab first entry.
     */
    cfqq = list_entry_cfqq(cfqd->cur_rr.next);
-   } else if (!RB_EMPTY_ROOT(&cfqd->service_tree)) {
-       struct rb_node *n = rb_first(&cfqd->service_tree);
+   } else if (!RB_EMPTY_ROOT(&cfqd->service_tree.rb)) {
+       struct rb_node *n = cfq_rb_first(&cfqd->service_tree);
 
        cfqq = rb_entry(n, struct cfq_queue, rb_node);
    } else if (!list_empty(&cfqd->idle_rr)) {
@@ -1030,7 +1064,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
    int dispatched = 0;
    struct rb_node *n;
 
-   while ((n = rb_first(&cfqd->service_tree)) != NULL) {
+   while ((n = cfq_rb_first(&cfqd->service_tree)) != NULL) {
struct cfq_queue *cfqq = rb_entry(n, struct cfq_queue, rb_node);
 

[PATCH 8/15] cfq-iosched: style cleanups and comments

2007-04-24 Thread Jens Axboe
Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   66 ++
 1 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e6cc77f..f86ff4d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -222,7 +222,7 @@ CFQ_CFQQ_FNS(slice_new);
 
 static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *, unsigned int, unsigned short);
 static void cfq_dispatch_insert(request_queue_t *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *cfqd, unsigned int key, struct task_struct *tsk, gfp_t gfp_mask);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, unsigned int, struct task_struct *, gfp_t);
 
 /*
  * scheduler run of queue, if there are requests pending and no one in the
@@ -389,6 +389,9 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
}
 }
 
+/*
+ * The below is leftmost cache rbtree addon
+ */
 static struct rb_node *cfq_rb_first(struct cfq_rb_root *root)
 {
    if (root->left)
@@ -442,13 +445,18 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
    return ((cfqd->busy_queues - 1) * cfq_prio_slice(cfqd, 1, 0));
 }
 }
 
+/*
+ * The cfqd->service_tree holds all pending cfq_queue's that have
+ * requests waiting to be processed. It is sorted in the order that
+ * we will service the queues.
+ */
 static void cfq_service_tree_add(struct cfq_data *cfqd,
        struct cfq_queue *cfqq)
 {
    struct rb_node **p = &cfqd->service_tree.rb.rb_node;
    struct rb_node *parent = NULL;
    unsigned long rb_key;
-   int left = 1;
+   int left;
 
    rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
    rb_key += cfqq->slice_resid;
@@ -464,6 +472,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
        cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
    }
 
+   left = 1;
    while (*p) {
        struct cfq_queue *__cfqq;
        struct rb_node **n;
@@ -503,17 +512,16 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
    rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
 }
 
+/*
+ * Update cfqq's position in the service tree.
+ */
 static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted)
 {
-   struct cfq_data *cfqd = cfqq->cfqd;
-
/*
 * Resorting requires the cfqq to be on the RR list already.
 */
-   if (!cfq_cfqq_on_rr(cfqq))
-   return;
-
-   cfq_service_tree_add(cfqd, cfqq);
+   if (cfq_cfqq_on_rr(cfqq))
+   cfq_service_tree_add(cfqq-cfqd, cfqq);
 }
 
 /*
@@ -530,6 +538,10 @@ cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue 
*cfqq)
cfq_resort_rr_list(cfqq, 0);
 }
 
+/*
+ * Called when the cfqq no longer has requests pending, remove it from
+ * the service tree.
+ */
 static inline void
 cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
@@ -648,8 +660,7 @@ static void cfq_remove_request(struct request *rq)
}
 }
 
-static int
-cfq_merge(request_queue_t *q, struct request **req, struct bio *bio)
+static int cfq_merge(request_queue_t *q, struct request **req, struct bio *bio)
 {
struct cfq_data *cfqd = q-elevator-elevator_data;
struct request *__rq;
@@ -775,6 +786,10 @@ static inline void cfq_slice_expired(struct cfq_data 
*cfqd, int preempted,
__cfq_slice_expired(cfqd, cfqq, preempted, timed_out);
 }
 
+/*
+ * Get next queue for service. Unless we have a queue preemption,
+ * we'll simply select the first cfqq in the service tree.
+ */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
struct cfq_queue *cfqq = NULL;
@@ -786,10 +801,11 @@ static struct cfq_queue *cfq_get_next_queue(struct 
cfq_data *cfqd)
cfqq = list_entry_cfqq(cfqd-cur_rr.next);
} else if (!RB_EMPTY_ROOT(cfqd-service_tree.rb)) {
struct rb_node *n = cfq_rb_first(cfqd-service_tree);
-   unsigned long end;
 
cfqq = rb_entry(n, struct cfq_queue, rb_node);
if (cfq_class_idle(cfqq)) {
+   unsigned long end;
+
/*
 * if we have idle queues and no rt or be queues had
 * pending requests, either allow immediate service if
@@ -807,6 +823,9 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data 
*cfqd)
return cfqq;
 }
 
+/*
+ * Get and set a new active queue for service.
+ */
 static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd)
 {
struct cfq_queue *cfqq;
@@ -892,6 +911,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
mod_timer(cfqd-idle_slice_timer, jiffies + sl);
 }
 
+/*
+ * Move request from internal lists to the request queue dispatch list.
+ */
 static void cfq_dispatch_insert(request_queue_t *q, struct request *rq)
 {
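
[Editor's aside: for readers new to the "leftmost cache" trick this series
relies on - keeping a pointer to the smallest node makes "give me the next
queue to service" O(1) instead of O(log n). A minimal plain-C sketch follows
(illustration only, not kernel code; names are made up and rebalancing is
omitted):

	#include <stddef.h>

	struct node { long key; struct node *left, *right; };

	struct tree {
		struct node *root;
		struct node *leftmost;	/* cached minimum */
	};

	/* O(1) "first" - this is what cfq_rb_first() amortizes */
	static struct node *tree_first(struct tree *t)
	{
		return t->leftmost;
	}

	static void tree_insert(struct tree *t, struct node *n)
	{
		struct node **p = &t->root;
		int left = 1;		/* stays 1 only if we never go right */

		while (*p) {
			if (n->key < (*p)->key)
				p = &(*p)->left;
			else {
				p = &(*p)->right;
				left = 0;
			}
		}
		n->left = n->right = NULL;
		*p = n;
		if (left)
			t->leftmost = n;	/* new overall minimum */
	}

On erase, the cache has to be fixed up the other way round: if the node being
removed is the cached leftmost, advance the cache to its in-order successor
first - roughly what the cfq_rb_erase()/cfq_rb_first() pair above does with
the kernel's rb_next()/rb_erase().]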
   

[PATCH 14/15] cfq-iosched: improve sync vs async workloads

2007-04-24 Thread Jens Axboe
Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   31 ++-
 1 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f920527..772df89 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -96,6 +96,7 @@ struct cfq_data {
	struct hlist_head *cfq_hash;
 
	int rq_in_driver;
+	int sync_flight;
	int hw_tag;
 
	/*
@@ -905,11 +906,15 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(request_queue_t *q, struct request *rq)
 {
+	struct cfq_data *cfqd = q->elevator->elevator_data;
	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
	cfq_remove_request(rq);
	cfqq->dispatched++;
	elv_dispatch_sort(q, rq);
+
+	if (cfq_cfqq_sync(cfqq))
+		cfqd->sync_flight++;
 }
 
 /*
@@ -1094,27 +1099,24 @@ static int cfq_dispatch_requests(request_queue_t *q, int force)
	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
		int max_dispatch;
 
-		if (cfqd->busy_queues > 1) {
-			/*
-			 * So we have dispatched before in this round, if the
-			 * next queue has idling enabled (must be sync), don't
-			 * allow it service until the previous have completed.
-			 */
-			if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq) &&
-			    dispatched)
+		max_dispatch = cfqd->cfq_quantum;
+		if (cfq_class_idle(cfqq))
+			max_dispatch = 1;
+
+		if (cfqq->dispatched >= max_dispatch) {
+			if (cfqd->busy_queues > 1)
				break;
-			if (cfqq->dispatched >= cfqd->cfq_quantum)
+			if (cfqq->dispatched >= 4 * max_dispatch)
				break;
		}
 
+		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+			break;
+
		cfq_clear_cfqq_must_dispatch(cfqq);
		cfq_clear_cfqq_wait_request(cfqq);
		del_timer(&cfqd->idle_slice_timer);
 
-		max_dispatch = cfqd->cfq_quantum;
-		if (cfq_class_idle(cfqq))
-			max_dispatch = 1;
-
		dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
	}
 
@@ -1767,6 +1769,9 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq)
	cfqd->rq_in_driver--;
	cfqq->dispatched--;
 
+	if (cfq_cfqq_sync(cfqq))
+		cfqd->sync_flight--;
+
	if (!cfq_class_idle(cfqq))
		cfqd->last_end_request = now;
 
-- 
1.5.1.1.190.g74474



[PATCH 4/15] cfq-iosched: rework the whole round-robin list concept

2007-04-24 Thread Jens Axboe
Drawing on some inspiration from the CFS CPU scheduler design, overhaul
the pending cfq_queue concept list management. Currently CFQ uses a
doubly linked list per priority level for sorting and service uses.
Kill those lists and maintain an rbtree of cfq_queue's, sorted by when
to service them.

This unfortunately means that the ionice levels aren't as strong
anymore; I will work on improving those later. We only scale the slice
time now, not the number of times we service. This means that latency
is better (for all priority levels), but that the distinction between
the highest and lower levels isn't as big.
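
To make the scaling concrete (my arithmetic, using the cfq_prio_slice()
formula visible in the diff below, with CFQ_SLICE_SCALE == 5 and assuming
the default sync base slice of HZ/10, i.e. 100ms at HZ=1000): prio 0 gets
100 + 20*4 = 180ms, the default prio 4 gets exactly the 100ms base, and
prio 7 gets 100 + 20*(4-7) = 40ms.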

The diffstat speaks for itself.

 cfq-iosched.c |  363 +-
 1 file changed, 125 insertions(+), 238 deletions(-)

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |  361 +-
 1 files changed, 123 insertions(+), 238 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 04fea76..ad29a99 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -26,7 +26,16 @@ static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 
+/*
+ * grace period before allowing idle class to get disk access
+ */
 #define CFQ_IDLE_GRACE	(HZ / 10)
+
+/*
+ * below this threshold, we consider thinktime immediate
+ */
+#define CFQ_MIN_TT	(2)
+
 #define CFQ_SLICE_SCALE	(5)
 
 #define CFQ_KEY_ASYNC	(0)
@@ -69,10 +78,9 @@ struct cfq_data {
	/*
	 * rr list of queues with requests and the count of them
	 */
-	struct list_head rr_list[CFQ_PRIO_LISTS];
+	struct rb_root service_tree;
	struct list_head cur_rr;
	struct list_head idle_rr;
-	unsigned long cur_rr_tick;
	unsigned int busy_queues;
 
	/*
@@ -91,8 +99,6 @@ struct cfq_data {
 
	struct cfq_queue *active_queue;
	struct cfq_io_context *active_cic;
-	int cur_prio, cur_end_prio;
-	unsigned long prio_time;
	unsigned int dispatch_slice;
 
	struct timer_list idle_class_timer;
@@ -131,8 +137,10 @@ struct cfq_queue {
	unsigned int key;
	/* member of the rr/busy/cur/idle cfqd list */
	struct list_head cfq_list;
-	/* in what tick we were last serviced */
-	unsigned long rr_tick;
+	/* service_tree member */
+	struct rb_node rb_node;
+	/* service_tree key */
+	unsigned long rb_key;
	/* sorted list of pending requests */
	struct rb_root sort_list;
	/* if fifo isn't expired, next request to serve */
@@ -147,8 +155,6 @@ struct cfq_queue {
	struct list_head fifo;
 
	unsigned long slice_end;
-	unsigned long service_last;
-	unsigned long slice_start;
	long slice_resid;
 
	/* number of requests that are on the dispatch list or inside driver */
@@ -240,30 +246,26 @@ static inline pid_t cfq_queue_pid(struct task_struct *task, int rw, int is_sync)
  * if a queue is marked sync and has sync io queued. A sync queue with async
  * io only, should not get full sync slice length.
  */
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
+				 unsigned short prio)
 {
-	const int base_slice = cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
+	const int base_slice = cfqd->cfq_slice[sync];
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(prio >= IOPRIO_BE_NR);
+
+	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
+}
 
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - cfqq->ioprio));
+static inline int
+cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
 static inline void
 cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfqq->slice_end += cfqq->slice_resid;
-
-	/*
-	 * Don't carry over residual for more than one slice, we only want
-	 * to slightly correct the fairness. Carrying over forever would
-	 * easily introduce oscillations.
-	 */
-	cfqq->slice_resid = 0;
-
-	cfqq->slice_start = jiffies;
 }
 
 /*
@@ -403,33 +405,50 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
	return cfq_choose_req(cfqd, next, prev);
 }
 
-/*
- * This function finds out where to insert a BE queue in the service hierarchy
- */
-static void cfq_resort_be_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				int preempted)
+static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
+				      struct cfq_queue *cfqq)
 {
-	if (!cfq_cfqq_sync(cfqq))
-		list_add_tail(&cfqq->cfq_list, 

[PATCH 10/15] cfq-iosched: get rid of -cur_rr and -cfq_list

2007-04-24 Thread Jens Axboe
It's only used for preemption now that the IDLE and RT queues also
use the rbtree. If we pass an 'add_front' variable to
cfq_service_tree_add(), we can set ->rb_key to 0 to force insertion
at the front of the tree.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   87 +++
 1 files changed, 32 insertions(+), 55 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 251131a..2d0e9c5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -45,9 +45,6 @@ static int cfq_slice_idle = HZ / 125;
  */
 #define CFQ_QHASH_SHIFT		6
 #define CFQ_QHASH_ENTRIES	(1 << CFQ_QHASH_SHIFT)
-#define list_entry_qhash(entry)	hlist_entry((entry), struct cfq_queue, cfq_hash)
-
-#define list_entry_cfqq(ptr)	list_entry((ptr), struct cfq_queue, cfq_list)
 
 #define RQ_CIC(rq)		((struct cfq_io_context*)(rq)->elevator_private)
 #define RQ_CFQQ(rq)		((rq)->elevator_private2)
@@ -91,7 +88,6 @@ struct cfq_data {
	 * rr list of queues with requests and the count of them
	 */
	struct cfq_rb_root service_tree;
-	struct list_head cur_rr;
	unsigned int busy_queues;
 
	/*
@@ -146,8 +142,6 @@ struct cfq_queue {
	struct hlist_node cfq_hash;
	/* hash key */
	unsigned int key;
-	/* member of the rr/busy/cur/idle cfqd list */
-	struct list_head cfq_list;
	/* service_tree member */
	struct rb_node rb_node;
	/* service_tree key */
@@ -452,16 +446,19 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
  * we will service the queues.
  */
 static void cfq_service_tree_add(struct cfq_data *cfqd,
-				    struct cfq_queue *cfqq)
+				    struct cfq_queue *cfqq, int add_front)
 {
	struct rb_node **p = &cfqd->service_tree.rb.rb_node;
	struct rb_node *parent = NULL;
	unsigned long rb_key;
	int left;
 
-	rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-	rb_key += cfqq->slice_resid;
-	cfqq->slice_resid = 0;
+	if (!add_front) {
+		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
+		rb_key += cfqq->slice_resid;
+		cfqq->slice_resid = 0;
+	} else
+		rb_key = 0;
 
	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
		/*
@@ -516,13 +513,13 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
 /*
  * Update cfqq's position in the service tree.
  */
-static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted)
+static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
	/*
	 * Resorting requires the cfqq to be on the RR list already.
	 */
	if (cfq_cfqq_on_rr(cfqq))
-		cfq_service_tree_add(cfqq->cfqd, cfqq);
+		cfq_service_tree_add(cfqd, cfqq, 0);
 }
 
 /*
@@ -536,7 +533,7 @@ cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
	cfq_mark_cfqq_on_rr(cfqq);
	cfqd->busy_queues++;
 
-	cfq_resort_rr_list(cfqq, 0);
+	cfq_resort_rr_list(cfqd, cfqq);
 }
 
 /*
@@ -548,7 +545,6 @@ cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
	BUG_ON(!cfq_cfqq_on_rr(cfqq));
	cfq_clear_cfqq_on_rr(cfqq);
-	list_del_init(&cfqq->cfq_list);
 
	if (!RB_EMPTY_NODE(&cfqq->rb_node))
		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
@@ -765,7 +761,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
	if (timed_out && !cfq_cfqq_slice_new(cfqq))
		cfqq->slice_resid = cfqq->slice_end - jiffies;
 
-	cfq_resort_rr_list(cfqq, preempted);
+	cfq_resort_rr_list(cfqd, cfqq);
 
	if (cfqq == cfqd->active_queue)
		cfqd->active_queue = NULL;
@@ -793,31 +789,28 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, int preempted,
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = NULL;
+	struct cfq_queue *cfqq;
+	struct rb_node *n;
 
-	if (!list_empty(&cfqd->cur_rr)) {
-		/*
-		 * if current list is non-empty, grab first entry.
-		 */
-		cfqq = list_entry_cfqq(cfqd->cur_rr.next);
-	} else if (!RB_EMPTY_ROOT(&cfqd->service_tree.rb)) {
-		struct rb_node *n = cfq_rb_first(&cfqd->service_tree);
+	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
+		return NULL;
 
-		cfqq = rb_entry(n, struct cfq_queue, rb_node);
-		if (cfq_class_idle(cfqq)) {
-			unsigned long end;
+	n = cfq_rb_first(&cfqd->service_tree);
+	cfqq = rb_entry(n, struct cfq_queue, rb_node);
 
-			/*
-			 * if we have idle queues and no rt or be queues had
-			 * pending requests, either allow immediate service if
-

[PATCH 6/15] cfq-iosched: sort RT queues into the rbtree

2007-04-24 Thread Jens Axboe
Currently CFQ does a linked insert into the current list for RT
queues. We can just factor the class into the rb insertion,
and then we don't have to treat RT queues in a special way. It's
faster, too.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   27 ---
 1 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7f964ee..38ac492 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -471,7 +471,16 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
		parent = *p;
		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 
-		if (rb_key < __cfqq->rb_key)
+		/*
+		 * sort RT queues first, we always want to give
+		 * preference to them. after that, sort on the next
+		 * service time.
+		 */
+		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
+			p = &(*p)->rb_left;
+		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
+			p = &(*p)->rb_right;
+		else if (rb_key < __cfqq->rb_key)
			p = &(*p)->rb_left;
		else {
			p = &(*p)->rb_right;
@@ -490,7 +499,6 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
 static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted)
 {
	struct cfq_data *cfqd = cfqq->cfqd;
-	struct list_head *n;
 
	/*
	 * Resorting requires the cfqq to be on the RR list already.
@@ -500,25 +508,14 @@ static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted)
 
	list_del_init(&cfqq->cfq_list);
 
-	if (cfq_class_rt(cfqq)) {
-		/*
-		 * At to the front of the current list, but behind other
-		 * RT queues.
-		 */
-		n = &cfqd->cur_rr;
-		while (n->next != &cfqd->cur_rr)
-			if (!cfq_class_rt(cfqq))
-				break;
-
-		list_add(&cfqq->cfq_list, n);
-	} else if (cfq_class_idle(cfqq)) {
+	if (cfq_class_idle(cfqq)) {
		/*
		 * IDLE goes to the tail of the idle list
		 */
		list_add_tail(&cfqq->cfq_list, &cfqd->idle_rr);
	} else {
		/*
-		 * So we get here, ergo the queue is a regular best-effort queue
+		 * RT and BE queues, sort into the rbtree
		 */
		cfq_service_tree_add(cfqd, cfqq);
	}
-- 
1.5.1.1.190.g74474



Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi
  This is probably a
   reasonable thing to do but it doesn't feel like the right place.  I
   think get_dirty_limits should return the raw threshold, and
   balance_dirty_pages should do both tests - the bdi-local test and the
   system-wide test.
 
 Ok, that makes sense I guess.

Well, my narrow minded world view says it's not such a good idea,
because it would again introduce the deadlock scenario we're trying
to avoid.

In a sense allowing a queue to go over the global limit just a little
bit is a good thing.  Actually the very original code does that: if
writeback was started for write_chunk number of pages, then we allow
ratelimit (8) _new_ pages to be dirtied, effectively ignoring the
global limit.

That's why I've been saying that the current code is so unfair: if
there are lots of dirty pages to be written back to a particular
device, then balance_dirty_pages() allows the dirty producer to make
even more pages dirty, but if there are _no_ dirty pages for a device,
and we are over the limit, then that dirty producer is allowed
absolutely no new dirty pages until the global counts subside.

I'm still not quite sure what purpose the above soft limiting
serves.  It seems to just give an advantage to writers which managed
to accumulate lots of dirty pages, and can then convert that into
even more dirtyings.

Would it make sense to remove this behavior, and ensure that
balance_dirty_pages() doesn't return until the per-queue limits have
been complied with?

Miklos


[PATCH 7/15] cfq-iosched: sort IDLE queues into the rbtree

2007-04-24 Thread Jens Axboe
Same treatment as the RT conversion, just put the sorted idle
branch at the end of the tree.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   67 +++---
 1 files changed, 31 insertions(+), 36 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 38ac492..e6cc77f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -92,7 +92,6 @@ struct cfq_data {
	 */
	struct cfq_rb_root service_tree;
	struct list_head cur_rr;
-	struct list_head idle_rr;
	unsigned int busy_queues;
 
	/*
@@ -467,25 +466,33 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
 
	while (*p) {
		struct cfq_queue *__cfqq;
+		struct rb_node **n;
 
		parent = *p;
		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 
		/*
		 * sort RT queues first, we always want to give
-		 * preference to them. after that, sort on the next
-		 * service time.
+		 * preference to them. IDLE queues goes to the back.
+		 * after that, sort on the next service time.
		 */
		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			p = &(*p)->rb_left;
+			n = &(*p)->rb_left;
		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			p = &(*p)->rb_right;
+			n = &(*p)->rb_right;
+		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
+			n = &(*p)->rb_left;
+		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
+			n = &(*p)->rb_right;
		else if (rb_key < __cfqq->rb_key)
-			p = &(*p)->rb_left;
-		else {
-			p = &(*p)->rb_right;
+			n = &(*p)->rb_left;
+		else
+			n = &(*p)->rb_right;
+
+		if (n == &(*p)->rb_right)
			left = 0;
-		}
+
+		p = n;
	}
 
	if (left)
@@ -506,19 +513,7 @@ static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted)
	if (!cfq_cfqq_on_rr(cfqq))
		return;
 
-	list_del_init(&cfqq->cfq_list);
-
-	if (cfq_class_idle(cfqq)) {
-		/*
-		 * IDLE goes to the tail of the idle list
-		 */
-		list_add_tail(&cfqq->cfq_list, &cfqd->idle_rr);
-	} else {
-		/*
-		 * RT and BE queues, sort into the rbtree
-		 */
-		cfq_service_tree_add(cfqd, cfqq);
-	}
+	cfq_service_tree_add(cfqd, cfqq);
 }
 
 /*
@@ -791,20 +786,22 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
		cfqq = list_entry_cfqq(cfqd->cur_rr.next);
	} else if (!RB_EMPTY_ROOT(&cfqd->service_tree.rb)) {
		struct rb_node *n = cfq_rb_first(&cfqd->service_tree);
+		unsigned long end;
 
		cfqq = rb_entry(n, struct cfq_queue, rb_node);
-	} else if (!list_empty(&cfqd->idle_rr)) {
-		/*
-		 * if we have idle queues and no rt or be queues had pending
-		 * requests, either allow immediate service if the grace period
-		 * has passed or arm the idle grace timer
-		 */
-		unsigned long end = cfqd->last_end_request + CFQ_IDLE_GRACE;
-
-		if (time_after_eq(jiffies, end))
-			cfqq = list_entry_cfqq(cfqd->idle_rr.next);
-		else
-			mod_timer(&cfqd->idle_class_timer, end);
+		if (cfq_class_idle(cfqq)) {
+			/*
+			 * if we have idle queues and no rt or be queues had
+			 * pending requests, either allow immediate service if
+			 * the grace period has passed or arm the idle grace
+			 * timer
+			 */
+			end = cfqd->last_end_request + CFQ_IDLE_GRACE;
+			if (time_before(jiffies, end)) {
+				mod_timer(&cfqd->idle_class_timer, end);
+				cfqq = NULL;
+			}
+		}
	}
 
	return cfqq;
@@ -1068,7 +1065,6 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
	}
 
	dispatched += cfq_forced_dispatch_cfqqs(&cfqd->cur_rr);
-	dispatched += cfq_forced_dispatch_cfqqs(&cfqd->idle_rr);
 
	cfq_slice_expired(cfqd, 0, 0);
 
@@ -2047,7 +2043,6 @@ static void *cfq_init_queue(request_queue_t *q)
 
	cfqd->service_tree = CFQ_RB_ROOT;
	INIT_LIST_HEAD(&cfqd->cur_rr);
-	INIT_LIST_HEAD(&cfqd->idle_rr);
	INIT_LIST_HEAD(&cfqd->cic_list);
 
	cfqd->cfq_hash = kmalloc_node(sizeof(struct hlist_head) * CFQ_QHASH_ENTRIES, 

[PATCH 9/15] cfq-iosched: slice offset should take ioprio into account

2007-04-24 Thread Jens Axboe
Use the max_slice-cur_slice as the multiplier for the insertion offset.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f86ff4d..251131a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -442,7 +442,8 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
	/*
	 * just an approximation, should be ok.
	 */
-	return ((cfqd->busy_queues - 1) * cfq_prio_slice(cfqd, 1, 0));
+	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
+		cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
 
 /*
-- 
1.5.1.1.190.g74474



[PATCH 13/15] cfq-iosched: never allow an async queue idling

2007-04-24 Thread Jens Axboe
We don't enable it by default, don't let it get enabled during
runtime.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8f76aed..f920527 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1597,7 +1597,12 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
		       struct cfq_io_context *cic)
 {
-	int enable_idle = cfq_cfqq_idle_window(cfqq);
+	int enable_idle;
+
+	if (!cfq_cfqq_sync(cfqq))
+		return;
+
+	enable_idle = cfq_cfqq_idle_window(cfqq);
 
	if (!cic->ioc->task || !cfqd->cfq_slice_idle ||
	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-- 
1.5.1.1.190.g74474



Re: [REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Ingo Molnar

* Michael Gerdau [EMAIL PROTECTED] wrote:

  so to be totally 'fair' and get the same rescheduling 'granularity' 
  you should probably lower CFS's sched_granularity_ns to 2 msecs.
 
 I'll change default nice in cfs to -10.
 
 I'm also happy to adjust /proc/sys/kernel/sched_granularity_ns to 
 2msec. However checking /proc/sys/kernel/rr_interval reveals it is 16 
 (msec) on my system.

ah, yeah - that's due to the SMP rule in SD:

   rr_interval *= 1 + ilog2(num_online_cpus());

and you have a 2-CPU system, so you get 8msec*2 == 16 msecs default 
interval. I find this a neat solution and i have talked to Con about 
this already and i'll adopt Con's idea in CFS too. Nevertheless, despite 
the settings, SD seems to be rescheduling every 6-7 msecs, while CFS 
reschedules only every 13 msecs.
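
(Extending that arithmetic to other box sizes - my numbers, not from the
thread: ilog2(4) == 2 and ilog2(8) == 3, so a 4-way box would default to
8msec*3 == 24 msecs and an 8-way box to 8msec*4 == 32 msecs.)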

Here i'm assuming that the vmstats are directly comparable: that your 
number-crunchers behave the same during the full runtime - is that 
correct? (If not then the vmstat result should be run at roughly the 
same type of stage of the workload, on all the schedulers.)

Ingo


[PATCH 11/15] cfq-iosched: don't pass unused preemption variable around

2007-04-24 Thread Jens Axboe
We don't use it anymore in the slice expiry handling.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   28 +---
 1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2d0e9c5..b680002 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -746,7 +746,7 @@ __cfq_set_active_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  */
 static void
 __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int preempted, int timed_out)
+		    int timed_out)
 {
	if (cfq_cfqq_wait_request(cfqq))
		del_timer(&cfqd->idle_slice_timer);
@@ -755,8 +755,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
	cfq_clear_cfqq_wait_request(cfqq);
 
	/*
-	 * store what was left of this slice, if the queue idled out
-	 * or was preempted
+	 * store what was left of this slice, if the queue idled/timed out
	 */
	if (timed_out && !cfq_cfqq_slice_new(cfqq))
		cfqq->slice_resid = cfqq->slice_end - jiffies;
@@ -774,13 +773,12 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
	cfqd->dispatch_slice = 0;
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int preempted,
-				     int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
 {
	struct cfq_queue *cfqq = cfqd->active_queue;
 
	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, preempted, timed_out);
+		__cfq_slice_expired(cfqd, cfqq, timed_out);
 }
 
 /*
@@ -989,7 +987,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 }
 
 expire:
-	cfq_slice_expired(cfqd, 0, 0);
+	cfq_slice_expired(cfqd, 0);
 new_queue:
	cfqq = cfq_set_active_queue(cfqd);
 keep_queue:
@@ -1043,7 +1041,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
	    cfqd->dispatch_slice >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
	    cfq_class_idle(cfqq))) {
		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0, 0);
+		cfq_slice_expired(cfqd, 0);
	}
 
	return dispatched;
@@ -1077,7 +1075,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
	}
 
-	cfq_slice_expired(cfqd, 0, 0);
+	cfq_slice_expired(cfqd, 0);
 
	BUG_ON(cfqd->busy_queues);
 
@@ -1147,7 +1145,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0, 0);
+		__cfq_slice_expired(cfqd, cfqq, 0);
		cfq_schedule_dispatch(cfqd);
	}
 
@@ -1204,7 +1202,7 @@ static void cfq_free_io_context(struct io_context *ioc)
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0, 0);
+		__cfq_slice_expired(cfqd, cfqq, 0);
		cfq_schedule_dispatch(cfqd);
	}
 
@@ -1677,7 +1675,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_slice_expired(cfqd, 1, 1);
+	cfq_slice_expired(cfqd, 1);
 
	/*
	 * Put the new queue at the front of the of the current list,
@@ -1784,7 +1782,7 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq)
		cfq_clear_cfqq_slice_new(cfqq);
	}
	if (cfq_slice_used(cfqq))
-		cfq_slice_expired(cfqd, 0, 1);
+		cfq_slice_expired(cfqd, 1);
	else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
		cfq_arm_slice_timer(cfqd);
 }
@@ -1979,7 +1977,7 @@ static void cfq_idle_slice_timer(unsigned long data)
		}
	}
 expire:
-	cfq_slice_expired(cfqd, 0, timed_out);
+	cfq_slice_expired(cfqd, timed_out);
 out_kick:
	cfq_schedule_dispatch(cfqd);
 out_cont:
@@ -2025,7 +2023,7 @@ static void cfq_exit_queue(elevator_t *e)
	spin_lock_irq(q->queue_lock);
 
	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0, 0);
+		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
 
	while (!list_empty(&cfqd->cic_list)) {
		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
-- 
1.5.1.1.190.g74474



[PATCH 15/15] cfq-iosched: tighten queue request overlap condition

2007-04-24 Thread Jens Axboe
For tagged devices, allow overlap of requests if the idle window
isn't enabled on the current active queue.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 772df89..8093733 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -983,7 +983,8 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
	 * flight or is idling for a new request, allow either of these
	 * conditions to happen (or time out) before selecting a new queue.
	 */
-	if (cfqq->dispatched || timer_pending(&cfqd->idle_slice_timer)) {
+	if (timer_pending(&cfqd->idle_slice_timer) ||
+	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
		cfqq = NULL;
		goto keep_queue;
	}
-- 
1.5.1.1.190.g74474



Re: [patch 5/7] genhd: send async notification on media change

2007-04-24 Thread Tejun Heo
Kristen Carlson Accardi wrote:
 Send an uevent to user space to indicate that a media change event has 
 occurred.
 
 Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]
 
 Index: 2.6-git/block/genhd.c
 ===
 --- 2.6-git.orig/block/genhd.c
 +++ 2.6-git/block/genhd.c
 @@ -643,6 +643,25 @@ struct seq_operations diskstats_op = {
   .show   = diskstats_show
  };
  
 +static void media_change_notify_thread(struct work_struct *work)
 +{
 +	struct gendisk *gd = container_of(work, struct gendisk, async_notify);
 +	char event[] = "MEDIA_CHANGE=1";
 +	char *envp[] = { event, NULL };
 +
 +	/*
 +	 * set environment vars to indicate which event this is for
 +	 * so that user space will know to go check the media status.
 +	 */
 +	kobject_uevent_env(&gd->kobj, KOBJ_CHANGE, envp);
 +}
 +
 +void genhd_media_change_notify(struct gendisk *disk)
 +{
 +	schedule_work(&disk->async_notify);
 +}
 +EXPORT_SYMBOL_GPL(genhd_media_change_notify);

genhd might go away while the async_notify work is in flight.  You'll need
to either grab a reference or wait for the work to finish in the release
routine.

-- 
tejun
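
[Editor's aside: a sketch of the first option Tejun mentions - pinning the
disk for the lifetime of the work item (illustration against the patch
above, not tested code; a real version would also have to cope with
schedule_work() returning 0 when the work is already queued, to avoid
leaking the reference):

	void genhd_media_change_notify(struct gendisk *disk)
	{
		/* pin the kobject so the work item cannot outlive the disk */
		kobject_get(&disk->kobj);
		schedule_work(&disk->async_notify);
	}

	static void media_change_notify_thread(struct work_struct *work)
	{
		struct gendisk *gd = container_of(work, struct gendisk, async_notify);
		char event[] = "MEDIA_CHANGE=1";
		char *envp[] = { event, NULL };

		kobject_uevent_env(&gd->kobj, KOBJ_CHANGE, envp);
		kobject_put(&gd->kobj);	/* drop the reference taken at schedule time */
	}]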


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Rogan Dawes

Ingo Molnar wrote:

* Rogan Dawes [EMAIL PROTECTED] wrote:


	if (p_to && p->wait_runtime > 0) {
		p->wait_runtime >>= 1;
		p_to->wait_runtime += p->wait_runtime;
	}

the above is the basic expression of: charge a positive bank balance. 


[..]

[note, due to the nanoseconds unit there's no rounding loss to worry 
about.]

Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?


yes. But note that we'll only truly have to worry about that when we'll 
have context-switching performance in that range - currently it's at 
least 2-3 orders of magnitude above that. Microseconds seemed to me to 
be too coarse already, that's why i picked nanoseconds and 64-bit 
arithmetics for CFS.


Ingo


I guess my point was that if we somehow get to an odd number of
nanoseconds, we'd end up with rounding errors. I'm not sure if your
algorithm will ever allow that.


Rogan
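
[Editor's aside, to quantify the worry: the loss is bounded at a single
nanosecond per yield - e.g. 5 >> 1 == 2, so an odd balance drops one
nanosecond into the shift, several orders of magnitude below the cost of
the context switch itself, as Ingo notes above.]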


Re: [RFC][PATCH -mm take4 2/6] support multiple logging

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 17:14:28 +0900 Keiichi KII [EMAIL PROTECTED] wrote:

  On Fri, 20 Apr 2007 18:51:13 +0900
  Keiichi KII [EMAIL PROTECTED] wrote:
  
  I started to do some cleanups and fixups here, but abandoned it when it
  was all getting a bit large.
 
  Here are some fixes against this patch:
  I'm going to fix my patches by following your reviews and send new patches 
  on the LKML and the netdev ML in a few days.
 
  
  Well..  before you can finish this work we need to decide upon what the
  interface to userspace will be.
  
  - The miscdev isn't appropriate
  
 
 Why isn't miscdev appropriate? 
 We just shouldn't use miscdev for networking conventionally?
 

Yes it's rather odd, especially for networking.

What does the miscdev _do_ anyway?  Is it purely a target for the ioctls?


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 10:19 +0200, Miklos Szeredi wrote:
   This is probably a
reasonable thing to do but it doesn't feel like the right place.  I
think get_dirty_limits should return the raw threshold, and
balance_dirty_pages should do both tests - the bdi-local test and the
system-wide test.
  
  Ok, that makes sense I guess.
 
 Well, my narrow minded world view says it's not such a good idea,
 because it would again introduce the deadlock scenario, we're trying
 to avoid.

I was only referring to the placement of the clipping; and exactly where
that happens does not affect the deadlock.

 In a sense allowing a queue to go over the global limit just a little
 bit is a good thing.  Actually the very original code does that: if
 writeback was started for write_chunk number of pages, then we allow
 ratelimit (8) _new_ pages to be dirtied, effectively ignoring the
 global limit.

It might be time to get rid of that rate-limiting.
balance_dirty_pages()'s fast path is not nearly as heavy as it used to
be. All these fancy counter systems have removed quite a bit of
iteration from there.

 That's why I've been saying, that the current code is so unfair: if
 there are lots of dirty pages to be written back to a particular
 device, then balance_dirty_pages() allows the dirty producer to make
 even more pages dirty, but if there are _no_ dirty pages for a device,
 and we are over the limit, then that dirty producer is allowed
 absolutely no new dirty pages until the global counts subside.

Well, that got fixed on a per device basis with this patch; it is still
true for multiple tasks writing to the same device.

 I'm still not quite sure what purpose the above soft limiting
 serves.  It seems to just give advantage to writers, which managed to
 accumulate lots of dirty pages, and then can convert that into even
 more dirtyings.

The queues only limit the actual in-flight writeback pages;
balance_dirty_pages() considers all pages that might become writeback as
well as those that are.

 Would it make sense to remove this behavior, and ensure that
 balance_dirty_pages() doesn't return until the per-queue limits have
 been complied with?

I don't think that will help, balance_dirty_pages drives the queues.
That is, it converts pages from mere dirty to writeback.



Re: cpufreq default governor

2007-04-24 Thread Michal Piotrowski

Hi William,

On 24/04/07, William Heimbigner [EMAIL PROTECTED] wrote:

Question: is there some reason that kconfig does not allow for default
governors of conservative/ondemand/powersave?


Performance?


I'm not aware of any reason why one of those governors could not be used
as default.


My hardware doesn't work properly with the ondemand governor. I hear
strange noises when the frequency is changed.



William Heimbigner
[EMAIL PROTECTED]


Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)


Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 03:55:06PM +1000, Paul Mackerras wrote:
 Christoph Hellwig writes:
 
  The first question is obviously, is this really something we want?
  spawning kernel thread on demand without reaping them properly seems
  quite dangerous.
 
 What specifically has to be done to reap a kernel thread?  Are you
 concerned about the number of threads, or about having zombies hanging
 around?

I'm mostly concerned about the number of threads and possible leakage of
threads.  Linas already explained it's not a problem in this case,
so it's covered.


Re: [REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Michael Gerdau
 Here i'm assuming that the vmstats are directly comparable: that your 
 number-crunchers behave the same during the full runtime - is that 
 correct?

Yes, basically it does (disregarding small fluctuations)

I'll see whether I can produce some type of absolute performance
measure as well. Thinking about it I guess this should be fairly
simple to implement.

Best,
Michael
-- 
 Technosis GmbH, Geschäftsführer: Michael Gerdau, Tobias Dittmar
 Sitz Hamburg; HRB 89145 Amtsgericht Hamburg
 Vote against SPAM - see http://www.politik-digital.de/spam/
 Michael Gerdau   email: [EMAIL PROTECTED]
 GPG-keys available on request or at public keyserver




Re: [REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Ingo Molnar

* Michael Gerdau [EMAIL PROTECTED] wrote:

  Here i'm assuming that the vmstats are directly comparable: that 
  your number-crunchers behave the same during the full runtime - is 
  that correct?
 
 Yes, basically it does (disregarding small fluctuations)

ok, good.

 I'll see whether I can produce some type of absolute performance 
 measure as well. Thinking about it I guess this should be fairly 
 simple to implement.

oh, you are writing the number-cruncher? In general the 'best' 
performance metrics for scheduler validation are the ones where you have 
immediate feedback: i.e. some ops/sec (or ops per minute) value in some 
readily accessible place, or some milliseconds-per-100,000 ops type of 
metric - whichever lends itself better to the workload at hand. If you 
measure time then the best is to use long long and nanoseconds and the 
monotonic clocksource:

 unsigned long long rdclock(void)
 {
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);

	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
 }

(link to librt via -lrt to pick up clock_gettime())

The cost of a clock_gettime() (or of a gettimeofday()) can be a couple 
of microseconds on some systems, so it shouldn't be done too frequently.

Plus an absolute metric of "the whole workload took X.Y seconds" is 
useful too.

Ingo
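
[Editor's aside: one way to amortize the clock reads along the lines Ingo
suggests (my sketch, assuming his rdclock() above is linked in with -lrt;
do_one_op() is a made-up stand-in for the real work):

	#include <stdio.h>

	#define BATCH 100000		/* read the clock once per 100k ops */

	static volatile unsigned long long sink;
	static void do_one_op(void) { sink++; }	/* stands in for real work */

	static void crunch(void)
	{
		unsigned long long t0 = rdclock(), t1;
		unsigned long long ops = 0;

		for (;;) {
			do_one_op();
			if (++ops % BATCH)
				continue;
			t1 = rdclock();
			/* cumulative average; ns -> ops/sec */
			printf("%.0f ops/sec\n", ops * 1e9 / (double)(t1 - t0));
		}
	}]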


[PATCH -mm] utrace: fix double free re __rcu_process_callbacks()

2007-04-24 Thread Alexey Dobriyan
The following patch fixes a double free manifesting itself as a crash in
__rcu_process_callbacks():
http://marc.info/?l=linux-kernel&m=117518764517017&w=2
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229112

The problem is with check_dead_utrace() conditionally scheduling
struct utrace for freeing but not cleaning the struct task_struct::utrace
pointer, leaving it reachable:

	tsk->utrace_flags = flags;
	if (flags)
		spin_unlock(&utrace->lock);
	else
		rcu_utrace_free(utrace);

OTOH, utrace_release_task() first clears the ->utrace pointer, then frees
struct utrace itself.
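
[Editor's aside - to spell the race out (my reconstruction from the
description above, not a trace from the kernel):

	check_dead_utrace()                 release path
	---------------------------------   ---------------------------------
	rcu_utrace_free(utrace);
	/* tsk->utrace still points here */
	                                     utrace = tsk->utrace;
	                                     rcu_assign_pointer(tsk->utrace, NULL);
	                                     rcu_utrace_free(utrace);  /* 2nd free */]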

Roland inserted some debugging into 2.6.21-rc6-mm1 so that the aforementioned
double free couldn't be reproduced without seeing
BUG at kernel/utrace.c:176 first. It triggers if one struct utrace is
passed to rcu_utrace_free() a second time.

With the patch applied I no longer see¹ the BUG message or double frees on
2-way P3, 8-way ia64, and Core 2 Duo boxes. The testcase is at the first link.

I _think_ it adds a leak if utrace_reap() takes the branch without freeing
but, well, I hope Roland will give me some clue on how to fix it too.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 kernel/utrace.c |6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

¹ But I see a whole can of other bugs! I think they were already lurking
  but weren't easily reproducible without hitting the double free first.
  FWIW, it's
	BUG_ON(!list_empty(&tsk->ptracees));
	oops at the beginning of remove_engine()
	NULL ->report_quiesce call which is absent in ptrace utrace ops
	BUG_ON(tracehook_check_released(p));

--- a/kernel/utrace.c
+++ b/kernel/utrace.c
@@ -205,7 +205,6 @@ utrace_clear_tsk(struct task_struct *tsk
	if (utrace->u.live.signal == NULL) {
		task_lock(tsk);
		if (likely(tsk->utrace != NULL)) {
-			rcu_assign_pointer(tsk->utrace, NULL);
			tsk->utrace_flags = UTRACE_ACTION_NOREAP;
		}
		task_unlock(tsk);
@@ -305,10 +304,7 @@ check_dead_utrace(struct task_struct *ts
	}
 
	tsk->utrace_flags = flags;
-	if (flags)
-		spin_unlock(&utrace->lock);
-	else
-		rcu_utrace_free(utrace);
+	spin_unlock(&utrace->lock);
 
	/*
	 * Now we're finished updating the utrace state.



Re: [REPORT] cfs-v5 vs sd-0.46

2007-04-24 Thread Michael Gerdau
 oh, you are writing the number-cruncher?

Yep.

 In general the 'best'  
 performance metrics for scheduler validation are the ones where you have 
 immediate feedback: i.e. some ops/sec (or ops per minute) value in some 
 readily accessible place, or some milliseconds-per-100,000 ops type of 
 metric - whichever lends itself better to the workload at hand.

I'll have to see whether that works out. I don't have an easily
available ops/sec but I guess I could create something similar.

 If you  
 measure time then the best is to use long long and nanoseconds and the 
 monotonic clocksource:

[snip]
Thanks, I will implement that, for Linux anyway.

 Plus an absolute metric of the whole workload took X.Y seconds is 
 useful too.

That's the easiest to come by and is already available.

Best,
Michael
-- 
 Technosis GmbH, Geschäftsführer: Michael Gerdau, Tobias Dittmar
 Sitz Hamburg; HRB 89145 Amtsgericht Hamburg
 Vote against SPAM - see http://www.politik-digital.de/spam/
 Michael Gerdau   email: [EMAIL PROTECTED]
 GPG-keys available on request or at public keyserver




Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi
This is probably a
 reasonable thing to do but it doesn't feel like the right place.  I
 think get_dirty_limits should return the raw threshold, and
 balance_dirty_pages should do both tests - the bdi-local test and the
 system-wide test.
   
   Ok, that makes sense I guess.
  
  Well, my narrow minded world view says it's not such a good idea,
  because it would again introduce the deadlock scenario, we're trying
  to avoid.
 
 I was only referring to the placement of the clipping; and exactly where
 that happens does not affect the deadlock.

OK.

  In a sense allowing a queue to go over the global limit just a little
  bit is a good thing.  Actually the very original code does that: if
  writeback was started for write_chunk number of pages, then we allow
  ratelimit (8) _new_ pages to be dirtied, effectively ignoring the
  global limit.
 
 It might be time to get rid of that rate-limiting.
 balance_dirty_pages()'s fast path is not nearly as heavy as it used to
 be. All these fancy counter systems have removed quite a bit of
 iteration from there.

Hmm.  The rate limiting probably makes lots of sense for
dirty_exceeded==0, when ratelimit can be a nice large value.

For dirty_exceeded==1 it may make sense to disable ratelimiting; OTOH
having a granularity of 8 pages probably doesn't matter, because the
granularity of the percpu counter is usually larger (except on UP).

  That's why I've been saying, that the current code is so unfair: if
  there are lots of dirty pages to be written back to a particular
  device, then balance_dirty_pages() allows the dirty producer to make
  even more pages dirty, but if there are _no_ dirty pages for a device,
  and we are over the limit, then that dirty producer is allowed
  absolutely no new dirty pages until the global counts subside.
 
 Well, that got fixed on a per device basis with this patch, it is still
 true for multiple tasks writing to the same device.

Yes, this is the part of this patchset I'm personally interested in ;)

  I'm still not quite sure what purpose the above soft limiting
  serves.  It seems to just give advantage to writers, which managed to
  accumulate lots of dirty pages, and then can convert that into even
  more dirtyings.
 
 The queues only limit the actual in-flight writeback pages,
 balance_dirty_pages() considers all pages that might become writeback as
 well as those that are.
 
  Would it make sense to remove this behavior, and ensure that
  balance_dirty_pages() doesn't return until the per-queue limits have
  been complied with?
 
 I don't think that will help, balance_dirty_pages drives the queues.
 That is, it converts pages from mere dirty to writeback.

Yes.  But the current logic says that if you convert write_chunk dirty
pages to writeback, you are allowed to dirty ratelimit more.

D: number of dirty pages
W: number of writeback pages
L: global limit
C: write_chunk = ratelimit_pages * 1.5
R: ratelimit

If D+W >= L, then R = 8

Let's assume that D == L and W == 0, and that all of the dirty pages
belong to a single device.  Also for simplicity, let's assume an
infinite-length queue and a slow device.

Then while converting the dirty pages to writeback, D / C * R new
dirty pages can be created.  So when all existing dirty pages have been
converted:

  D = L / C * R
  W = L

  D + W = L * (1 + R / C)

So we see that we're now even further above the limit than before the
conversion.  This means that we starve writers to other devices, which
don't have as many dirty pages: until the slow device finishes these
writes, they will not get to do anything.
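
(To put rough numbers on that - my own illustration, not from the thread:
with ratelimit_pages at its usual 1024-page cap, C is about 1536, so with
R = 8 each full conversion cycle overshoots by R/C, about 0.5% of L.
Small per cycle, but it never lets the hog's dirty+writeback total drop
below L while it keeps producing.)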

Your patch helps this in that if the other writers have an empty queue
and no dirty pages, they will be allowed to slowly start writing.  But
they will not gain their full share until the slow dirty-hog goes below
the global limit, which may take some time.

So I think the logical thing to do is: if the dirty-hog is over its
queue limit, don't let it dirty any more until its dirty+writeback
counts go below the limit.  That allows other devices to more quickly
gain their share of dirty pages.

Miklos


RE: sendfile to nonblocking socket

2007-04-24 Thread David Schwartz

 David Schwartz writes:
  You have a misunderstanding about the semantics of 'sendfile'. 
 The 'sendfile' function is just a more efficient version of a 
 read followed by a write. If you did a read followed by a write, 
 it would block as well (in the read).
 
  DS

 sendfile function is not just a more efficient version of a read 
 followed by a write.  It reads from one fd and writes to another at the 
 same time. Please try to read 2G and then write 2G - see how much 
 memory you will need and how much time you will lose while reading 
 2G from disk but not writing it to the socket.

You are correct. What I meant to say was that it's just a more efficient 
version of 'mmap'ing a file and then 'write'ing from the 'mmap'. The 'write' to 
a non-blocking socket can still 'block' on disk I/O.

 If you know a more efficient method to transfer a file from disk to 
 network - please advise. Now all I want is a really non-blocking 
 sendfile. Currently sendfile is non-blocking on the network, but not 
 on disk i/o. And when I have a network faster than the disk - I get 
 blocked.

There are many different techniques and which is correct depends on what 
direction you want to go. POSIX asynchronous I/O is one possibility. Threads 
plus epoll is another. It really depends upon how much performance you need, 
how much complexity you can tolerate, and how portable you need to be.

DS
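
[Editor's aside - the usual non-blocking sendfile() loop, to make the
behaviour under discussion concrete (my sketch, not from the thread):

	#include <sys/types.h>
	#include <sys/sendfile.h>
	#include <errno.h>

	/* returns 1 when done, 0 to wait for EPOLLOUT, -1 on error */
	static int push_file(int sock, int fd, off_t *off, size_t *left)
	{
		while (*left > 0) {
			ssize_t n = sendfile(sock, fd, off, *left);
			if (n > 0) {
				*left -= n;	/* 'off' was advanced by the kernel */
				continue;
			}
			if (n < 0 && errno == EAGAIN)
				return 0;	/* socket buffer full: poll, retry */
			if (n < 0 && errno == EINTR)
				continue;
			return -1;
		}
		return 1;
	}

Note that EAGAIN only reflects socket-buffer pressure; as discussed above,
the call can still sleep on disk I/O while the file data is read into the
page cache - the very problem of this thread.]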




Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 11:14 +0200, Miklos Szeredi wrote:

   I'm still not quite sure what purpose the above soft limiting
   serves.  It seems to just give advantage to writers, which managed to
   accumulate lots of dirty pages, and then can convert that into even
   more dirtyings.
  
  The queues only limit the actual in-flight writeback pages,
  balance_dirty_pages() considers all pages that might become writeback as
  well as those that are.
  
   Would it make sense to remove this behavior, and ensure that
   balance_dirty_pages() doesn't return until the per-queue limits have
   been complied with?
  
  I don't think that will help, balance_dirty_pages drives the queues.
  That is, it converts pages from mere dirty to writeback.
 
 Yes.  But current logic says, that if you convert write_chunk dirty
 to writeback, you are allowed to dirty ratelimit more. 
 
 D: number of dirty pages
 W: number of writeback pages
 L: global limit
 C: write_chunk = ratelimit_pages * 1.5
 R: ratelimit
 
 If D+W >= L, then R = 8
 
 Let's assume, that D == L and W == 0.  And that all of the dirty pages
 belong to a single device.  Also for simplicity, lets assume an
 infinite length queue, and a slow device.
 
 Then while converting the dirty pages to writeback, D / C * R new
 dirty pages can be created.  So when all existing dirty have been
 converted:
 
   D = L / C * R
   W = L
 
   D + W = L * (1 + R / C)
 
 So we see, that we're now even more above the limit than before the
 conversion.  This means, that we starve writers to other devices,
 which don't have as many dirty pages, because until the slow device
 doesn't finish these writes they will not get to do anything.
 
 Your patch helps this in that if the other writers have an empty queue
 and no dirty, they will be allowed to slowly start writing.  But they
 will not gain their full share until the slow dirty-hog goes below the
 global limit, which may take some time.
 
 So I think the logical thing to do, is if the dirty-hog is over it's
 queue limit, don't let it dirty any more until it's dirty+writeback go
 below the limit.  That allowes other devices to more quickly gain
 their share of dirty pages.

Ahh, now I see; I had totally blocked out these few lines:

		pages_written += write_chunk - wbc.nr_to_write;
		if (pages_written >= write_chunk)
			break;	/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I
think he agrees.

Shall we just kill those?
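
[Editor's aside - expressed as a change, the proposal would roughly be
(my sketch against 2.6.21-era mm/page-writeback.c, for concreteness):

	--- a/mm/page-writeback.c
	+++ b/mm/page-writeback.c
	@@ balance_dirty_pages()
	-		pages_written += write_chunk - wbc.nr_to_write;
	-		if (pages_written >= write_chunk)
	-			break;		/* We've done our duty */

i.e. keep looping until the dirty thresholds are actually met, rather than
bailing out after write_chunk pages have been pushed.]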



Re: [1/3] 2.6.21-rc7: known regressions (v2)

2007-04-24 Thread Wolfgang Erig
On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
 On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:
  This email lists some known regressions in Linus' tree compared to 2.6.20.
  
  If you find your name in the Cc header, you are either submitter of one
  of the bugs, maintainer of an affectected subsystem or driver, a patch
  of you caused a breakage or I'm considering you in any other way
  possibly involved with one or more of these issues.
  
  Due to the huge amount of recipients, please trim the Cc when answering.
  
  
  Subject: gammu no longer works
  References : http://lkml.org/lkml/2007/4/20/84
  Submitter  : Wolfgang Erig [EMAIL PROTECTED]
  Status : unknown
 
 I've asked for more information about this, and so far am not sure it's
 a real problem.

It is a real problem for me.
I tried this on 2 different boxes with the same behaviour.
No sync between my Nokia mobile and Linux with the latest kernel :(

Which additional information is useful for this problem?

Wolfgang


$ gammu textall --backup backup
Press Ctrl+C to break...
[Gammu- 1.10.0 built 10:15:07 Mar 13 2007 in gcc 4.1]
[Connection   - fbuspl2303]
[Model type   - 3100]
[Device   - /dev/ttyUSB0]
[Run on   - Linux, kernel 2.6.21-rc7-g80d74d51 (#9 SMP Wed Apr 18 
21:41:41 CEST 2007)]
[Module   - 
1100|1100a|1100b|2650|3100|3100b|3105|3108|3200|3200a|3205|3220|3300|3510|3510i|3530|3589i|3590|3595|5100|5140|5140i|6020|6021|6030|6100|6101|6103|6111|6125|6131|6170|6200|6220|6230|6230i|6233|6234|6270|6280|6310|6310i|6385|6510|6610|6610i|6800|6810|6820|6822|7200|7210|7250|7250i|7260|7270|7360|7370|7600|8310|8390|8910|8910i]
Setting speed to 19200
I/O possible



Re: [PATCH 8/8] Per-container pages reclamation

2007-04-24 Thread Balbir Singh

Pavel Emelianov wrote:

Implement try_to_free_pages_in_container() to free the
pages in container that has run out of memory.

The scan_control-isolate_pages() function isolates the
container pages only.



Pavel,

I've just started playing around with these patches; I preferred
the approach of v1. Please see below.


+static unsigned long isolate_container_pages(unsigned long nr_to_scan,
+		struct list_head *src, struct list_head *dst,
+		unsigned long *scanned, struct zone *zone)
+{
+	unsigned long nr_taken = 0;
+	struct page *page;
+	struct page_container *pc;
+	unsigned long scan;
+	LIST_HEAD(pc_list);
+
+	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+		pc = list_entry(src->prev, struct page_container, list);
+		page = pc->page;
+		if (page_zone(page) != zone)
+			continue;


shrink_zone() will walk all pages looking for pages belonging to this
container, and this slows down the reclaim quite a bit. Although we've
reused code, we've ended up walking the entire list of the zone to
find pages belonging to a particular container, which was the same
problem I had with my RSS controller patches.


+
+		list_move(&pc->list, &pc_list);
+



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi
 Ahh, now I see; I had totally blocked out these few lines:
 
	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;	/* We've done our duty */
 
 yeah, those look dubious indeed... And reading back Neil's comments, I
 think he agrees.
 
 Shall we just kill those?

I think we should.

Although I'm a little afraid that Akpm will tell me again that I'm a
stupid git, and that those lines are in fact vitally important ;)

Miklos


Re: [Devel] [PATCH -mm] utrace: fix double free re __rcu_process_callbacks()

2007-04-24 Thread Kirill Korotaev
Roland,

can you please help with it?
The current utrace state is far from stable;
RHEL5 and -mm kernels can be quite easily crashed with some of the exploits
we have collected so far.
Alexey can help you with any information needed - call traces, test cases -
but without your help we can't fix it all ourselves :/

Thanks,
Kirill

Alexey Dobriyan wrote:
 The following patch fixes a double free manifesting itself as a crash in
 __rcu_process_callbacks():
 http://marc.info/?l=linux-kernelm=117518764517017w=2
 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229112
 
 The problem is with check_dead_utrace() conditionally scheduling the
 struct utrace for freeing but not clearing the struct task_struct::utrace
 pointer, leaving it reachable:
 
 	tsk->utrace_flags = flags;
 	if (flags)
 		spin_unlock(&utrace->lock);
 	else
 		rcu_utrace_free(utrace);
 
 OTOH, utrace_release_task() first clears the ->utrace pointer, then frees
 the struct utrace itself:
 
 Roland inserted some debugging into 2.6.21-rc6-mm1 so that the aforementioned
 double free couldn't be reproduced without seeing the
 BUG at kernel/utrace.c:176 first. It triggers if one struct utrace is
 passed to rcu_utrace_free() a second time.
 
 With the patch applied I no longer see¹ the BUG message or double frees on
 2-way P3, 8-way ia64, and Core 2 Duo boxes. The testcase is at the first link.
 
 I _think_ it adds a leak if utrace_reap() takes the branch without freeing,
 but, well, I hope Roland will give me some clue on how to fix that too.
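 
 Boiled down, it is the classic unpublish/free ordering bug (a sketch,
 not the literal utrace code):
 
 	/* check_dead_utrace(), before the patch */
 	rcu_utrace_free(utrace);	/* queued for freeing, but	  */
 					/* tsk->utrace still points at it */
 
 	/* utrace_release_task(), later */
 	utrace = tsk->utrace;		/* stale but non-NULL		  */
 	rcu_utrace_free(utrace);	/* second free, corrupting the	  */
 					/* RCU callback list		  */
 
 The pointer has to be unpublished before the object is queued for
 freeing; the patch below side-steps the problem by not freeing there
 at all.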
 
 Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
 ---
 
  kernel/utrace.c |6 +-
  1 file changed, 1 insertion(+), 5 deletions(-)
 
 ¹ But I see a whole can of other bugs! I think they were already lurking
   but weren't easily reproducible without hitting the double free first.
   FWIW, it's
   BUG_ON(!list_empty(&tsk->ptracees));
   oops at the beginning of remove_engine()
   NULL ->report_quiesce call which is absent in the ptrace utrace ops
   BUG_ON(tracehook_check_released(p));
 
 --- a/kernel/utrace.c
 +++ b/kernel/utrace.c
 @@ -205,7 +205,6 @@ utrace_clear_tsk(struct task_struct *tsk
 	if (utrace->u.live.signal == NULL) {
 		task_lock(tsk);
 		if (likely(tsk->utrace != NULL)) {
 -			rcu_assign_pointer(tsk->utrace, NULL);
 			tsk->utrace_flags &= UTRACE_ACTION_NOREAP;
 		}
 		task_unlock(tsk);
 @@ -305,10 +304,7 @@ check_dead_utrace(struct task_struct *ts
   }
  
 	tsk->utrace_flags = flags;
 -	if (flags)
 -		spin_unlock(&utrace->lock);
 -	else
 -		rcu_utrace_free(utrace);
 +	spin_unlock(&utrace->lock);
  
   /*
* Now we're finished updating the utrace state.
 
 ___
 Devel mailing list
 [EMAIL PROTECTED]
 https://openvz.org/mailman/listinfo/devel
 



Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-24 Thread Yasunori Goto


 Subject: Check zone boundaries when freeing bootmem
 Zone boundaries do not have to be aligned to MAX_ORDER_NR_PAGES. 

Hmm. I don't understand this yet... Could you explain more?

This issue occurs only when ZONE_MOVABLE is specified.
If its boundary is aligned to MAX_ORDER automatically,
I guess users will not mind it.

From the memory hotplug view, I prefer section size alignment to keep
the code simple. :-P


 However,
 during boot, there is an implicit assumption that they are aligned to a
 BITS_PER_LONG boundary when freeing pages as quickly as possible. This
 patch checks the zone boundaries when freeing pages from the bootmem 
 allocator.

Anyway, the patch works well.

Bye.

-- 
Yasunori Goto 




[1/2] w1: allow bus master to have reset and byte ops.

2007-04-24 Thread Evgeniy Polyakov
Signed-off-by: Matt Reimer [EMAIL PROTECTED]
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

---
 drivers/w1/w1_int.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/w1/w1_int.c b/drivers/w1/w1_int.c
index 357a2e0..258defd 100644
--- a/drivers/w1/w1_int.c
+++ b/drivers/w1/w1_int.c
@@ -100,7 +100,8 @@ int w1_add_master_device(struct w1_bus_master *master)
 
 /* validate minimum functionality */
 	if (!(master->touch_bit && master->reset_bus) &&
-	    !(master->write_bit && master->read_bit)) {
+	    !(master->write_bit && master->read_bit) &&
+	    !(master->write_byte && master->read_byte && master->reset_bus)) {
 		printk(KERN_ERR "w1_add_master_device: invalid function set\n");
 		return(-EINVAL);
 	}
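
With this in place a bus master that only implements byte-level I/O can
register; roughly (the fields are from struct w1_bus_master, the handler
functions are hypothetical):

	static struct w1_bus_master my_master = {
		.read_byte  = my_read_byte,
		.write_byte = my_write_byte,
		.reset_bus  = my_reset_bus,
	};

	err = w1_add_master_device(&my_master);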

-- 
Evgeniy Polyakov


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Andrew Morton
On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

  Ahh, now I see; I had totally blocked out these few lines:
  
  pages_written += write_chunk - wbc.nr_to_write;
  if (pages_written >= write_chunk)
  break;  /* We've done our duty */
  
  yeah, those look dubious indeed... And reading back Neil's comments, I
  think he agrees.
  
  Shall we just kill those?
 
 I think we should.
 
 Although I'm a little afraid that Akpm will tell me again that I'm a
 stupid git, and that those lines are in fact vitally important ;)
 

It depends what they're replaced with.

That code is there, iirc, to prevent a process from getting stuck in
balance_dirty_pages() forever due to the dirtying activity of other
processes.

hm, we ask the process to write write_chunk pages each go around the loop.
So if it wrote write_chunk/2 pages on the first pass it might end up writing
write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
if it does happen - the upper bound is write_chunk*2-1, I think.
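
(Worked through with made-up numbers: write_chunk = 12.  The first pass
completes 11 pages, 11 < 12, so no break; the second pass can complete
all 12, giving pages_written = 23 = 12*2-1 before the break fires.)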


[2/2] Driver for the Maxim DS1WM, a 1-wire bus master ASIC core.

2007-04-24 Thread Evgeniy Polyakov
Signed-off-by: Matt Reimer [EMAIL PROTECTED]
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

---
 drivers/w1/masters/Kconfig  |8 +
 drivers/w1/masters/Makefile |2 +-
 drivers/w1/masters/ds1wm.c  |  463 +++
 include/linux/ds1wm.h   |   13 ++
 4 files changed, 485 insertions(+), 1 deletions(-)
 create mode 100644 drivers/w1/masters/ds1wm.c
 create mode 100644 include/linux/ds1wm.h

diff --git a/drivers/w1/masters/Kconfig b/drivers/w1/masters/Kconfig
index 2fb4255..ca44f9e 100644
--- a/drivers/w1/masters/Kconfig
+++ b/drivers/w1/masters/Kconfig
@@ -35,5 +35,13 @@ config W1_MASTER_DS2482
  This driver can also be built as a module.  If so, the module
  will be called ds2482.
 
+config W1_DS1WM
+	tristate "Maxim DS1WM 1-wire busmaster"
+   depends on W1
+   help
+ Say Y here to enable the DS1WM 1-wire driver, such as that
+ in HP iPAQ devices like h5xxx, h2200, and ASIC3-based like
+ hx4700.
+
 endmenu
 
diff --git a/drivers/w1/masters/Makefile b/drivers/w1/masters/Makefile
index 4cee256..a9e45fb 100644
--- a/drivers/w1/masters/Makefile
+++ b/drivers/w1/masters/Makefile
@@ -5,4 +5,4 @@
 obj-$(CONFIG_W1_MASTER_MATROX) += matrox_w1.o
 obj-$(CONFIG_W1_MASTER_DS2490) += ds2490.o
 obj-$(CONFIG_W1_MASTER_DS2482) += ds2482.o
-
+obj-$(CONFIG_W1_DS1WM)  += ds1wm.o
diff --git a/drivers/w1/masters/ds1wm.c b/drivers/w1/masters/ds1wm.c
new file mode 100644
index 000..cea74e1
--- /dev/null
+++ b/drivers/w1/masters/ds1wm.c
@@ -0,0 +1,463 @@
+/*
+ * 1-wire busmaster driver for DS1WM and ASICs with embedded DS1WMs
+ * such as HP iPAQs (including h5xxx, h2200, and devices with ASIC3
+ * like hx4700).
+ *
+ * Copyright (c) 2004-2005, Szabolcs Gyurko [EMAIL PROTECTED]
+ * Copyright (c) 2004-2007, Matt Reimer [EMAIL PROTECTED]
+ *
+ * Use consistent with the GNU GPL is permitted,
+ * provided that this copyright notice is
+ * preserved in its entirety in all copies and derived works.
+ */
+
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/pm.h>
+#include <linux/platform_device.h>
+#include <linux/clk.h>
+#include <linux/delay.h>
+#include <linux/ds1wm.h>
+
+#include <asm/io.h>
+
+#include "../w1.h"
+#include "../w1_int.h"
+
+
+#define DS1WM_CMD	0x00	/* R/W 4 bits command */
+#define DS1WM_DATA	0x01	/* R/W 8 bits, transmit/receive buffer */
+#define DS1WM_INT	0x02	/* R/W interrupt status */
+#define DS1WM_INT_EN	0x03	/* R/W interrupt enable */
+#define DS1WM_CLKDIV	0x04	/* R/W 5 bits of divisor and pre-scale */
+
+#define DS1WM_CMD_1W_RESET  1 << 0	/* force reset on 1-wire bus */
+#define DS1WM_CMD_SRA	    1 << 1	/* enable Search ROM accelerator mode */
+#define DS1WM_CMD_DQ_OUTPUT 1 << 2	/* write only - forces bus low */
+#define DS1WM_CMD_DQ_INPUT  1 << 3	/* read only - reflects state of bus */
+
+#define DS1WM_INT_PD	    1 << 0	/* presence detect */
+#define DS1WM_INT_PDR	    1 << 1	/* presence detect result */
+#define DS1WM_INT_TBE	    1 << 2	/* tx buffer empty */
+#define DS1WM_INT_TSRE	    1 << 3	/* tx shift register empty */
+#define DS1WM_INT_RBF	    1 << 4	/* rx buffer full */
+#define DS1WM_INT_RSRF	    1 << 5	/* rx shift register full */
+
+#define DS1WM_INTEN_EPD	    1 << 0	/* enable presence detect int */
+#define DS1WM_INTEN_IAS	    1 << 1	/* INTR active state */
+#define DS1WM_INTEN_ETBE    1 << 2	/* enable tx buffer empty int */
+#define DS1WM_INTEN_ETMT    1 << 3	/* enable tx shift register empty int */
+#define DS1WM_INTEN_ERBF    1 << 4	/* enable rx buffer full int */
+#define DS1WM_INTEN_ERSRF   1 << 5	/* enable rx shift register full int */
+#define DS1WM_INTEN_DQO	    1 << 6	/* enable direct bus driving ops
+					   (undocumented), Szabolcs Gyurko */
+
+
+#define DS1WM_TIMEOUT (HZ * 5)
+
+static struct {
+   unsigned long freq;
+   unsigned long divisor;
+} freq[] = {
+   { 400, 0x8 },
+   { 500, 0x2 },
+   { 600, 0x5 },
+   { 700, 0x3 },
+   { 800, 0xc },
+   { 1000, 0x6 },
+   { 1200, 0x9 },
+   { 1400, 0x7 },
+   { 1600, 0x10 },
+   { 2000, 0xa },
+   { 2400, 0xd },
+   { 2800, 0xb },
+   { 3200, 0x14 },
+   { 4000, 0xe },
+   { 4800, 0x11 },
+   { 5600, 0xf },
+   { 6400, 0x18 },
+   { 8000, 0x12 },
+   { 9600, 0x15 },
+   { 11200, 0x13 },
+   { 12800, 0x1c },
+};
+
+struct ds1wm_data {
+	void		*map;
+	int		bus_shift; /* # of shifts to calc register offsets */
+	struct platform_device *pdev;
+	struct ds1wm_platform_data *pdata;
+	int		irq;
+	struct clk	*clk;
+	int		slave_present;
+	void		*reset_complete;
+	void

Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-24 Thread Mel Gorman

On Tue, 24 Apr 2007, Yasunori Goto wrote:





Subject: Check zone boundaries when freeing bootmem
Zone boundaries do not have to be aligned to MAX_ORDER_NR_PAGES.


Hmm. I don't understand this yet... Could you explain more?



Nodes are required to be MAX_ORDER_NR_PAGES aligned for the buddy 
algorithm to work, but zones can be at any alignment because 
page_is_buddy() checks the zone_id of the two buddies when merging. As 
zones are generally aligned anyway, it was never noticed that the bootmem 
allocator assumes zones are at least order-5 aligned on 32 bit and 
order-6 aligned on 64 bit.
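
The assumption comes from the word-at-a-time fast path that frees a whole
bitmap word of pages at once; paraphrased from free_all_bootmem_core(),
so treat it as a sketch:

	if (v == ~0UL) {	/* all BITS_PER_LONG pages are free */
		int order = ffs(BITS_PER_LONG) - 1;	/* 5 on 32 bit, 6 on 64 bit */
		__free_pages_bootmem(page, order);
	}

A zone boundary falling inside such a word gets freed as one high-order
block straddling the boundary.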



This issue occurs only when ZONE_MOVABLE is specified.


Yes, because it can be sized to any value. At the moment, zones are 
aligned to MAX_ORDER_NR_PAGES so it was not noticed that bootmem makes 
assumptions on zone alignment.



If its boundary is aligned to MAX_ORDER automatically,
I guess users will not mind it.



Probably not. They will get a different amount of memory usable by the 
kernel than they asked for but it doesn't really matter. Huge pages 
generally need MAX_ORDER_NR_PAGES base pages as well so the alignment 
doesn't hurt there.



From the memory hotplug view, I prefer section size alignment to keep
the code simple. :-P



That's fair. I'll roll up a patch that aligns to MAX_ORDER_NR_PAGES to 
begin with and then decide if it should align to section size on SPARSEMEM 
or not.





However,
during boot, there is an implicit assumption that they are aligned to a
BITS_PER_LONG boundary when freeing pages as quickly as possible. This
patch checks the zone boundaries when freeing pages from the bootmem allocator.


Anyway, the patch works well.



Right, I'll resend it to linux-mm as a standalone patch later because 
it fixes a correctness issue, albeit one that is easily avoided.



Bye.



Thanks

--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [PATCH] kthread: Enhance kthread_stop to abort interruptible sleeps

2007-04-24 Thread Andrew Morton
On Fri, 13 Apr 2007 21:13:13 -0600 [EMAIL PROTECTED] (Eric W. Biederman) wrote:

 This patch reworks kthread_stop so it is more flexible and it causes
 the target kthread to abort interruptible sleeps, allowing a larger
 class of kernel threads to use the kthread API.
 
 The changes start by defining TIF_KTHREAD_STOP on all architectures.
 TIF_KTHREAD_STOP is a per process flag that I can set from another
 process to indicate that a kernel thread should stop.
 
 wake_up_process in kthread_stop has been replaced by signal_wake_up,
 ensuring that the kernel thread, if sleeping, is woken up in a timely
 manner with TIF_SIGNAL_PENDING set, which causes us to break out
 of interruptible sleeps.
 
 recalc_signal_pending was modified to keep TIF_SIGNAL_PENDING set for
 as long as TIF_KTHREAD_STOP is set.
 
 Arbitrary paths to do_exit are now allowed.  I have placed a
 completion on the thread stack and pointed vfork_done at it; when
 mm_release is called from do_exit, the completion will be completed.
 Since the completion is stored on the stack it is important that
 kthread() now calls do_exit ensuring the stack frame that holds the
 completion is never released, and so that our exit_code is certain to
 make it unchanged all the way to do_exit.
 
 To allow kthread_stop to read the process exit code when exit_mm wakes
 it up, I have moved the setting of exit_code to the beginning of
 do_exit.
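 
 For illustration, a kthread under this scheme needs no signal plumbing
 of its own to be stoppable mid-sleep (kthread_should_stop() and
 wait_event_interruptible() are the existing APIs; the loop body is a
 made-up example):
 
 	static int my_kthread(void *unused)
 	{
 		while (!kthread_should_stop()) {
 			/* kthread_stop() now sets TIF_SIGNAL_PENDING,
 			   so this sleep is broken immediately */
 			wait_event_interruptible(my_waitq,
 					my_work_ready || kthread_should_stop());
 			if (my_work_ready)
 				my_do_work();
 		}
 		return 0;	/* read back by kthread_stop() */
 	}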

This patch causes this oops: http://userweb.kernel.org/~akpm/s5000508.jpg
with this config: http://userweb.kernel.org/~akpm/config-x.txt


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 03:00 -0700, Andrew Morton wrote:
 On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
   Ahh, now I see; I had totally blocked out these few lines:
   
 pages_written += write_chunk - wbc.nr_to_write;
  if (pages_written >= write_chunk)
 break;  /* We've done our duty */
   
   yeah, those look dubious indeed... And reading back Neil's comments, I
   think he agrees.
   
   Shall we just kill those?
  
  I think we should.
  
  Although I'm a little afraid that Akpm will tell me again that I'm a
  stupid git, and that those lines are in fact vitally important ;)
  
 
 It depends what they're replaced with.
 
 That code is there, iirc, to prevent a process from getting stuck in
 balance_dirty_pages() forever due to the dirtying activity of other
 processes.
 
 hm, we ask the process to write write_chunk pages each go around the loop.
 So if it wrote write_chunk/2 pages on the first pass it might end up writing
 write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
 if it does happen - the upper bound is write_chunk*2-1, I think.

Right, but I think the problem is that it's dirty -> writeback, not dirty
-> writeback completed.

Ie. they don't guarantee progress, it could be that the total
nr_reclaimable + nr_writeback will steadily increase due to this break.

How about ensuring that vm_writeout_total increases by at least
2*sync_writeback_pages() during our stay in balance_dirty_pages()? That
way we have the guarantee that more pages get written out than can be
dirtied.
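
As a sketch of the idea (vm_writeout_total being the counter proposed
above, not an existing symbol):

	unsigned long start = vm_writeout_total;

	for (;;) {
		/* ... writeback_inodes(&wbc) etc. ... */
		if (vm_writeout_total - start >= 2 * sync_writeback_pages())
			break;
	}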



Re: [PATCH] mm: PageLRU can be non-atomic bit operation

2007-04-24 Thread Nick Piggin

Hisashi Hifumi wrote:


At 11:47 07/04/24, Nick Piggin wrote:

 As Hugh points out, we must have atomic ops here, so changing the generic
 code to use the __ version is wrong. However if there is a faster way that
 i386 can perform the atomic variant, then doing so will speed up the
 generic code without breaking other architectures.
 

Do you mean writing a page-flags.h specific to i386, improving the
generic code without breaking other architectures?


I meant improving the i386-specific bitops code.

However if there is some variant of operation that is not captured
with the current bitop API, but could provide a useful speedup of
common page flag manipulations, then you might consider extending
the bitop API and making page-flags.h use that new operation.
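
For context, page-flags.h is only a thin veneer over the bitop API; in
current trees it reads approximately:

	#define PageLRU(page)		test_bit(PG_lru, &(page)->flags)
	#define SetPageLRU(page)	set_bit(PG_lru, &(page)->flags)
	#define ClearPageLRU(page)	clear_bit(PG_lru, &(page)->flags)

so a new bitop primitive would slot in with a one-line change per flag.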

--
SUSE Labs, Novell Inc.


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Miklos Szeredi
Ahh, now I see; I had totally blocked out these few lines:

pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
	break;	/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I
think he agrees.

Shall we just kill those?
   
   I think we should.
   
   Although I'm a little afraid that Akpm will tell me again that I'm a
   stupid git, and that those lines are in fact vitally important ;)
   
  
  It depends what they're replaced with.
  
  That code is there, iirc, to prevent a process from getting stuck in
  balance_dirty_pages() forever due to the dirtying activity of other
  processes.
  
  hm, we ask the process to write write_chunk pages each go around the loop.
  So if it wrote write_chunk/2 pages on the first pass it might end up writing
  write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
  if it does happen - the upper bound is write_chunk*2-1, I think.
 
 Right, but I think the problem is that it's dirty -> writeback, not dirty
 -> writeback completed.
 
 Ie. they don't guarantee progress, it could be that the total
 nr_reclaimable + nr_writeback will steadily increase due to this break.
 
 How about ensuring that vm_writeout_total increases by at least
 2*sync_writeback_pages() during our stay in balance_dirty_pages()? That
 way we have the guarantee that more pages get written out than can be
 dirtied.

No, because that's a global counter, which many writers could be
looking at.

We'd need a per-task writeout counter, but when finishing the write we
don't know anymore which task it was performed for.

Miklos


Re: [patch 1/7] libata: check for AN support

2007-04-24 Thread Olivier Galibert
Sorry for replying to Alan's reply, I missed the original mail.

  +#define ata_id_has_AN(id)  \
  +	((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))

(a && ~a) & (b & 32)

I don't think that does what you think it does, because at that point
it's a funny way to write 0 ((0 or 1) binary-and (0 or 32)).

I'm not even sure what it is you want.  If for the first part you
wanted (id[76] != 0x00 && id[76] != 0xff), please write just that,
thanks :-)
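
Combining that with the second part, it would read something like
(keeping the constants as written above):

	#define ata_id_has_AN(id)	\
		((id[76] != 0x00) && (id[76] != 0xff) && ((id)[78] & (1 << 5)))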

  OG.


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 12:19 +0200, Miklos Szeredi wrote:
 Ahh, now I see; I had totally blocked out these few lines:
 
   pages_written += write_chunk - wbc.nr_to_write;
   if (pages_written >= write_chunk)
   	break;	/* We've done our duty */
 
 yeah, those look dubious indeed... And reading back Neil's comments, I
 think he agrees.
 
 Shall we just kill those?

I think we should.

Although I'm a little afraid that Akpm will tell me again that I'm a
stupid git, and that those lines are in fact vitally important ;)

   
   It depends what they're replaced with.
   
   That code is there, iirc, to prevent a process from getting stuck in
   balance_dirty_pages() forever due to the dirtying activity of other
   processes.
   
   hm, we ask the process to write write_chunk pages each go around the loop.
   So if it wrote write_chunk/2 pages on the first pass it might end up writing
   write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
   if it does happen - the upper bound is write_chunk*2-1, I think.
  
  Right, but I think the problem is that it's dirty -> writeback, not dirty
  -> writeback completed.
  
  Ie. they don't guarantee progress, it could be that the total
  nr_reclaimable + nr_writeback will steadily increase due to this break.
  
  How about ensuring that vm_writeout_total increases by at least
  2*sync_writeback_pages() during our stay in balance_dirty_pages()? That
  way we have the guarantee that more pages get written out than can be
  dirtied.
 
 No, because that's a global counter, which many writers could be
 looking at.
 
 We'd need a per-task writeout counter, but when finishing the write we
 don't know anymore which task it was performed for.

Yeah, just reached that conclusion myself too - again, I ran into that
when trying to figure out how to do the per task balancing right.



Re: [PATCH -mm 3/3] PM: Introduce suspend notifiers (rev. 2)

2007-04-24 Thread Andrew Morton
On Sun, 22 Apr 2007 20:48:08 +0200 Rafael J. Wysocki [EMAIL PROTECTED] 
wrote:

 Make it possible to register suspend notifiers so that subsystems can perform
 suspend-related operations that should not be carried out by device drivers'
 .suspend() and .resume() routines.

x86_64 allnoconfig:

arch/x86_64/kernel/e820.c: In function 'e820_mark_nosave_regions':
arch/x86_64/kernel/e820.c:279: warning: implicit declaration of function 
'register_nosave_region'
arch/x86_64/kernel/built-in.o: In function `e820_mark_nosave_regions':
: undefined reference to `register_nosave_region'
arch/x86_64/kernel/built-in.o: In function `e820_mark_nosave_regions':
: undefined reference to `register_nosave_region'


Re: [PATCH] kthread: Enhance kthread_stop to abort interruptible sleeps

2007-04-24 Thread Eric W. Biederman
Andrew Morton [EMAIL PROTECTED] writes:

 On Fri, 13 Apr 2007 21:13:13 -0600 [EMAIL PROTECTED] (Eric W. Biederman)
 wrote:

  This patch reworks kthread_stop so it is more flexible and it causes
  the target kthread to abort interruptible sleeps, allowing a larger
  class of kernel threads to use the kthread API.
 
 The changes start by defining TIF_KTHREAD_STOP on all architectures.
 TIF_KTHREAD_STOP is a per process flag that I can set from another
 process to indicate that a kernel thread should stop.
 
  wake_up_process in kthread_stop has been replaced by signal_wake_up,
  ensuring that the kernel thread, if sleeping, is woken up in a timely
  manner with TIF_SIGNAL_PENDING set, which causes us to break out
  of interruptible sleeps.
 
 recalc_signal_pending was modified to keep TIF_SIGNAL_PENDING set for
 as long as TIF_KTHREAD_STOP is set.
 
  Arbitrary paths to do_exit are now allowed.  I have placed a
  completion on the thread stack and pointed vfork_done at it; when
  mm_release is called from do_exit, the completion will be completed.
 Since the completion is stored on the stack it is important that
 kthread() now calls do_exit ensuring the stack frame that holds the
 completion is never released, and so that our exit_code is certain to
 make it unchanged all the way to do_exit.
 
  To allow kthread_stop to read the process exit code when exit_mm wakes
  it up, I have moved the setting of exit_code to the beginning of
  do_exit.

 This patch causes this oops: http://userweb.kernel.org/~akpm/s5000508.jpg
 with this config: http://userweb.kernel.org/~akpm/config-x.txt

Thanks.  If I am reading the oops properly, this happened during bootup and
vfork_done was set to NULL?

The NULL vfork_done is really weird as exec is the only thing that sets
vfork_done to NULL.

Either I've got a stupid bug in there somewhere or we have just found
the weirdest memory stomp.  I will take a look and see if I can reproduce
this shortly.

Eric

